A catalog brain is an object with cached attributes, used to avoid traversing to (waking up) an actual content object (e.g. in Plone) -- which is almost always a more expensive operation.
I am writing a Plone add-on that needs to enumerate the ids of a folder's contents. I am using the catalog to avoid the CMF listFolderContents() method, which traverses to each contained content object to check permissions. Plone itself uses the catalog for navigation, querying for contents visible to a user by path and filtering with a keyword index called 'allowedRolesAndUsers'.
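Conceptually, a keyword index filter like 'allowedRolesAndUsers' can be thought of as a set intersection over record ids. The following is a minimal pure-Python sketch of that idea only -- it is not the actual ZCatalog index implementation, and all the names in it are hypothetical:

```python
# Sketch of keyword-index filtering as set intersection (illustration
# only; the real ZCatalog indexes use BTrees, not plain dicts/sets).

# Hypothetical index: security token -> set of record ids (RIDs)
allowed_index = {
    'Anonymous': {1, 2},
    'user:alice': {3},
    'Manager': {1, 2, 3, 4},
}

def filter_allowed(rids, tokens):
    """Keep only RIDs indexed under at least one of the given tokens."""
    visible = set()
    for token in tokens:
        visible |= allowed_index.get(token, set())
    return rids & visible

path_result = {2, 3, 4}  # RIDs matching a hypothetical path query
print(sorted(filter_allowed(path_result, ['Anonymous'])))  # [2]
```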
These result sequences are lazy: they are backed by integer record ids (RIDs), which are used to construct a catalog brain, or to fetch the physical path of a content object, just-in-time. A catalog brain is not a persistent object; rather, it is created on demand from a tuple of content metadata stored in a BTree in the catalog.
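The lazy behavior can be illustrated with a minimal pure-Python sketch. This is not the real Products.ZCatalog.Lazy implementation -- the class, data, and the `built` counter below are all made up for illustration -- but it shows the key property: brains are constructed from metadata tuples only when an item is actually accessed.

```python
# Sketch of a LazyMap-like sequence (illustration only; the real class
# lives in Products.ZCatalog.Lazy).  Brains are built on demand from
# metadata tuples keyed by integer record id (RID).

metadata = {  # hypothetical catalog data: RID -> metadata tuple
    11: ('front-page', 'Welcome'),
    12: ('news', 'News'),
}
schema = {'getId': 0, 'Title': 1}  # column name -> tuple position

class Brain(object):
    def __init__(self, record):
        self.getId = record[schema['getId']]
        self.Title = record[schema['Title']]

class LazyMap(object):
    """Sequence of RIDs; each access constructs a brain just-in-time."""
    def __init__(self, rids):
        self._seq = rids   # raw RIDs; no brains constructed yet
        self.built = 0     # counts constructions (for illustration)
    def __len__(self):
        return len(self._seq)
    def __getitem__(self, i):
        self.built += 1
        return Brain(metadata[self._seq[i]])

lazy = LazyMap([11, 12])
print(len(lazy), lazy.built)      # length known with zero brains built
print(lazy[0].getId, lazy.built)  # first brain built only on access
```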
If you have the need, you can avoid constructing a brain altogether. The LazyMap objects have a _seq attribute that gives you a list of RIDs. I had a hunch that constructing a complete metadata brain just to get the catalog's id might be overkill. There were two alternatives I explored, plus the conventional baseline:
(1) Get the raw data tuple for each object and pull the identifier out of it directly, without the overhead of brain object construction,
(2) Get the path to the object from the paths BTree stored by the catalog, split it and get the last element as the id/name of the object.
(3) Just do the conventional thing and enumerate the catalog brains from the result.
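Options #1 and #2 can be shown in miniature with toy stand-ins for the catalog internals (the dicts below are hypothetical; the real attributes used in the benchmark script are `_catalog.data`, `_catalog.schema`, and `_catalog.paths`, which are BTrees, not plain dicts). Option #3 is simply the usual `[b.getId for b in result]` over brains, as in the script below.

```python
# Toy stand-ins for the catalog internals (illustration only):
data = {11: ('front-page',), 12: ('news',)}   # RID -> metadata tuple
schema = {'getId': 0}                         # column -> tuple position
paths = {11: '/Plone/front-page', 12: '/Plone/news'}  # RID -> path
rids = [11, 12]  # what a lazy result's _seq would hold

# Option 1: read the id straight from the raw metadata tuple
idx = schema['getId']
ids_from_data = [data[rid][idx] for rid in rids]

# Option 2: take the last segment of the physical path
ids_from_path = [paths[rid].split('/')[-1] for rid in rids]

print(ids_from_data == ids_from_path)  # True: both yield the local ids
```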
I wrote a simple benchmark to get some averages, and ran it against a site whose query returned, at the time, 321 document/page objects. Option #2 was the winner in my tests: the mean per-object time to get the id (or path) for each RID from portal_catalog._catalog.paths was 2.8 times faster than loading a brain, given a sufficient number of results.
Caveats:
- Option #1 was superior only when all objects were already cached. There is a fair chance of that in practice, but option #2 seems the lower-hanging fruit for a win over just enumerating catalog brains.
- Using brains may result in simpler, more idiomatic code. You would likely need folders or result sets counted in the tens of thousands before this difference is measured in tenths of seconds.
- The query here is not filtering on allowedRolesAndUsers index, but the actual cost of the query itself is constant across all three options.
- There is an untested assumption, based on previous experience, that a catalog-based result is quicker (for the single-folder case) than listFolderContents(), but slower than contentIds(). If you don't care about permissions in the single-folder case, just use contentIds().
Benchmark run-script and output below.
from zope.component.hooks import getSite
from Products.ZCatalog.Lazy import LazyCat

NPASS = 5
SITENAME = 'Plone'
USER = 'admin'


class CatalogHelper(object):

    def __init__(self):
        self.site = getSite()
        self.catalog = self.site.portal_catalog._catalog

    def result_ids_from_data(self, lazy):
        """
        Given a lazy sequence, get resulting local ids for respective
        results from data-tables.
        """
        if isinstance(lazy, LazyCat):
            return []  # empty result
        # get rids from _seq (or in already populated brains, _data):
        rids = getattr(lazy, '_seq', None)
        if rids is None:
            rids = [b.getRID() for b in lazy._data]
        idx = self.catalog.schema['getId']
        _local_id = lambda rid: self.catalog.data[rid][idx]
        return map(_local_id, rids)

    def result_ids_from_path(self, lazy):
        catalog = self.catalog
        rids = getattr(lazy, '_seq', None)
        if rids is None:
            rids = [b.getRID() for b in lazy._data]
        _local_id = lambda rid: catalog.paths[rid].split('/')[-1]
        return map(_local_id, rids)


def run_fetch_ids(query, title, fn, cold=False):
    site = getSite()
    app = site.__parent__
    catalog = site.portal_catalog
    results = []
    for i in xrange(NPASS):
        if cold:
            app._p_jar.cacheMinimize()
            app._p_jar.sync()
        start = time.time()
        result = catalog.searchResults(query)
        querytime = time.time() - start
        names = fn(result)  # force the id fetch
        datatime = (time.time() - start) - querytime
        results.append((querytime, datatime))
    # print result averages:
    avg = lambda s: sum(s) / float(len(s))
    aquery = avg(zip(*results)[0])
    adata = avg(zip(*results)[1])
    print 'Query, data fetch (%s) averages total %s seconds' % (
        title, aquery + adata)
    print '\tMean query time: %.6f, Mean data time: %.6f' % (aquery, adata)
    print '\tPer 1000: %.4f, %.4f' % (
        (aquery / len(result)) * 1000,
        (adata / len(result)) * 1000,
        )


if __name__ == '__main__' and 'app' in locals():
    import sys
    import time
    from zope.component.hooks import setSite
    from AccessControl.SecurityManagement import newSecurityManager
    user = app.acl_users.getUser(USER)
    newSecurityManager(None, user)
    app._p_jar.cacheMinimize()
    site = app[SITENAME]
    setSite(site)
    q = {'portal_type': 'Document', 'sort_on': 'getObjPositionInParent'}
    helper = CatalogHelper()
    # from catalog paths BTree, cold:
    run_fetch_ids(q, 'from paths, uncached',
                  helper.result_ids_from_path, True)
    # from catalog paths BTree, warm cache:
    run_fetch_ids(q, 'from paths, warmed cache',
                  helper.result_ids_from_path, False)
    app._p_jar.cacheMinimize()
    # from raw metadata, cold:
    run_fetch_ids(q, 'from raw metadata, uncached',
                  helper.result_ids_from_data, True)
    # from raw metadata, warm cache:
    run_fetch_ids(q, 'from raw metadata, warmed cache',
                  helper.result_ids_from_data, False)
    app._p_jar.cacheMinimize()
    # brains, cold cache:
    run_fetch_ids(q, 'enumerated brains, uncached',
                  lambda r: [b.getId for b in r], True)
    # brains, warm cache:
    run_fetch_ids(q, 'enumerated brains, warmed cache',
                  lambda r: [b.getId for b in r], False)
Results, over 321 objects returned by the catalog query:
Query, data fetch (from paths, uncached) averages total 0.477873373032 seconds
    Mean query time: 0.477127, Mean data time: 0.000746
    Per 1000: 1.4864, 0.0023
Query, data fetch (from paths, warmed cache) averages total 0.071058177948 seconds
    Mean query time: 0.070358, Mean data time: 0.000701
    Per 1000: 0.2192, 0.0022
Query, data fetch (from raw metadata, uncached) averages total 0.967933130264 seconds
    Mean query time: 0.508597, Mean data time: 0.459336
    Per 1000: 1.5844, 1.4310
Query, data fetch (from raw metadata, warmed cache) averages total 0.0698463916779 seconds
    Mean query time: 0.069370, Mean data time: 0.000476
    Per 1000: 0.2161, 0.0015
Query, data fetch (enumerated brains, uncached) averages total 0.938056659698 seconds
    Mean query time: 0.432158, Mean data time: 0.505899
    Per 1000: 1.3463, 1.5760
Query, data fetch (enumerated brains, warmed cache) averages total 0.0716955184937 seconds
    Mean query time: 0.069604, Mean data time: 0.002091
    Per 1000: 0.2168, 0.0065