Friday, June 7, 2013

Quickly getting only ids (or paths) from a Zope2 ZCatalog search

Abstract: In Zope2, catalog searches return sequences of metadata objects called brains. A simple benchmark suggests that if you only need the id or path of each object in a search result, reading the catalog's paths BTree is likely faster than loading brains, whether or not the persisted metadata used by brains is in a warm cache.

A catalog brain is an object with cached metadata attributes. It lets you avoid traversing to the actual content object (e.g. in Plone), which is almost always more expensive.

I am writing a Plone add-on that needs to enumerate ids of folder contents.  I am using the catalog to avoid using the CMF listFolderContents() method, which traverses to each contained content object to check permissions.  Plone uses the catalog for its own navigation, querying for contents visible to a user by path and filtering with a keyword index called 'allowedRolesAndUsers'.

Catalog search results are lazy sequences: objects backed by integer record ids (RIDs) that are used to construct a catalog brain, or to get the physical path to a content object, just in time. A catalog brain is not a persistent object; it is created as needed from a tuple of content metadata stored in a BTree of the catalog.
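As a rough illustration of that laziness, here is a minimal sketch of a lazy sequence that only builds an item when it is first accessed. LazyMapSketch, make_brain and the metadata dict are hypothetical stand-ins for illustration, not the actual Products.ZCatalog.Lazy code; only the _seq and _data attribute names mirror what the benchmark script in this post touches.

```python
class LazyMapSketch(object):
    """Simplified, hypothetical stand-in for a lazy result sequence."""

    def __init__(self, func, seq):
        self._func = func   # builds one item from a record id (RID)
        self._seq = seq     # RIDs matched by the query
        self._data = {}     # items built so far, keyed by position

    def __len__(self):
        return len(self._seq)

    def __getitem__(self, index):
        # Build the item only on first access, then reuse it.
        if index not in self._data:
            self._data[index] = self._func(self._seq[index])
        return self._data[index]


# Hypothetical metadata table: RID -> (id, title) tuple.
metadata = {101: ('front-page', 'Welcome'), 102: ('news', 'News')}
built = []  # records which RIDs were actually materialized


def make_brain(rid):
    built.append(rid)
    return {'getId': metadata[rid][0], 'Title': metadata[rid][1]}


results = LazyMapSketch(make_brain, [101, 102])
first = results[0]  # only RID 101 is turned into a "brain" here
```

The point of the sketch is that the RIDs are available up front in _seq, so anything that can be looked up by RID does not need a brain at all.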

If you have the need, you can avoid constructing a brain altogether. The LazyMap objects have a _seq attribute that gives you a list of RIDs. I had a hunch that creating a complete metadata brain just to get an object's id might be overkill. I compared three options, two alternatives plus the conventional approach as a baseline:

(1) Get the raw data tuple for each object and read the identifier from it, without the overhead of brain object construction.

(2) Get the path to the object from the paths BTree stored by the catalog, split it, and take the last element as the id/name of the object.

(3) Just do the conventional thing and enumerate the catalog brains from the result.
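The three options can be sketched side by side without a running Zope instance. Everything below is a hypothetical stand-in: schema, data and paths are plain dicts playing the role of the catalog's schema mapping and its data and paths BTrees, and BrainSketch is a toy class, not the real brain implementation.

```python
# Hypothetical stand-ins for catalog internals (plain dicts, not real BTrees).
schema = {'Title': 0, 'getId': 1}        # metadata column name -> tuple index
data = {7: ('Welcome', 'front-page'),    # RID -> metadata tuple
        9: ('News', 'news')}
paths = {7: '/Plone/front-page',         # RID -> physical path
         9: '/Plone/news'}
rids = [7, 9]                            # what a result's _seq would hold


class BrainSketch(object):
    """Toy brain: copies every metadata column onto an attribute."""

    def __init__(self, rid):
        for name, idx in schema.items():
            setattr(self, name, data[rid][idx])


def ids_from_data(rids):
    # Option 1: index straight into the raw metadata tuple.
    idx = schema['getId']
    return [data[rid][idx] for rid in rids]


def ids_from_paths(rids):
    # Option 2: the last path segment is the object's id.
    return [paths[rid].split('/')[-1] for rid in rids]


def ids_from_brains(rids):
    # Option 3: build a full brain per result, then read one attribute.
    return [BrainSketch(rid).getId for rid in rids]
```

All three return the same ids for the same RIDs; the difference the benchmark measures is how much has to be loaded and constructed along the way.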

I wrote a simple benchmark to get some averages and ran it against a site where the test query returned 321 document/page objects. Option #2 was the winner in my tests: the mean per-object time to get the id (or path) for each RID from portal_catalog._catalog.paths was 2.8 times faster than loading a brain, given a sufficient number of results.

Caveats:
  • Option #1 was superior only when all of the metadata was already cached. There is a fair chance of that, but option #2 seems like lower-hanging fruit for a win over just getting catalog brains.
  • Using brains may result in simpler code and is more idiomatic. You might need folders or results numbering in the tens of thousands before this difference is measured in tenths of a second.
  • The query here does not filter on the allowedRolesAndUsers index, but the cost of the query itself is the same across all three options.
  • There is an untested assumption, based on previous experience, that a catalog-based result is quicker than listFolderContents() (for the single-folder case) but slower than contentIds(). If you don't care about permissions in the single-folder case, use contentIds().

Benchmark run-script and output below.


import time  # used by run_fetch_ids() below

from zope.component.hooks import getSite
from Products.ZCatalog.Lazy import LazyCat

NPASS = 5
SITENAME = 'Plone'
USER = 'admin'


class CatalogHelper(object):
    
    def __init__(self):
        self.site = getSite()
        self.catalog = self.site.portal_catalog._catalog
    
    def result_ids_from_data(self, lazy):
        """
        Given a lazy sequence, get resulting local ids for
        respective results from the catalog's raw metadata tuples,
        without constructing brains.
        """
        # get rids from _seq (or in already populated brains, _data):
        rids = getattr(lazy, '_seq', None)
        if rids is None:
            print 'No _seq !'
            rids = [b.getRID() for b in lazy._data]
        # position of the getId column within each metadata tuple:
        idx = self.catalog.schema['getId']
        _local_id = lambda rid: self.catalog.data[rid][idx]
        return map(_local_id, rids)
    
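    def result_ids_from_path(self, lazy):
        """
        Given a lazy sequence, get resulting local ids by taking the
        last segment of each path stored in the catalog's paths BTree.
        """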
        catalog = self.catalog
        rids = getattr(lazy, '_seq', None)
        if rids is None:
            print 'No _seq !'
            rids = [b.getRID() for b in lazy._data]
        _local_id = lambda rid: catalog.paths[rid].split('/')[-1]
        return map(_local_id, rids)


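def run_fetch_ids(query, title, fn, cold=False):
    """
    Run the query NPASS times, timing the search and the id fetch (fn)
    separately, then print the averages.  If cold is true, minimize the
    ZODB connection cache before each pass.
    """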
    site = getSite()
    app = site.__parent__
    catalog = site.portal_catalog
    results = []
    for i in xrange(NPASS):
        if cold:
            app._p_jar.cacheMinimize()
            app._p_jar.sync()
        start = time.time()
        result = catalog.searchResults(query)
        querytime = time.time() - start
        names = fn(result)
        datatime = (time.time() - start) - querytime
        results.append((querytime, datatime))
    # print result averages:
    avg = lambda s: sum(s)/float(len(s))
    aquery = avg(zip(*results)[0])
    adata = avg(zip(*results)[1])
    print 'Query, data fetch (%s) averages total %s seconds' % (title, aquery + adata)
    print '\tMean query time: %.6f, Mean data time: %.6f' % (aquery, adata)
    print '\tPer 1000: %.4f, %.4f' % (
        (aquery / len(result)) * 1000,
        (adata / len(result)) * 1000
        )


if __name__ == '__main__' and 'app' in locals():
    import sys
    import time
    from zope.component.hooks import setSite
    from AccessControl.SecurityManagement import newSecurityManager
    user = app.acl_users.getUser(USER)
    newSecurityManager(None, user)
    app._p_jar.cacheMinimize()
    site = app[SITENAME]
    setSite(site)
    q = {'portal_type': 'Document', 'sort_on': 'getObjPositionInParent'}
    helper = CatalogHelper()
    # from catalog paths BTree, cold:
    run_fetch_ids(
        q,
        'from paths, uncached',
        helper.result_ids_from_path,
        True
        )
    # from catalog paths BTree, warm cache:
    run_fetch_ids(
        q,
        'from paths, warmed cache',
        helper.result_ids_from_path,
        False
        )
    app._p_jar.cacheMinimize()
    # from raw metadata, cold:
    run_fetch_ids(
        q,
        'from raw metadata, uncached',
        helper.result_ids_from_data,
        True
        )
    # helper, warm cache:
    run_fetch_ids(
        q,
        'from raw metadata, warmed cache',
        helper.result_ids_from_data,
        False
        )
    app._p_jar.cacheMinimize()
    # brains, cold cache:
    run_fetch_ids(
        q,
        'enumerated brains, uncached',
        lambda r: [b.getId for b in r],
        True
        )
    # brains, warm cache:
    run_fetch_ids(
        q,
        'enumerated brains, warmed cache',
        lambda r: [b.getId for b in r],
        False
        )

Output for a catalog query returning 321 results:

Query, data fetch (from paths, uncached) averages total 0.477873373032 seconds
 Mean query time: 0.477127, Mean data time: 0.000746
 Per 1000: 1.4864, 0.0023
Query, data fetch (from paths, warmed cache) averages total 0.071058177948 seconds
 Mean query time: 0.070358, Mean data time: 0.000701
 Per 1000: 0.2192, 0.0022
Query, data fetch (from raw metadata, uncached) averages total 0.967933130264 seconds
 Mean query time: 0.508597, Mean data time: 0.459336
 Per 1000: 1.5844, 1.4310
Query, data fetch (from raw metadata, warmed cache) averages total 0.0698463916779 seconds
 Mean query time: 0.069370, Mean data time: 0.000476
 Per 1000: 0.2161, 0.0015
Query, data fetch (enumerated brains, uncached) averages total 0.938056659698 seconds
 Mean query time: 0.432158, Mean data time: 0.505899
 Per 1000: 1.3463, 1.5760
Query, data fetch (enumerated brains, warmed cache) averages total 0.0716955184937 seconds
 Mean query time: 0.069604, Mean data time: 0.002091
 Per 1000: 0.2168, 0.0065
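
To make the comparison between rows easier to read, here is a small sketch that computes data-fetch ratios from the mean data times printed above. The numbers are copied from that output; the data_time dict and the ratio() helper are names introduced here for illustration only, and the resulting ratios depend on which rows you choose to compare.

```python
# Mean data-fetch times in seconds, copied from the output above (321 results).
data_time = {
    ('paths', 'cold'): 0.000746,
    ('paths', 'warm'): 0.000701,
    ('raw metadata', 'cold'): 0.459336,
    ('raw metadata', 'warm'): 0.000476,
    ('brains', 'cold'): 0.505899,
    ('brains', 'warm'): 0.002091,
}


def ratio(method, cache, baseline='paths'):
    """Data time of `method` divided by that of `baseline`, same cache state."""
    return data_time[(method, cache)] / data_time[(baseline, cache)]


warm_brains_vs_paths = ratio('brains', 'warm')     # roughly 3x
cold_brains_vs_paths = ratio('brains', 'cold')     # several hundred times
warm_raw_vs_paths = ratio('raw metadata', 'warm')  # below 1: raw tuples win when warm
```

With a warm cache, brains take roughly three times as long as the paths lookup, and with a cold cache the gap is far larger. Raw metadata tuples only beat the paths lookup when the cache is warm.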