Tuesday, July 2, 2013

Hey, hey, AMA! How many codes did you hoard today?

I do not know all of the sufficient conditions for bringing more transparency to health care costs in the United States, or for reconciling the gap between fee-for-service billing and outcome-based measures of cost and value.

But as someone keenly interested in controlled vocabularies, I can think of an obvious and necessary condition for giving tech, statisticians, and healthcare informatics folks the tools to help keep costs under control: free CPT and CDT. These codes are licensed regimes -- money makers for the American Medical Association and the American Dental Association. Americans: any healthcare service is billed to you or your insurance using these codes, and the fee-for-service system has a vested monopoly profit interest in keeping this stuff obscure.

I do not want to knock fee-for-service models; they are not the problem, though they can contribute (Medicare, for example, has good muscle in price controls on fee-for-service, but that does not check provision of unnecessary services not counseled by evidence-based medicine). We can never really have transparency and consumer accountability in the per-service, billed a-la-carte model when the medical billing codes for services provided are hidden behind the licensing regime of some respective monopoly.

This stuff might seem painfully obvious, and not copyrightable as mere fact associations (codes to labels), but the bar for "originality" in copyright has historically been set really low by the US Supreme Court (see Feist v. Rural). So if there is no litigative solution to free these codes from their guilded overlords, my reading says we are stuck with what we have until a political, legislative solution is crafted. Congress needs a plan to free medical coding from the monopolies of the AMA and ADA -- I have to think that there must be a bipartisan way to do this (without getting mired in Congress' other axe grinding over other healthcare policy matters). At minimum, if the public is paying for the bulk of the continued use of these regimes, the codes should be open-source or public domain.

Disclaimer: I work for an academic healthcare organization; my opinions are my own and do not reflect the opinions of anyone else or any organization with which I may be associated.

Friday, June 7, 2013

Quickly getting only ids (or paths) from a Zope2 ZCatalog search

Abstract: In Zope2, catalog searches return sequences of metadata objects called brains.  A simple benchmark shows that if you only need id or path to objects resulting from a search, using the paths BTree of the catalog is likely a superior-performing alternative to loading brains, without regard to whether the persisted metadata used by brains are in a warm cache or not.

A catalog brain is an object that has cached attributes, used to avoid traversing to an actual content object (e.g. in Plone), which is almost always more expensive.

I am writing a Plone add-on that needs to enumerate ids of folder contents.  I am using the catalog to avoid using the CMF listFolderContents() method, which traverses to each contained content object to check permissions.  Plone uses the catalog for its own navigation, querying for contents visible to a user by path and filtering with a keyword index called 'allowedRolesAndUsers'.

These result sequences are lazy. They are backed by integer record ids (RIDs), which are used to construct a catalog brain or to get the physical path to a content object just in time. A catalog brain is not a persistent object; rather, it is created as needed from a tuple of content metadata stored in a BTree of the catalog.
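To make the just-in-time construction concrete, here is a minimal sketch in plain Python with no Zope imports. ToyLazyMap, make_brain, and the sample data and schema dicts are hypothetical stand-ins for illustration, not the actual Products.ZCatalog classes; the only things borrowed from the real result objects are the attribute names _seq and _data that the benchmark script relies on.

```python
class ToyLazyMap(object):
    """Toy lazy result: holds RIDs and builds each item only on first access."""

    def __init__(self, func, seq):
        self._func = func      # called with a RID to build one result item
        self._seq = list(seq)  # integer record ids, in result order
        self._data = {}        # already-built items, keyed by position

    def __len__(self):
        return len(self._seq)

    def __getitem__(self, index):
        if index not in self._data:
            # An out-of-range index raises IndexError here, which also
            # ends a plain "for item in results" loop.
            self._data[index] = self._func(self._seq[index])
        return self._data[index]


# Hypothetical metadata table: RID -> tuple, with column positions in schema.
schema = {'getId': 0, 'Title': 1}
data = {101: ('front-page', 'Welcome'), 102: ('news', 'News')}
built = []  # records which RIDs actually had a "brain" constructed


def make_brain(rid):
    built.append(rid)
    record = data[rid]
    return dict((name, record[pos]) for name, pos in schema.items())


results = ToyLazyMap(make_brain, [102, 101])
```

Nothing is built when the sequence is created; an item is constructed the first time it is indexed and reused afterwards. The list of RIDs is available the whole time without building anything.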

If you need to, you can avoid constructing a brain altogether.  The LazyMap objects have a _seq attribute that gives you a list of RIDs.  I had a hunch that creating a complete metadata brain just to get an object's id might be overkill.  I compared two alternatives against the conventional approach:

(1) Get the raw data tuple for each object and read the identifier from it, without the overhead of brain object construction.

(2) Get the path to the object from the paths BTree stored by the catalog, split it and get the last element as the id/name of the object.

(3) Just do the conventional thing and enumerate the catalog brains from the result.
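The three options above can be sketched side by side against a toy catalog. This is plain Python under stated assumptions: ToyCatalog and ToyBrain are hypothetical stand-ins, and ordinary dicts take the place of the catalog's data and paths BTrees. Only the attribute names schema, data, and paths mirror what the benchmark script below reads from the real catalog.

```python
class ToyCatalog(object):
    """Toy catalog: dicts stand in for the BTrees of a real catalog."""

    def __init__(self):
        self.schema = {'getId': 0, 'Title': 1}  # metadata column positions
        self.data = {}    # rid -> metadata tuple
        self.paths = {}   # rid -> physical path string

    def add(self, rid, path, title):
        self.paths[rid] = path
        self.data[rid] = (path.split('/')[-1], title)


class ToyBrain(object):
    """Toy brain: copies every metadata column onto an attribute."""

    def __init__(self, catalog, rid):
        record = catalog.data[rid]
        for name, pos in catalog.schema.items():
            setattr(self, name, record[pos])


def ids_from_data(catalog, rids):
    # Option 1: index straight into the raw metadata tuple.
    idx = catalog.schema['getId']
    return [catalog.data[rid][idx] for rid in rids]


def ids_from_paths(catalog, rids):
    # Option 2: take the last path segment as the id.
    return [catalog.paths[rid].split('/')[-1] for rid in rids]


def ids_from_brains(catalog, rids):
    # Option 3: build a full brain, then read one attribute from it.
    return [ToyBrain(catalog, rid).getId for rid in rids]


catalog = ToyCatalog()
catalog.add(101, '/Plone/front-page', 'Welcome')
catalog.add(102, '/Plone/news', 'News')
rids = [102, 101]
```

All three functions return the same ids for the same RIDs. The toy says nothing about speed, since the real cost depends on which persistent objects have to be loaded from the ZODB.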

I wrote a simple benchmark to get some averages and ran it against a site where the test query returned 321 document/page objects at the time.  Option #2 was the winner in my tests: the mean time per object to get the id (or path) for each RID from portal_catalog._catalog.paths was 2.8 times faster than loading a brain, given a sufficient number of results.

  • Option #1 was superior only when all objects were guaranteed to be cached.  There is a fair chance of this, but option #2 seems lower-hanging fruit for a win over just getting catalog brains.
  • Using brains may result in simpler code and is more idiomatic.  You might need folders or results counted in the tens of thousands before this difference is measured in tenths of a second.
  • The query here does not filter on the allowedRolesAndUsers index, but the cost of the query itself is the same across all three options.
  • There is an untested assumption made based on previous experience that a catalog-based result is quicker than (for the single-folder case) listFolderContents(), but slower than contentIds().  If you don't care about permissions in the single-folder case, use contentIds().

Benchmark run-script and output below.

import time

from zope.component.hooks import getSite
from Products.ZCatalog.Lazy import LazyCat

SITENAME = 'Plone'
USER = 'admin'
NPASS = 10  # passes to average over; assumed value, the original is not shown

class CatalogHelper(object):
    def __init__(self):
        self.site = getSite()
        self.catalog = self.site.portal_catalog._catalog
    def result_ids_from_data(self, lazy):
        """
        Given a lazy sequence, get resulting local ids for
        respective results from data-tables.
        """
        #if isinstance(lazy, LazyCat):
        #    return []  # empty result
        # get rids from _seq (or in already populated brains, _data):
        rids = getattr(lazy, '_seq', None)
        #if rids is None:
        #    print 'No _seq !'
        #    rids = [b.getRID() for b in lazy._data]
        idx = self.catalog.schema['getId']
        _local_id = lambda rid: self.catalog.data[rid][idx]
        return map(_local_id, rids)
    def result_ids_from_path(self, lazy):
        catalog = self.catalog
        rids = getattr(lazy, '_seq', None)
        if rids is None:
            print 'No _seq !'
            rids = [b.getRID() for b in lazy._data]
        _local_id = lambda rid: catalog.paths[rid].split('/')[-1]
        return map(_local_id, rids)

def run_fetch_ids(query, title, fn, cold=False):
    site = getSite()
    app = site.__parent__
    catalog = site.portal_catalog
    results = []
    for i in xrange(NPASS):
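        if cold:
            # Assumption: the original cache-clearing line was lost; emptying
            # the ZODB connection cache is one way to force a cold read.
            app._p_jar.cacheMinimize()
        start = time.time()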
        result = catalog.searchResults(query)
        querytime = time.time() - start
        names = fn(result)
        datatime = (time.time() - start) - querytime
        results.append((querytime, datatime))
    # print result averages:
    avg = lambda s: sum(s)/float(len(s))
    aquery = avg(zip(*results)[0])
    adata = avg(zip(*results)[1])
    print 'Query, data fetch (%s) averages total %s seconds' % (title, aquery + adata)
    print '\tMean query time: %.6f, Mean data time: %.6f' % (aquery, adata)
    print '\tPer 1000: %.4f, %.4f' % (
        (aquery / len(result)) * 1000,
        (adata / len(result)) * 1000
        )
if __name__ == '__main__' and 'app' in locals():
    import sys
    import time
    from zope.component.hooks import setSite
    from AccessControl.SecurityManagement import newSecurityManager
    user = app.acl_users.getUser(USER)
    newSecurityManager(None, user)
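    site = app[SITENAME]
    setSite(site)  # getSite() in the helpers relies on the site hook
    q = {'portal_type': 'Document', 'sort_on': 'getObjPositionInParent'}
    helper = CatalogHelper()
    # from catalog paths BTree, cold:
    run_fetch_ids(
        q, 'from paths, uncached',
        helper.result_ids_from_path, cold=True)
    # from catalog paths BTree, warm cache:
    run_fetch_ids(
        q, 'from paths, warmed cache',
        helper.result_ids_from_path)
    # from raw metadata, cold:
    run_fetch_ids(
        q, 'from raw metadata, uncached',
        helper.result_ids_from_data, cold=True)
    # from raw metadata, warm cache:
    run_fetch_ids(
        q, 'from raw metadata, warmed cache',
        helper.result_ids_from_data)
    # brains, cold cache:
    run_fetch_ids(
        q, 'enumerated brains, uncached',
        lambda r: [b.getId for b in r], cold=True)
    # brains, warm cache:
    run_fetch_ids(
        q, 'enumerated brains, warmed cache',
        lambda r: [b.getId for b in r])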

Result over 321 results of a catalog query:

Query, data fetch (from paths, uncached) averages total 0.477873373032 seconds
 Mean query time: 0.477127, Mean data time: 0.000746
 Per 1000: 1.4864, 0.0023
Query, data fetch (from paths, warmed cache) averages total 0.071058177948 seconds
 Mean query time: 0.070358, Mean data time: 0.000701
 Per 1000: 0.2192, 0.0022
Query, data fetch (from raw metadata, uncached) averages total 0.967933130264 seconds
 Mean query time: 0.508597, Mean data time: 0.459336
 Per 1000: 1.5844, 1.4310
Query, data fetch (from raw metadata, warmed cache) averages total 0.0698463916779 seconds
 Mean query time: 0.069370, Mean data time: 0.000476
 Per 1000: 0.2161, 0.0015
Query, data fetch (enumerated brains, uncached) averages total 0.938056659698 seconds
 Mean query time: 0.432158, Mean data time: 0.505899
 Per 1000: 1.3463, 1.5760
Query, data fetch (enumerated brains, warmed cache) averages total 0.0716955184937 seconds
 Mean query time: 0.069604, Mean data time: 0.002091
 Per 1000: 0.2168, 0.0065
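
As a rough cross-check of the figures above, the following sketch redoes the arithmetic in plain Python. The timings are copied from the warmed-cache lines of this one run. The helper names are hypothetical, and the extrapolation assumes fetch time scales linearly with the number of results, which this benchmark did not test.

```python
RESULT_COUNT = 321

# Mean data-fetch times in seconds (warmed cache), copied from the output above.
WARM_DATA_TIME = {
    'paths': 0.000701,
    'raw metadata': 0.000476,
    'brains': 0.002091,
}


def per_result(seconds, count=RESULT_COUNT):
    """Mean fetch time for a single result."""
    return seconds / float(count)


def speedup(baseline, candidate):
    """How many times longer the baseline takes than the candidate."""
    return WARM_DATA_TIME[baseline] / WARM_DATA_TIME[candidate]


def results_needed(saving_seconds, baseline='brains', candidate='paths'):
    """Result count at which the candidate saves the given number of seconds,
    assuming linear scaling."""
    saved_each = (per_result(WARM_DATA_TIME[baseline]) -
                  per_result(WARM_DATA_TIME[candidate]))
    return saving_seconds / saved_each


paths_speedup = speedup('brains', 'paths')
break_even = results_needed(0.1)
```

On these numbers, enumerating brains takes roughly three times as long as the paths lookup with a warm cache, in the same ballpark as the 2.8 figure quoted earlier. Around 23,000 results would be needed before the saving reaches a tenth of a second, which is consistent with the "tens of thousands" estimate.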