Sunday, February 22, 2015

Superfish, Snooping, Snowden: threads in the same rope.

Note: I've been meaning to get back to blogging; my intent was to do that by blogging about code-review tricks for client acceptance testing of projects written in dynamic languages.  But instead, it is InfoSec politics that brings me back (disillusioning, but not surprising).

The Edward Snowden documentary, Citizenfour, just won an Oscar; the Superfish snafu at Lenovo hit the fan; all within a week.  The juxtaposition is more than coincidence: both contexts reflect an unspoken asymmetry of power that threatens the autonomy of people to truly self-govern.

There are sensible common threads between Snowden's motives in disclosing American and British (NSA/GCHQ) security overreach and the recent controversy over the spyware known as "Superfish".  These threads weave with others into the hangman's rope around the neck of our democracy.

Superfish, as used by Lenovo, installed a devilish man-in-the-middle (MITM) attack vector on many thousands of PCs at the factory.  This eliminated any real encryption safety for unsuspecting victims doing normal everyday things like email and online banking.  Superfish means that a Lenovo PC user on any shared network could have even a lame script kiddie, sitting nearby at Starbucks, stealing their bank credentials.  That wasn't by design, but the people who created this are not exactly NSA-stealthy™ (to be fair, there are US-based firewall vendors who also promote TLS MITM features hijacking PCs inside an organization -- slimy, but usually legal).

Superfish was originally designed to help multi-national corporations spy on you, the consumer.  Both Lenovo and the Superfish authors even tried a line suggesting, euphemistically, that "getting spied on is good for you" (before Lenovo back-pedaled, after the US Department of Homeland Security warned Lenovo users about this).

Snowden's critique of the US/UK security apparatus's great overreach and the InfoSec community's reaction to Superfish are similar in this way: we do, and should, react with deep suspicion when large and powerful organizations spy on us without any accountability, hiding behind paternalistic defenses.

When these organizations are corporate, we can use the market to show our disgust.  If the power asymmetry in the market prevents effective resolution, we often have tort law and statutory protections to fall back on.  This is grease for a working free market, even one where the vast majority of consumers are ill-informed and out-gunned when it comes to their own information security and safety online.  And in a civic sense, we should have similar (or better) checks and balances for institutions of government spying on their own citizens (directly, indirectly, or in some fuzzy gray area).  The problem is that there has been no effective oversight of the need, constitutionality, or effectiveness of many of the NSA programs.

And we have a FISA court unilaterally appointed by John Roberts -- not a ringing vote of confidence that the FISA courts alone are sufficient oversight.  Perhaps such worry about oversight would be academic, but in "real life" we have seen abuse: the DOJ invoking the state-secrets doctrine in civil cases where it is clearly unwarranted; journalists (like Laura Poitras) harassed for years in the course of their Constitutionally protected public service.

Some of what Snowden did could be questioned in terms of necessity or effectiveness.  But we needed Snowden.  Our political machine produced no actions or discourse of substance after the Binney or Klein revelations.  The reason we needed Snowden's "punch to the gut" was precisely that absence of accountability to the public.  We also understand there are correlations between spying and censorship, both by observing nation-states elsewhere engaged in both, and by reflecting on our own sense of freedom.

The British people needed Snowden to head off the recurring specter of the Snooper's Charter (and we Americans should be wary of how cozy Obama has been with Mr. Cameron on this specific matter after the Charlie Hebdo attacks, keeping James Comey's attacks on smartphone encryption in mind).  Let's not let ignorant politicians outlaw effective encryption (as if that could really work, hah!); we need more (and, long story, more effective) encryption -- not less.

We already have the context of Citizens United, with a billion dollars of dirty money from powerful interests buying pieces of the 2016 election -- we should sense by now that powerful interests have eroded some semblance of our democracy.  Do we let the MNCs and the APTs/TLAs (state spy agencies) spy on us with impunity, without oversight, recourse, or inclusive discussion?

Are we resigned in all of these contexts to letting the powerful treat democracy like cattle for slaughter?  If so, we might as well throw in the towel on democracy and embrace the notion that We the People can no longer be the "popular" in popular sovereignty: we are, online and IRL, totally freakin' 0wn3d.

Tuesday, July 2, 2013

Hey, hey, AMA! How many codes did you hoard today?

I do not know all the sufficient conditions for bringing more transparency to health care costs in the United States, or for reconciling the gap between fee-for-service billing and outcome-based cost/value accounting.

But as someone keenly interested in controlled vocabularies, I can think of an obvious and necessary condition for giving tech, statistics, and healthcare-informatics folks the tools to help keep costs under control: free CPT and CDT.  These code sets are licensing regimes -- money makers for the American Medical Association and the American Dental Association.  Americans: any healthcare service is billed to you or your insurance using these codes, and the fee-for-service system has a vested monopoly-profit interest in keeping this stuff obscure.

I do not want to knock fee-for-service models; they are not the problem, though they can contribute to it (Medicare, for example, has good muscle in price controls on fee-for-service, but that does not check the provision of unnecessary services not counseled by evidence-based medicine).  We cannot ever really reconcile transparency and consumer accountability with the per-service, billed-a-la-carte model when the medical billing codes for services provided are hidden behind the licensing regime of a respective monopoly.

This stuff might be painfully obvious -- and might seem not copyrightable as mere fact associations (codes to labels) -- but the bar for "originality" in copyright has historically been set really low by the US Supreme Court (see Feist v. Rural).  So if there is no litigative solution to free these codes from their guild overlords, my reading says we are stuck with what we have until a political, legislative solution is crafted.  Congress needs a plan to free medical coding from the monopolies of the AMA and ADA, and I have to think there must be a bipartisan way to do this without getting mired in Congress's other axe-grinding over healthcare policy.  At minimum, if the public is paying for the bulk of the continued use of these regimes, they should be open-source or public domain.

Disclaimer: I work for an academic healthcare organization; my opinions are my own and do not reflect the opinions of anyone else or any organization with which I may be associated.

Friday, June 7, 2013

Quickly getting only ids (or paths) from a Zope2 ZCatalog search

Abstract: In Zope2, catalog searches return sequences of metadata objects called brains.  A simple benchmark shows that if you only need the id or path of objects resulting from a search, using the paths BTree of the catalog is likely a better-performing alternative to loading brains, regardless of whether the persisted metadata backing the brains is in a warm cache.

A catalog brain is an object with cached attributes, used to avoid traversing to an actual content object (e.g. in Plone), which is almost always more expensive.

I am writing a Plone add-on that needs to enumerate ids of folder contents.  I am using the catalog to avoid using the CMF listFolderContents() method, which traverses to each contained content object to check permissions.  Plone uses the catalog for its own navigation, querying for contents visible to a user by path and filtering with a keyword index called 'allowedRolesAndUsers'.
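
For context, such a query might look like the sketch below.  This is a rough illustration only: site and folder are assumed bindings, and note that Plone's portal_catalog merges an allowedRolesAndUsers filter for the current user into searchResults() automatically, so a plain path query is already security-filtered.

from Products.CMFCore.utils import getToolByName

catalog = getToolByName(site, 'portal_catalog')
path = '/'.join(folder.getPhysicalPath())
# immediate contents of one folder; the catalog tool adds the
# allowedRolesAndUsers security filter for the current user itself:
results = catalog.searchResults({'path': {'query': path, 'depth': 1}})
ids = [brain.getId for brain in results]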

Catalog result sequences are lazy: they are objects backed by integer record ids (RIDs) that are used to construct a catalog brain, or to get a physical path to a content object, just in time.  A catalog brain is not a persistent object; rather, it is created as needed from a tuple of content metadata stored in a BTree of the catalog.

If you have the need, you can avoid constructing a brain altogether.  LazyMap objects have a _seq attribute that gives you a list of RIDs.  I had a hunch that creating a complete metadata brain just to get an object's id might be overkill.  There are two alternatives I explored, plus the conventional baseline (sketched in miniature after the list):

(1) Get the raw data tuple for each object and pull the identifier from it, without the overhead of brain construction;

(2) Get the path to the object from the paths BTree stored by the catalog, split it, and take the last element as the id/name of the object;

(3) Just do the conventional thing and enumerate the catalog brains from the result.
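
In miniature, the three approaches look something like the sketch below (hedged: _seq, paths, data, and schema are private ZCatalog internals that could change between releases; site is assumed to be the Plone site):

result = site.portal_catalog.searchResults({'portal_type': 'Document'})
catalog = site.portal_catalog._catalog  # the inner ZCatalog
rids = result._seq                      # integer record ids (RIDs)

# Option #1: read the raw metadata tuple, no brain construction:
idx = catalog.schema['getId']           # column offset for 'getId'
ids = [catalog.data[rid][idx] for rid in rids]

# Option #2: take the last segment of the indexed physical path:
ids = [catalog.paths[rid].split('/')[-1] for rid in rids]

# Option #3 (conventional): enumerate brains from the lazy result:
ids = [brain.getId for brain in result]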

I wrote a simple benchmark to get some averages, and ran it against a site where the test query returned 321 document/page objects.  Option #2 was the winner in my tests: the mean per-object time to get the id (or path) for each RID from portal_catalog._catalog.paths was 2.8 times faster than loading a brain, given a sufficient number of results.

Caveats:
  • Option #1 was superior only when all metadata was guaranteed to be in a warm cache.  There is a fair chance of that, but option #2 seems the lower-hanging fruit for a win over just getting catalog brains.
  • Using brains may result in simpler code and is more idiomatic.  You might need folders or results counted in the tens of thousands before the difference is measurable in tenths of seconds.
  • The query here is not filtering on the allowedRolesAndUsers index, but the cost of the query itself is constant across all three options.
  • There is an untested assumption, based on previous experience, that a catalog-based result is quicker (in the single-folder case) than listFolderContents(), but slower than contentIds().  If you don't care about permissions in the single-folder case, use contentIds() -- see the snippet below.
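
For the single-folder case, that distinction looks like this (a minimal sketch; folder is assumed to be a CMF/Plone folder object, and the relative costs are the untested assumption from the caveat above):

# plain CMF folder API, no catalog involved:
ids = folder.contentIds()              # ids only, no permission filtering
visible = folder.listFolderContents()  # wakes each contained object and
                                       # filters on the user's permissions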

Benchmark run-script and output below.


import time

from zope.component.hooks import getSite
from Products.ZCatalog.Lazy import LazyCat

NPASS = 5
SITENAME = 'Plone'
USER = 'admin'


class CatalogHelper(object):
    
    def __init__(self):
        self.site = getSite()
        self.catalog = self.site.portal_catalog._catalog
    
    def result_ids_from_data(self, lazy):
        """
        Given a lazy sequence, get local ids for respective results
        from the catalog's raw metadata tables (no brains).
        """
        if isinstance(lazy, LazyCat):
            return []  # empty result, no RIDs to resolve
        # get rids from _seq (or, in already-populated brains, _data):
        rids = getattr(lazy, '_seq', None)
        if rids is None:
            rids = [b.getRID() for b in lazy._data]
        # schema maps metadata column name to tuple offset:
        idx = self.catalog.schema['getId']
        _local_id = lambda rid: self.catalog.data[rid][idx]
        return map(_local_id, rids)
    
    def result_ids_from_path(self, lazy):
        """
        Given a lazy sequence, get local ids from the catalog's
        paths BTree (last path segment), avoiding brains entirely.
        """
        catalog = self.catalog
        if isinstance(lazy, LazyCat):
            return []  # empty result, no RIDs to resolve
        rids = getattr(lazy, '_seq', None)
        if rids is None:
            rids = [b.getRID() for b in lazy._data]
        _local_id = lambda rid: catalog.paths[rid].split('/')[-1]
        return map(_local_id, rids)


def run_fetch_ids(query, title, fn, cold=False):
    """Run the query NPASS times, timing query and id-fetch separately."""
    site = getSite()
    app = site.__parent__
    catalog = site.portal_catalog
    results = []
    for i in xrange(NPASS):
        if cold:
            # flush the ZODB object cache so each pass starts cold:
            app._p_jar.cacheMinimize()
            app._p_jar.sync()
        start = time.time()
        result = catalog.searchResults(query)
        querytime = time.time() - start
        fn(result)  # fetch the ids; values discarded, we only want timing
        datatime = (time.time() - start) - querytime
        results.append((querytime, datatime))
    # print result averages:
    avg = lambda s: sum(s) / float(len(s))
    aquery = avg(zip(*results)[0])
    adata = avg(zip(*results)[1])
    print 'Query, data fetch (%s) averages total %s seconds' % (title, aquery + adata)
    print '\tMean query time: %.6f, Mean data time: %.6f' % (aquery, adata)
    print '\tPer 1000: %.4f, %.4f' % (
        (aquery / len(result)) * 1000,
        (adata / len(result)) * 1000
        )


if __name__ == '__main__' and 'app' in locals():
    # run via 'bin/instance run' so that 'app' (the Zope root) is bound
    from zope.component.hooks import setSite
    from AccessControl.SecurityManagement import newSecurityManager
    user = app.acl_users.getUser(USER)
    newSecurityManager(None, user)
    app._p_jar.cacheMinimize()
    site = app[SITENAME]
    setSite(site)
    q = {'portal_type': 'Document', 'sort_on': 'getObjPositionInParent'}
    helper = CatalogHelper()
    # from catalog paths BTree, cold:
    run_fetch_ids(
        q,
        'from paths, uncached',
        helper.result_ids_from_path,
        True
        )
    # from catalog paths BTree, warm cache:
    run_fetch_ids(
        q,
        'from paths, warmed cache',
        helper.result_ids_from_path,
        False
        )
    app._p_jar.cacheMinimize()
    # from raw metadata, cold:
    run_fetch_ids(
        q,
        'from raw metadata, uncached',
        helper.result_ids_from_data,
        True
        )
    # helper, warm cache:
    run_fetch_ids(
        q,
        'from raw metadata, warmed cache',
        helper.result_ids_from_data,
        False
        )
    app._p_jar.cacheMinimize()
    # brains, cold cache:
    run_fetch_ids(
        q,
        'enumerated brains, uncached',
        lambda r: [b.getId for b in r],
        True
        )
    # brains, warm cache:
    run_fetch_ids(
        q,
        'enumerated brains, warmed cache',
        lambda r: [b.getId for b in r],
        False
        )

Output for a catalog query returning 321 results:

Query, data fetch (from paths, uncached) averages total 0.477873373032 seconds
 Mean query time: 0.477127, Mean data time: 0.000746
 Per 1000: 1.4864, 0.0023
Query, data fetch (from paths, warmed cache) averages total 0.071058177948 seconds
 Mean query time: 0.070358, Mean data time: 0.000701
 Per 1000: 0.2192, 0.0022
Query, data fetch (from raw metadata, uncached) averages total 0.967933130264 seconds
 Mean query time: 0.508597, Mean data time: 0.459336
 Per 1000: 1.5844, 1.4310
Query, data fetch (from raw metadata, warmed cache) averages total 0.0698463916779 seconds
 Mean query time: 0.069370, Mean data time: 0.000476
 Per 1000: 0.2161, 0.0015
Query, data fetch (enumerated brains, uncached) averages total 0.938056659698 seconds
 Mean query time: 0.432158, Mean data time: 0.505899
 Per 1000: 1.3463, 1.5760
Query, data fetch (enumerated brains, warmed cache) averages total 0.0716955184937 seconds
 Mean query time: 0.069604, Mean data time: 0.002091
 Per 1000: 0.2168, 0.0065