co3.indexer module¶
- class co3.indexer.CacheBlock(indexer, table, cols=None, where=None, distinct_on=None, order_by=None, limit=0, group_by=None, agg_on=None, index_on=None)[source]¶
Bases: object
Wraps up a set of query parameters for a specific entity, and provides cached access to different types of “re-queries” via an associated Indexer.
Additional details
The goal here is to help build/define entities as the possibly complex transformations on the base schema that they are. For example, the Note primitive (entity) incorporates details across the files, notes, note_conversions, and note_conversion_matter tables (defined in a single endpoint by a Composer), often needs to be selected in particular ways (via an Accessor), and has its results stored for fast access later on (handled by an Indexer). This pipeline can be daunting and requires too many moving parts to be handled explicitly everywhere. CacheBlocks wrap up a set of query “preferences,” exposing a simpler interface for downstream access to entities. They still allow low-level control over re-grouping/indexing, raw hits to the actual DB, etc., but keep things tighter and well-behaved for the Indexer.
You can think of these as the Indexer’s “fingers”: they’re deployable mini-Indexes that “send back” results to the class cache, which is “broadcast” to all other instances for use when necessary.
Example usage
```python
cb = CacheBlock()

# Set up cached queries with chained params or via call:
cb.where(t.notes.c.name == "name").group_by(t.note_conversions.c.format)
cb()  # get results

# - OR -
# (use strings when known)
cb.where(t.notes.c.name == "name").group_by('format')
cb()  # get results

# - OR -
# (use kwargs in the call; results returned right away)
cb(
    where=(t.notes.c.name == "name"),
    group_by='format',
)
```
- class co3.indexer.Indexer(accessor, cache_select=True, cache_groupby=True)[source]¶
Bases: object
Indexer class
Provides restricted access to an underlying Accessor to enable more efficient, superficial caching.
Cache clearing is to be handled by a wrapper class, like the Database.
Caching occurs at the class level, with indexes prefixed by the table’s origin Composer. This means that cached selects/group-bys will be available regardless of the provided Accessor, so long as the same Composer is used under the hood.
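The class-level, Composer-prefixed caching described above might be sketched roughly as follows. This is a simplified model under stated assumptions, not the library’s actual implementation: the cache layout and names like composer_name, _cache, and the accessor’s select method are illustrative.

```python
class IndexerSketch:
    # Class-level cache shared by all instances; keys are prefixed by the
    # originating Composer, so cached results remain usable across different
    # Accessor instances backed by the same Composer.
    _cache: dict = {}

    def __init__(self, accessor):
        self.accessor = accessor

    def cached_query(self, table, where=None, order_by=None):
        # Stringify every parameter so SQLAlchemy-like objects key the cache
        # by their rendered form rather than by memory address.
        key = (
            self.accessor.composer_name,  # illustrative attribute
            str(table), str(where), str(order_by),
        )
        if key in self._cache:
            return self._cache[key]
        rows = self.accessor.select(table, where=where, order_by=order_by)
        self._cache[key] = rows
        return rows
```

Because _cache lives on the class, a second Indexer instance built over a different Accessor (but the same Composer) would see the same cached entries.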
- cached_query(table, cols=None, where=None, distinct_on=None, order_by=None, limit=0, group_by=None, agg_on=None, index_on=None)[source]¶
Like group_by, but makes a full query to the Accessor’s table table_name and caches the results. The processing performed by the GROUP BY is also cached.
Update: cached_select and cached_group_by are now unified by a single cached_query method. This allows better-defined GROUP BY caches that are reactive to the full set of parameters producing the result set (and not just the table, which would require a full query).
- Note: on cache keys
Cache keys are now fully stringified, as many objects are now allowed to be native SQLAlchemy objects. Indexing these objects works, but doing so will condition the cache on their memory addresses, which isn’t what we want. SQLAlchemy converts most join/column/table-like objects to reasonable strings, which will look the same regardless of instance.
Context: this became a clear issue when passing in order_by=<col>.desc(). The desc() call causes the index to store the column in an instance-specific way, rather than as an easily reusable, canonical column reference. Each time CoreDatabase.files() was called, for instance, that @property would be re-evaluated, causing desc() to be re-initialized and thus look different to the cache. Stringifying everything prevents this (although this could well be an indication that only a single cache_block should ever be returned by database properties).
- Note: on access locks
A double-checked locking scheme is employed before both stages (select and manual group by), using the same lock. This resolves the common scenario where many threads look up a query in the cache, experience a cache miss, and all try to do the work. In my experience this non-linearly explodes the total wait time, so doing the work only when needed saves tons of time, especially in high-congestion moments.
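The double-checked locking pattern described here can be sketched in plain Python. This is a generic illustration of the scheme, not the library’s actual code; the names are hypothetical.

```python
import threading

_cache = {}
_lock = threading.Lock()

def cached_compute(key, compute):
    # First (lock-free) check: the common case of a cache hit pays no
    # locking cost at all.
    if key in _cache:
        return _cache[key]
    with _lock:
        # Second check under the lock: another thread may have filled the
        # entry while we were waiting, so only one thread does the work.
        if key not in _cache:
            _cache[key] = compute()
    return _cache[key]
```

The unlocked dict read is safe under CPython’s GIL; the lock exists only to ensure the expensive compute() runs once per key rather than once per waiting thread.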
- classmethod group_by(rows, group_by, agg_on=None, index_on=None, return_index=False)[source]¶
Post-query “group by”-like aggregation. Creates an index over a set of columns (group_by_cols), and aggregates values from agg_cols under the groups.
Rows can be dicts or mappings, and columns can be strings or SQLAlchemy columns. To ensure the right columns are used for the operation, it’s best to pass in mappings and use SQLAlchemy columns if you aren’t sure exactly how the keys look in your results (dicts can have ambiguous keys across tables with the same columns and/or different labeling schemes altogether).
TODO: add a flag that handles NULLs as distinct. That is, for the group_by column(s) of interest, if rows in the provided query set have NULL values for these columns, treat all such rows as their “own group” and return them alongside the grouped/aggregated ones. This is behavior desired by something like FTSManager.recreate(), which wants to bundle up conversions for blocks (effectively grouping by blocks.name and link.id, aggregating on block_conversions.format, then flattening). You could either do this, or, as the caller, first filter the result set before grouping (e.g., splitting the NULL-valued rows from those that are well-defined) and then stitch the two sets back together afterward.
Multi-dim update: group_by can be a tuple of tuples of columns. Each inner tuple defines a nested “group by index” within the overall group by index.
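A minimal model of this post-query aggregation over dict rows may help fix the idea. This sketch handles only single-level grouping with string keys; the real method also supports SQLAlchemy columns, mappings, nested group-by tuples, and index returns.

```python
from collections import defaultdict

def group_by_sketch(rows, group_by, agg_on):
    # Build an index keyed by the group-by column's value, aggregating the
    # agg_on column's values into a list per group.
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_by]].append(row[agg_on])
    return dict(groups)

rows = [
    {'name': 'a', 'format': 'html'},
    {'name': 'a', 'format': 'pdf'},
    {'name': 'b', 'format': 'html'},
]
# group_by_sketch(rows, 'name', 'format')
# -> {'a': ['html', 'pdf'], 'b': ['html']}
```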