Consider performance improvements for epix_slide, as_of #76

Open
brookslogan opened this issue May 13, 2022 · 3 comments
Labels
P2 low priority performance

Comments

@brookslogan
Contributor

brookslogan commented May 13, 2022

Currently, epix_slide and as_of have reasonable performance for pseudoprospective analysis of some state-level forecasters, such as those predicting 23 quantiles using rq. However, for more efficient forecasters (using rq.fit, lm, or lm.fit) or simpler operations such as 7-day averaging, the overhead becomes noticeable or can even dwarf the actual operation's computation time. Some ideas for improvements:

  • Rather than use one huge data.table, stratify by some/all non-version key columns, or by chunks of non-version key column values (see the first sketch after this list). Drop parts of the key in these table chunks if it helps with memory/speed. Filtering based on key-without-version values can be done in another data structure that holds all these table chunks together. The goal would be to (a) try to avoid any searching for the non-version part of the key that might be done for every non-version key value as part of as_of, and (b) try to avoid extra work in the version search from operations on non-version parts of the key. Stratifying too finely could be counterproductive due to (a) overhead looping over chunks in R and FFI (if looping is done in R), and (b) reduced spatial locality.
  • [Store key columns internally as integers only, with auxiliary data structures to convert between non-integer classes and internal integer representations? (Snowflake schema / interning.)]
  • Customize $slide to do something smarter than looking up each requested snapshot separately: apply updates to get from one snapshot to the next (see the second sketch after this list); or start searching for the next version of each observation from where the previous one was located; or check whether the previous version of some value will still be there in the next version, and only perform lookups for those that will change. Using a different key ordering and/or a secondary index might help here. [And/or some tricks with as.list, the sorted attribute, and setDT. But these tricks might not be necessary; if there is an efficient issue-set partitioning function / application of by & .SD, the DT could be split into parts with issue <= min(ref_time_value), min(ref_time_value) < issue <= second-lowest(ref_time_value), etc., each part collapsed into a simplified update using rolling joins, and then the slide would simply apply the updates sequentially... or some modification of this to handle the simultaneously rolling time_value window.]
    • Similarly, for all_versions=TRUE, consider performing the partition above and iteratively appending to get the appropriate DTs, rather than filtering the original. Or figure out some way to not have to form a separate DT at all, like a filter view subclass.
  • Allow duplicate non-version-key values in the DT; then, if an update changes one value with a particular non-version-key, it must re-report all unchanged values with the same non-version-key as well. For example, if the DT contains full snapshots for every version, then the non-version-key could be empty. Or the non-version-key could be just the time_value if every geo's data for a time value is re-reported if any geo's data for that time value has been revised. This reduces the key size and maybe the number of lookups required, but will probably necessitate some changes to how as_of, etc., are implemented.
  • [Use native code? E.g., Rust via rextendr, which may not require a license change and might be directly Python-portable, or Rcpp which has tighter R integration but might require a license change.]
  • Avoid an unnecessary copy that likely occurs when converting each snapshot into an epi_df in $as_of.
  • Try storing version_lag rather than version, transforming the computations that depend on version accordingly.
  • Try putting version_lag as the first part of the key, allowing fast filters based on before, but requiring the use of unique rather than potentially using rolling joins (although an index could be used to speed these up).
  • Add a version_lag column and store an index on it, to speed up the before filter.
  • Add options for warm starts from nearby times/groups.
  • [Consider alternative backends: different data frame variants, or one data frame variant per epikey, to see whether reducing the key length at the cost of more function applications is helpful overall, and whether hash-based or other epikey-lookup approaches beat folding the epikey into the binary search over epikey-time-version (or whatever other algorithm data.table uses). If function call overhead or memory nonlocality is an issue, then maybe a chunked approach could be considered, though this probably requires at least one additional key column in the data.tables.]
  • ...
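
As a concrete illustration of the first bullet's stratification idea, here is a minimal data.table sketch; the column names mirror an epi_archive's DT, but the data and the chunking scheme are made up, not epiprocess internals:

```r
library(data.table)

# Illustrative archive-style table keyed by (geo_value, time_value, version).
DT <- data.table(
  geo_value  = rep(c("ca", "ut"), each = 4),
  time_value = as.Date("2021-01-01") + c(0, 0, 1, 1),
  version    = as.Date("2021-01-01") + c(1, 2, 2, 3),
  value      = rnorm(8),
  key = c("geo_value", "time_value", "version")
)

# Stratify into one chunk per geo_value, dropping geo_value from each chunk
# so per-chunk version searches only ever touch (time_value, version).
chunks <- split(DT, by = "geo_value", keep.by = FALSE)
invisible(lapply(chunks, setkeyv, cols = c("time_value", "version")))

# A per-chunk as_of is then a rolling ("latest version at or before") join on
# the shortened key; geo filtering happens on the outer list instead.
ca <- chunks[["ca"]]
ca_asof <- ca[.(unique(ca$time_value), as.Date("2021-01-02")),
              on = c("time_value", "version"), roll = TRUE, nomatch = NULL]
```

Whether this wins likely depends on chunk count: per-chunk joins search a shorter key, but R-level looping over many chunks adds its own overhead, as the bullet notes.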
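
For the third bullet's "apply updates sequentially" idea, here is a minimal sketch under the same illustrative layout (a DT with geo_value, time_value, version, value). apply_updates and sequential_snapshots are hypothetical names, not epiprocess API, and a real implementation would presumably use keyed/rolling operations rather than .SD[which.max(version)]:

```r
library(data.table)

# Upsert one batch of updates into a snapshot on the non-version key:
# updated keys replace their old rows and new keys are appended.
apply_updates <- function(snapshot, updates) {
  combined <- rbind(snapshot, updates)
  # Updates were appended last, so fromLast = TRUE makes them win ties.
  unique(combined, by = c("geo_value", "time_value"), fromLast = TRUE)
}

# Walk the requested ref_time_values once: take a full as_of-style snapshot
# only at the first one, then roll forward by applying just the updates whose
# version falls between consecutive ref_time_values.
sequential_snapshots <- function(DT, ref_time_values) {
  ref_time_values <- sort(ref_time_values)
  latest_per_key <- function(x) {
    x[, .SD[which.max(version)], by = .(geo_value, time_value)]
  }
  snapshot <- latest_per_key(DT[version <= ref_time_values[1]])
  out <- vector("list", length(ref_time_values))
  out[[1]] <- snapshot
  for (i in seq_along(ref_time_values)[-1]) {
    updates <- latest_per_key(
      DT[version > ref_time_values[i - 1] & version <= ref_time_values[i]])
    snapshot <- apply_updates(snapshot, updates)
    out[[i]] <- snapshot
  }
  names(out) <- as.character(ref_time_values)
  out
}
```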

Any performance improvements here should be planned together with considerations of alternatives to storing data in RAM. E.g., if the RAM storage is abandoned, then some of these potential improvements may not apply, or would need to be implemented in a different way. [They should also be coordinated with any move to S3+dtplyr; completing these changes in parallel would likely involve substantial merge conflicts.]

[Consider also replacing tibble::as_tibble() with setDF in as_of (see the sketch below), although this may need checks that the original DT or its columns aren't aliased in exceptional circumstances.]
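
A minimal sketch of the two conversion paths, assuming tibble is installed; the snapshot here is constructed directly, standing in for the freshly filtered DT that $as_of produces:

```r
library(data.table)

# A small stand-in for a filtered snapshot; the filter in as_of already
# allocates a new data.table, so nothing below aliases the archive's DT.
snapshot_dt <- data.table(geo_value  = "ca",
                          time_value = as.Date("2021-01-01"),
                          version    = as.Date("2021-01-02"),
                          value      = 1.0)

# Copying path: as_tibble() materializes a new object on top of the snapshot.
tbl <- tibble::as_tibble(snapshot_dt)

# In-place path: setDF() flips the class without copying columns. The caveat
# in the bracketed note applies: this is only safe when snapshot_dt truly owns
# its column vectors rather than sharing them with the archive's DT.
df <- setDF(snapshot_dt)
```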

@brookslogan
Contributor Author

Tentatively marking this as P2, unless epipredict development reveals that it is a major issue.

@nmdefries
Contributor

Partially addressed in #386. Some additional performance ideas here and here.

@dshemetov
Contributor

dshemetov commented Apr 26, 2024

Not sure where to post this, but this seems like a close enough issue. Coming from a discussion with @brookslogan @dsweber2. We're looking for a "sparse archive double slide" function. What we mean is constructing features for every as_of in an archive, but using a smarter approach than repeatedly calling epix_as_of, applying epi_slide, converting back to an archive, and compactifying (see the baseline sketch below). The faster approach would compactify as you go.
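
To make the target concrete, a rough sketch of the baseline being described, on a toy archive; epi_slide/epix_as_of argument names are approximate and vary across epiprocess versions, and value_7dav is just an illustrative feature:

```r
library(dplyr)
library(epiprocess)

# Toy archive standing in for a real one: one geo, daily data, each value
# reported one day after its time_value.
archive <- as_epi_archive(tibble::tibble(
  geo_value  = "ca",
  time_value = as.Date("2021-01-01") + 0:9,
  version    = as.Date("2021-01-02") + 0:9,
  value      = 1:10
))

# Baseline: recompute a 7-day-average feature from scratch at every version,
# stack the per-version results, and compactify once at the end.
versions <- sort(unique(archive$DT$version))
feature_updates <- lapply(versions, function(v) {
  epix_as_of(archive, v) %>%
    group_by(geo_value) %>%
    epi_slide(value_7dav = mean(value), .window_size = 7) %>%
    ungroup() %>%
    mutate(version = v)
})
feature_archive <- as_epi_archive(bind_rows(feature_updates))
```

A compactify-as-you-go version would instead, at each version, recompute the feature only where the window contents actually changed and append only rows that differ from the previous version, rather than stacking full snapshots and deduplicating at the end.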
