Consider validating sensible key values in epi_archive, valid ops for & performance improvements from nonunique keys #89

Closed
brookslogan opened this issue Jun 2, 2022 · 2 comments
Labels: op-semantics (operational semantics; many potentially breaking changes here), P2 (low priority), performance, REPL (improved print, errors, etc.)

brookslogan (Contributor) commented Jun 2, 2022

Currently, we do not check for distinct key values in epi_archive.

  • User experience: skipping the check is convenient, but it also lets users form unexpected or invalid archives.
  • Semantics: we assume there are no duplicate key values. $as_of might give the right result if the duplicate keys carry duplicate values and we don't want duplicate rows as output, but it gives the wrong result if the user is instead trying to take advantage of a particular update-reporting structure. Such structures might enable some performance improvements, which we could exploit further by being more flexible with the key, as described next. (A sketch of the ambiguity follows this list.)
  • Performance: we currently require a version search for every key value across the single huge archive DT.
    • If all (geo_value, time_value, otherkey1, ..., otherkeyn) are re-reported in every version --- i.e., DT holds full snapshots --- then we only need to look up the version once and can key by version alone (see the snapshot sketch after this list). But we can't use the unique lookup for as_of here; maybe a rolling join would work & generalize to the next case.
    • If all (geo_value, otherkey1, ..., otherkeyn) are re-reported in every version for time_values in version - (1:windowlength), then we could key just by (time_value, version).
    • If we have patch-based reporting and no special guarantees, then we need (geo_value, otherkey1, ..., otherkeyn, time_value, version) as the key, and there should be no duplicates.
    • (Stratifying the DT into multiple DTs, say one per geo, might increase the number of lookups required but make each of them faster.)
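
A minimal sketch of the duplicate-key ambiguity in $as_of, using plain data.table with hypothetical column values (not epiprocess's actual internals):

library(data.table)

# Hypothetical archive whose (geo_value, time_value, version) key is duplicated
# with conflicting values -- any as_of result for this key is ambiguous.
archive_dt <- data.table(
  geo_value  = c("ca", "ca"),
  time_value = c(1L, 1L),
  version    = c(1L, 1L),
  value      = c(10, 11)
)
setkeyv(archive_dt, c("geo_value", "time_value", "version"))

# A typical LOCF-style as_of: per (geo_value, time_value), keep the row(s)
# with the largest version <= max_version. With duplicate keys this silently
# returns two conflicting rows for the same measurement.
max_version <- 1L
archive_dt[version <= max_version,
           .SD[version == max(version)],
           by = .(geo_value, time_value)]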
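
And a sketch of the full-snapshot case, where a single lookup on a version-only key replaces the per-key version search (hypothetical data; again plain data.table, not the package's actual code):

library(data.table)

# Full-snapshot archive: every (geo_value, time_value) is re-reported in every
# version, so version alone can serve as the key.
snap <- data.table(
  geo_value  = rep(c("ca", "nv"), times = 3),
  time_value = rep(1L, 6),
  version    = rep(1:3, each = 2),
  value      = 1:6
)
setkey(snap, version)

# as_of then reduces to one version lookup: find the largest available version
# <= max_version, then take that whole snapshot with a single keyed
# (binary-search) subset instead of one search per (geo_value, time_value).
max_version <- 2L
v <- max(snap$version[snap$version <= max_version])
snap[.(v)]  # the two rows reported in version 2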

Might interact with #87.

@brookslogan brookslogan added P2 low priority performance labels Jun 2, 2022
@brookslogan brookslogan added the REPL Improved print, errors, etc. label Jun 7, 2022
brookslogan (Contributor, Author) commented

The performance part is a duplicate of part of #76.

brookslogan (Contributor, Author) commented

as_epi_archive(tibble(geo_value = 1, time_value = 1, version = 1, value = c(1, 1)))
#> Error in `new_epi_archive()` at epiprocess/R/archive.R:516:3:
#> ! `x` must have one row per unique combination of the key variables. If you have additional key
#>   variables other than `geo_value`, `time_value`, and `version`, such as an age group column, please
#>   specify them in `other_keys`. Otherwise, check for duplicate rows and/or conflicting values for the
#>   same measurement.
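
For reference, a minimal sketch of the kind of duplicate-key check involved (validate_unique_keys is a hypothetical helper, not the actual new_epi_archive() implementation):

library(data.table)

# Stop if any (geo_value, other_keys..., time_value, version) combination
# appears on more than one row of the archive's DT.
validate_unique_keys <- function(dt, other_keys = character(0)) {
  key_cols <- c("geo_value", other_keys, "time_value", "version")
  if (anyDuplicated(dt, by = key_cols) > 0L) {
    stop("`x` must have one row per unique combination of the key variables.")
  }
  invisible(dt)
}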
