Cache built kibblesets for later reuse #229

Open
Viqsi opened this issue Jan 18, 2024 · 4 comments

Viqsi commented Jan 18, 2024

The most frequent slowdown we run into in regular operations is when a kibble set has to be built for a given variable. We're already running into situations where we have to "cache" these results (witness the proliferation of All-EXISTS cohorts for FHVs in particular), so it might be worthwhile to just go the next step and formalize this caching process.

This could potentially speed up the creation of cohorts, custom variables, and downloads (for example, previewing a potential cohort would populate the cache and thus make the ensuing creation - and subsequent previews - much faster), simplify the process of envaluing a custom variable, and keep cohortInds from becoming an untenably slow monster. It also cleanly separates an existing de facto system caching function from vital user data. We'd just have to remember to purge the kibble cache any time variable data changes (not often) and make sure the resulting cache table(s) is/are supremely well indexed, and we're golden.
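The purge-on-change rule could be as simple as deleting the affected variable's rows whenever its data is touched. A minimal sketch, using SQLite in place of MariaDB purely for illustration - the `kibble_cache` table, its contents, and the `purge_variable` helper are all my assumptions, not an existing schema:

```python
import sqlite3

# Hypothetical cache table; column names only, no claim about the real schema.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE kibble_cache (
    new_ind_id INTEGER, variable_name TEXT, value TEXT)""")
conn.executemany("INSERT INTO kibble_cache VALUES (?, ?, ?)",
                 [(1, "race", "A"), (2, "race", "B"), (1, "site", "X")])

def purge_variable(conn, variable_name):
    """Invalidate cached rows for one variable after its data changes."""
    with conn:  # transaction: commit on success, roll back on error
        conn.execute("DELETE FROM kibble_cache WHERE variable_name = ?",
                     (variable_name,))

purge_variable(conn, "race")
remaining = conn.execute("SELECT COUNT(*) FROM kibble_cache").fetchone()[0]
print(remaining)  # 1
```

Since variable data changes rarely, per-variable deletes like this should be cheap compared with rebuilding kibble sets on every lookup.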

Something to give some serious thought IMO.

@Viqsi Viqsi added this to the Aspirational milestone Jan 18, 2024
@Viqsi Viqsi self-assigned this Jan 18, 2024

Viqsi commented Jul 16, 2024

A different concept of how this might work came to mind. As described to @WValenti in Skype (with some edits):

The idea amounts to taking that concept I already have with "SVVFHVs" (the persistent table that is actually a cache) and doing something similar with 'kibble' - creating a new table that basically pares a copy of 'kibble' down to the few bits DIVER actually cares about - new_ind_id, variable_name (as determined by equivalence_groups.first_member), value.

The reasons this comes to mind:

  • The kibble table is currently ~5.7GB on disk, a little over two thirds of which is indexes.
  • A "down to DIVER's essentials" copy would likely be smaller, because of the elimination of both 1) "invalid" rows (read: rows for ind_ids and variable_ids in interactive tables) and, more importantly, 2) indexes for extra columns that this pared-down copy wouldn't need.
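A sketch of what that pared-down copy might look like. SQLite stands in for MariaDB here, and everything except the three column names from the comment above (new_ind_id, variable_name, value) - the table name, types, and index choices - is an assumption:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical "DIVER essentials" copy of kibble: just the three columns.
CREATE TABLE kibble_cache (
    new_ind_id    INTEGER NOT NULL,
    variable_name TEXT    NOT NULL,
    value         TEXT
);
-- "Supremely well indexed": a covering index per lookup direction, so
-- cohort/variable/download queries resolve entirely within the index.
CREATE INDEX kc_by_variable ON kibble_cache (variable_name, value, new_ind_id);
CREATE INDEX kc_by_ind      ON kibble_cache (new_ind_id, variable_name);
""")
```

Dropping the extra columns shrinks the indexes too, which is where two thirds of the current 5.7GB lives.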

It'd have to be "rebuilt" each time we add data, but we're already having to rebuild and recheck things with additions and revisions to distributions. So make it a formal process, designate specifically what we use it for in DIVER, and leave the DIGS stuff (which is otherwise tracking necessary, vital, and yet convoluted information) out of it.

And that way any kibble operation - cohort creation, custom variable creation, download creation - is ostensibly much simpler, because the number of lookups needed for each row is drastically reduced and we don't have to do it with temporary stuff generated on the fly.
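The rebuild step could amount to a single drop-and-refill `INSERT ... SELECT`. A sketch under stated assumptions: the `kibble` and `equivalence_groups` stand-ins below are guesses beyond the columns the comment names (`new_ind_id`, `value`, `equivalence_groups.first_member`), and SQLite is again a stand-in for MariaDB:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical stand-ins for the real DIGS tables.
CREATE TABLE kibble (new_ind_id INTEGER, variable_id INTEGER, value TEXT);
CREATE TABLE equivalence_groups (variable_id INTEGER, first_member TEXT);
CREATE TABLE kibble_cache (new_ind_id INTEGER, variable_name TEXT, value TEXT);
""")
conn.executemany("INSERT INTO kibble VALUES (?, ?, ?)",
                 [(1, 10, "A"), (2, 10, "B"), (1, 20, "X")])
conn.executemany("INSERT INTO equivalence_groups VALUES (?, ?)",
                 [(10, "race"), (20, "site")])

def rebuild_kibble_cache(conn):
    """The 'rebuilt each time we add data' step: drop and refill in one go."""
    with conn:  # one transaction, so readers never see a half-built cache
        conn.execute("DELETE FROM kibble_cache")
        conn.execute("""
            INSERT INTO kibble_cache (new_ind_id, variable_name, value)
            SELECT k.new_ind_id, eg.first_member, k.value
            FROM kibble k
            JOIN equivalence_groups eg ON eg.variable_id = k.variable_id
        """)

rebuild_kibble_cache(conn)
rows = conn.execute("SELECT COUNT(*) FROM kibble_cache").fetchone()[0]
print(rows)  # 3
```

In the real thing, the `SELECT` would also filter out the "invalid" rows (ind_ids and variable_ids in interactive tables) mentioned above; the inner join shown here is just the shape of the operation.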

Heck, it'd eliminate the need for AECs for existing (non-custom) variables entirely.

So yeah, a thought coming to mind. I think it achieves the objective of Don't Have A Giant Sparse Table while still making DIVER work and without impacting the DIGS backend itself.

(It'd probably also make deployment less of a headache for ISI.)

WValenti commented:

As discussed in person, kibble is subject to extensive development for the addition of longitudinal information. This is going to be a long-term project as it requires significant noodling first, and will involve UI changes as well.

WValenti commented:

Also, I believe in a cacheless society. Gimme credit for my experience - cache causes problems. ;-)


Viqsi commented Jul 25, 2024

I mean, I would prefer not to have "cache tables" myself, but until MariaDB/MySQL give us native support for materialized views, the performance penalties are way too severe. :(
