Proposal for tracking row-based provenance #1113

Open
rly opened this issue Sep 16, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

rly commented Sep 16, 2024

For common tables, users may accidentally (or intentionally) modify a row created by another user. So it is useful to know who was the last person who modified a row and when. Similarly, it is useful to know who created the row in the first place.

A few types of actions (declare table, drop table, delete from table) are tracked in the hidden DataJoint ~log table that is part of each database (e.g., common_nwbfile), but those are limited to table-wide actions and deletions, and the log may be hard to parse for everyday users.

There are many ways to do this, but since we care only about the latest state of each row, one common approach is to add columns to the data tables describing that state:

  • created_by or creation_user (user) (with or without IP address)
  • created_on or created_time or creation_timestamp (datetime)
  • last_modified_by (user)
  • last_modified_on or last_modified_time or last_modification_timestamp (datetime)
  • last_modified_reason (string, e.g., "database migration" or "fix typo") (not sure if useful)
  • last_modified_spyglass_version (string) (not sure if useful)
  • (we could also add fields like spyglass_environment, a text-based dump of the virtual environment, host name, and operating system, but I think that is overkill and would result in a bloated database; we have thought about this for NWB and are settling on the version of the primary software package as good enough for our initial attempt at tracking provenance)

This is now doable in the latest version of DataJoint (not released yet?) without cluttering the common display of tables (i.e., the columns are "hidden" but can be queried/displayed).
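For concreteness, a minimal sketch of such a definition (the schema, table, and column names are illustrative, not existing Spyglass tables; the hidden-attribute syntax would depend on the DataJoint release mentioned above):

```python
import datajoint as dj

schema = dj.schema("example_provenance")  # hypothetical schema name


@schema
class ExampleTable(dj.Manual):
    # The provenance attributes below are ordinary secondary attributes here;
    # they could be declared "hidden" once that DataJoint feature is released.
    definition = """
    example_id: int
    ---
    value: float
    created_by: varchar(64)        # database user who inserted the row
    creation_timestamp: datetime   # when the row was inserted
    last_modified_by: varchar(64)
    last_modification_timestamp: datetime
    """
```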

Alternative: we could create a history table for every data table and add a foreign key from the data table to the history table. I'm not sure if this separation adds any value now that we can have hidden columns.

Concern: Both approaches will increase the size of the database. Is it worthwhile?

It would be nice if this were built into DataJoint as MySQL triggers, but until then, we could add the values every time we call populate.

@CBroz1 noted:

I think it would require the mixin intercepting the definition attr before table declaration and appending new lines. There would also be some migration effort of altering all existing tables.
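As a rough sketch of that insert/populate hook (not the actual Spyglass mixin; ProvenanceMixin and the column names are assumptions carried over from the list above):

```python
from datetime import datetime, timezone
import getpass


class ProvenanceMixin:
    """Sketch only: stamp inserted rows with user and time.

    Assumes the provenance columns proposed above already exist on the table;
    the declaration/ALTER handling is the harder part noted by @CBroz1.
    """

    def insert(self, rows, **kwargs):
        # assumes rows is an iterable of dicts for simplicity
        stamped = []
        for row in rows:
            row = dict(row)
            row.setdefault("created_by", getpass.getuser())
            row.setdefault("creation_timestamp", datetime.now(timezone.utc))
            stamped.append(row)
        super().insert(stamped, **kwargs)
```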

Questions:

  1. Is this enhancement worth making? How often are tables modified in ways where it would be useful to track who/when/why a change was made?
  2. If yes, what do you think about the approach?
samuelbray32 commented:
Just putting this here since it's tangential (though it doesn't cover as much as proposed): we are currently tracking the spyglass version in generated NWB files (#897, #900)

edeno added the enhancement label Sep 17, 2024
CBroz1 commented Sep 17, 2024

Problem

I'd like to state the problem concisely before exploring solutions:

Shared table use leaves (a) data provenance in question when users may have different versions of the package, and (b) data integrity in question when users may update upstream rows.

Does that capture it? I'll refer to these as 'the version problem' and 'the update problem' as I think out loud below.

Solutions

Hidden fields

Row-wise insertion/update information could track everything

  • Pros:
    • Tracks everything
    • Easily searched
    • Tightly coupled to relevant data
  • Cons:
    • Very data-dense. Certainly nothing in comparison to recordings, but a lot if it turns out to be unnecessary.
    • Perhaps very redundant in some cases: all inserts for the same session by the same user with the same spyglass version
    • Abundant relative to infrequent use? Unlikely to be examined unless something looks wrong in an analysis, in which case the analysis will likely need to be recomputed regardless of what is learned. This would just change whether someone reruns from Raw versus an intermediate step - I may be wrong, but it seems like limited utility
  • Implementation effort: 6/10?
    • Easy to add to the mixin
    • Many, many secondary-key alters, which usually run smoothly but would cause headaches if tables were forgotten or mismatched across pre/post implementation

Data density could be mitigated by being selective about which tables have these fields. Merge tables or file tables (Nwbfile, AnalysisNwbfile) could be good candidates, offering less protection against the update problem but decent timestamp gathering on who added a row when.

File as sparse monitoring

As Sam mentioned, the spyglass version in NWB files already provides a sparse solution for the version problem, and it may be sufficiently dense that one could simply delete all cases in between file saves. Is an additional tool needed to compare across all files from the same session? Is an additional field needed for user?

  • Pros: tracks version at sufficient intervals
  • Cons:
    • Hard to cross-check across files?
    • No protection against update problem
  • Implementation effort: 0/10, already present. 3/10 for additional tools?

Expanding built-in log use

The mixin would allow us to intercept insert and/or update to add additional log rows for each. Likely a better case for the update problem than the version problem.

  • Pros:
    • Ease of implementation
    • Provides a timeline of events
  • Cons:
    • Poor coupling to data - how do I run a check on whether or not a given key was subject to upstream updates?
    • Shoehorning version/primary key info into existing event varchar(255)
    • Poor searchability of substrings in the event
  • Implementation effort: 2/10?
    • Easy to add to the mixin
    • Some concerns about how and what to log, and what if the pk doesn't fit in 255 characters
    • Perhaps 5/10 if we want a mechanism to check for updates upstream of a given entry
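A rough sketch of that interception (the _append_to_log helper is a placeholder for however the ~log table would actually be written to; it is not an existing DataJoint API):

```python
import json


class UpdateLogMixin:
    """Sketch only: record an event line whenever update1 is called."""

    def update1(self, row):
        # capture the primary key of the row being updated
        key = {k: row[k] for k in self.primary_key if k in row}
        event = f"update1 {self.table_name}: {json.dumps(key, default=str)}"
        # placeholder call; truncation reflects the varchar(255) concern above
        self._append_to_log(event[:255])
        return super().update1(row)
```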

Custom log table

If row-wise tracking is too dense for version tracking, can we keep track of when a user updates their environment, and when their environment is editable, using the snippets provided by Ryan in #1087?

A new 'user history' table loads whenever spyglass is imported. It checks the user's current spyglass version and/or conda environment against their last entry, and inserts a row if different from last time (see the sketch after the list below).

  • Pros:
    • Sparse data, only keeping track of changes
    • Provides a timeline of who updated when
    • Protects against files being saved with known bugs
  • Cons:
    • Poor coupling to data
    • May require cross-referencing with one of the above methods to be useful, like log on insert, or looking at file creation time. Helper funcs could determine version(s) given a date range, and date range of analysis files given a session
  • Implementation effort: 3/10?
    • Easy to add to start-up process
    • Some questions about what to log and diff-checking with previous entry
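A minimal sketch of that start-up check, assuming a hypothetical UserEnvironment table and that the package is installed as spyglass-neuro (both assumptions, not existing Spyglass names):

```python
import getpass
from datetime import datetime, timezone
from importlib.metadata import version

import datajoint as dj

schema = dj.schema("common_user")  # hypothetical schema name


@schema
class UserEnvironment(dj.Manual):
    definition = """
    dj_user: varchar(64)
    env_timestamp: datetime
    ---
    spyglass_version: varchar(32)
    """


def log_environment_if_changed():
    """Insert a row only if the version differs from the user's last entry."""
    user = getpass.getuser()  # OS user; could instead use dj.config["database.user"]
    current = version("spyglass-neuro")  # assumed distribution name
    last = (UserEnvironment & {"dj_user": user}).fetch(
        "spyglass_version", order_by="env_timestamp DESC", limit=1
    )
    if not len(last) or last[0] != current:
        UserEnvironment.insert1(
            dict(
                dj_user=user,
                env_timestamp=datetime.now(timezone.utc),
                spyglass_version=current,
            )
        )
```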

A BugHistory lookup table could list versions and tables whose saved files were impacted. A BugChecker would look at which files were saved with that version, by that table, and return the list of files for suggested reprocessing, along with the session experimenter. This is easier if version is added to AnalysisNwbfile as a field

Permissions changes

If we're worried about 'update', we can disable it for non-admin users either in SQL or in the Python interface.

  • Pros: Only addresses the update problem
  • Cons: Increased burden on admin to execute legitimate cases
  • Implementation effort: 1/10, 2/10 for ongoing support
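On the Python side, that could look something like the sketch below (ADMIN_USERS is a placeholder; the real list would come from config or SQL roles, and SQL-level GRANT/REVOKE would still be needed to stop direct queries):

```python
import datajoint as dj


class AdminOnlyUpdateMixin:
    """Sketch only: refuse update1 for non-admin users."""

    ADMIN_USERS = {"root"}  # placeholder admin list

    def update1(self, row):
        if dj.config["database.user"] not in self.ADMIN_USERS:
            raise PermissionError("update1 is restricted to database admins")
        return super().update1(row)
```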

Replication tool

If we're worried about whether or not a given downstream key/file could be exactly replicated, we could instead focus on doing that, rather than tracking possible errors in the data.

Given a downstream key, we can already export a vertical slice of the database, with all paramset table pairings. I spent some time exploring such a tool in #1057 with the idea of running the same analysis on a different session. We could fully solve the version problem for a given analysis by instead rerunning the same session(s).

  • Pros: Full confidence in outcome
  • Cons:
    • Rerunning with a legacy version may require separating into another package
    • High implementation burden
  • Implementation effort: 8/10
    • I hit some issues with cascading selections in my current PR
    • Do we want to rerun in docker?
    • What is the standard for replication success? May look different for each case
    • How to handle manual curation and supporting files, like DLC models? Take as input?

Hypothetical goal: the production database is a staging ground; finalizing analyses means a dockerized from-scratch rerun.

General thoughts

I propose adding a line to the built-in log for update1 and some sort of tool on Nwbfile to check for updates across schemas for a given session.

I'm in favor of sparser methods of protecting against the version issue until we see definite red flags where we retroactively wish we had access to row-wise logging such as this. My gut says that adding the user history log, and adding a created-by user to analysis files, is enough to cross-reference who made which file when. Not a full solution to the version problem, but perhaps an adequate record to avoid full reprocessing.
