Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bundle: Version control for schema #288

Open
erichanson opened this issue Dec 6, 2023 · 0 comments
Open

bundle: Version control for schema #288

erichanson opened this issue Dec 6, 2023 · 0 comments
Labels
epic big meta-ticket that is more of an outcome than a specific task refactor

Comments

@erichanson
Copy link
Member

erichanson commented Dec 6, 2023

Right now bundle just versions "regular" data, actual rows in normal tables. Tracking schema rows (meta.table, meta.column, meta.schema, meta.function, meta.type, etc.) is the next frontier. Past experiments have have shown it's possible to at least commit meta rows and then check them out, but certainly not with any level of robustness or fault tolerance.

Making this change necessitates some clarifications of the information model:

New Columns

In the current model, a commit is a set of:

  • rows added
  • rows deleted
  • changed fields

When a row is added, it gets a rowset_row_field for every column in the row's table, basically whatever is present in the db at the time.

Now, let's say the user does a alter table public.x add column y. What happens?

a) Untracked Meta Row

  • If meta.column is being tracked in bundle.trackable_nontable_relation, (currently a global, which is wrong, bundle: per-bundle ignore rules #287), then a new meta.column row will appear in untracked_rows, with row_id (meta, column, id, meta.column_id('public','x','y')).
  • Same with a column's other potential rows in meta, meta.constraint_unique, meta.constraint_check, meta.foreign_key, etc.

Constraints will have a pg_depend-level dependency on the column (I believe) and the column will have a dependency on its table.

b) New Fields on Old Rows

After column creation, a new field appears on each row in the table. What are these?? Are they "untracked_fields"? (Not a thing). Are they just another field change, from void to existence? It is a different category of change, that does not fit into the above model.

  • On commit, I expected that fields in new columns would just propagate into rowset_row_field, but that is not what is happening. The stage_row_field view just pulls fields from the previous commit, and doesn't detect new columns on old rows. It does however use meta.relation_column for new rows, so those rows do get a field for the new column.
  • stage_row_field is supposed to contain what rowset_row_field would look like if a commit was done right now. Commit actually just pulls from this view to create the next commit. But when a column is added, what should the next commit look like?
    • if this bundle is not tracking its own schema, pretty clearly the field should just included for each row in the next commit
    • if it is tracking its own schema, if the column is created but not staged, its fields should not be included?

c) Field-level Versioning vs Row-level Versioning

This distinction is at the heart of all the questions above.

It's tempting to try to pursue "field-level" tracking, and add a new dimension to the information model for "untracked_field" and "field_added". I don't believe this is impossible, but it's two steps (leaps) forward. Ideally, multiple bundles could version in an extremely granular way, version control a single field, or a single column. One bundle defines a table and some columns, but another bundle adds a column to that table and tracks its fields. This is beyond what even git does AFAICT; the git equivalent would be multiple git repositories operating in the same directory, version controlling different files simultaneously. This is the Semantic Web model, it's all just triples, a commit is a collection of meta.field_ids, period.

The simpler path is to, for this next step at least, just say that a commit saves whatever fields are present on the row at commit time. In head_db_stage, we can at least detect and display that a row has new fields, or fields deleted.

It's going to be important to keep an eye on the distinction between a field whose value has been set to NULL, and a field whose column has been deleted.

Some new functions get_commit_fields(commit_id), get_commit_rows(commit_id), get_db_fields(commit_id) and get_db_rows(commit_id) are much more axiomatic and precise (and fast) functions for addressing this space. The get_commit_* functions return exactly which rows and fields were in the previous commit, disregarding the working copy entirely. get_db_rows() returns rows from the supplied commit, along with whether or not each row exists in the db. get_db_fields() returns all fields present in the database on the supplied commit's rows; rows from the supplied commit that have been deleted are not present.

These functions are the foundation for refactoring the entire space with an eye on schema changes.

Column Renames

In terms of meta, they're a deleted row and an added row in meta.column. Maybe we can do something smarter. One cool idea is to detect them by seeing if the oid is the same.

Deleted Columns

Again, if we're doing field-level versioning, this is a new category of change, a field_delete. If not, just include whatever columns are present in the db at commit-time, in which case this is easy.

Object Dependencies

See #283.

@erichanson erichanson added epic big meta-ticket that is more of an outcome than a specific task refactor labels Dec 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
epic big meta-ticket that is more of an outcome than a specific task refactor
Projects
None yet
Development

No branches or pull requests

1 participant