Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

new record in snapshot to track deleted state (is_deleted column, zombie records) #10235

Open
Tracked by #10151
gshank opened this issue May 28, 2024 · 5 comments · May be fixed by #10972
Open
Tracked by #10151

new record in snapshot to track deleted state (is_deleted column, zombie records) #10235

gshank opened this issue May 28, 2024 · 5 comments · May be fixed by #10972
Assignees
Labels
snapshots Issues related to dbt's snapshot functionality user docs [docs.getdbt.com] Needs better documentation

Comments

@gshank
Copy link
Contributor

gshank commented May 28, 2024

Description

Imagine order_id=1 is deleted from the table you're snapshotting.

If invalidate_hard_deletes=true, deleted records are invalidated by setting dbt_valid_to to the current timestamp:

order_id dbt_valid_from dbt_valid_to
1 2024-05-19 2024-05-20
2 2024-05-20

If invalidate_harde_deletes=false, deleted records are not invalidated, dbt_valid_to remains as NULL (you can’t really tell if they’ve been deleted):

order_id dbt_valid_from dbt_valid_to
1 2024-05-19
2 2024-05-20

What folks WANT, is tracking the “deleted” state as a state the record is in:

order_id dbt_valid_from dbt_valid_to dbt_is_deleted
1 2024-05-19 2024-05-20 False
1 2024-05-20 True
2 2024-05-19 False

and maybe later it comes back as a zombie record “it comes back to life”:

order_id dbt_valid_from dbt_valid_to dbt_is_deleted
1 2024-05-19 2024-05-20 False
1 2024-05-20 2024-06-03 True
1 2024-06-03 False
2 2024-05-19 False

Here’s the same exact same data as it exists today if you were using the invalidate_hard_deletes config. Note how there are no rows for order_id=1 between the time period of 2024-05-20 to 2024-06-03 — it has a “gap” during that time period. Contrast this with the proposed is_deleted column above that has “no gaps”.

order_id dbt_valid_from dbt_valid_to
1 2024-05-19 2024-05-20
1 2024-06-03
2 2024-05-19

Acceptance Criteria

Provide users a “new way to track deleted records”

invalidate_hard_deletes track hard deletes by adding a new record when row become "deleted" in source
implicit delete always explicit
has gaps has no gaps
smaller datasets larger datasets
  • new config hard_deletes (name tbd)
    • default (if it's not set) is hard_deletes = 'ignore', current behavior when invalidate_hard_deletes is not set today
    • replaces old invalidate_hard_deletes config, by setting hard_deletes = 'invalidate' (name tbd; other ideas 'invalidate_current_record')
    • new behavior is captured by setting hard_deletes = 'new_record' (name tbd; other ideas 'track', 'insert_new_record', 'is_deleted_flag', 'is_deleted_indicator', 'new_row')
    snapshots:
      my_project:
        +hard_deletes: new_record
    
  • can set config as project default or for individual snapshot
  • what should happen when someone changes this config? [spike?]
    • we offer migration utils in an adapter-agnostic way
    • support this config for net-new snapshots
  • handling deletions when a record comes back to life but updated_at field has not changed
  • should we also add the new dbt_is_deleted meta-field when hard_deletes = 'invalidate'?

Notes

  • what should happen when someone changes this config?
  • should new meta field be dbt_is_deleted or dbt_deleted_at? i think dbt_is_deleted makes more sense because the deleted at timestamp can be derived from whatever dbt_valid_from is for the deleted row
order_id dbt_valid_from dbt_valid_to dbt_is_deleted dbt_deleted_at
1 2024-05-19 2024-05-20 False
1 2024-05-20 2024-06-03 True 2024-05-20
1 2024-06-03 False
2 2024-05-19 False
@gshank gshank added the snapshots Issues related to dbt's snapshot functionality label May 28, 2024
@graciegoheen graciegoheen added the user docs [docs.getdbt.com] Needs better documentation label May 29, 2024
@owen-mc-git
Copy link

@graciegoheen @gshank this summary of #8207 seems like its discussing a deleted flag for a row, whereas 8207 was the treatment of a column in a row when the column was removed in source.
e.g
we had columns a,b,c in our source table, with values 'x, 'y', 'z' in each
column c was deleted in source, we'd be expecting a new row with values 'x', 'y', null (because z no longer exists)

DBT handles a new column from source gracefully and records the change but the reverse was not catered for. That was the premise of 8207. from reading this issue I do not get the same sense.

@graciegoheen
Copy link
Contributor

Thanks for clarifying @owen-mc-git - I'll re-open #8207!

@graciegoheen graciegoheen changed the title Improve capturing the history of records in a deleted state new record in snapshot to track deleted state (is_deleted column, zombie records) Jun 17, 2024
@graciegoheen
Copy link
Contributor

graciegoheen commented Jun 17, 2024

Migrations for hard_deletes config

  • from ignore to invalidate:
    • set dbt_valid_to to current timestamp for latest record of every unique_key that no longer exists in the source
  • from ignore to new_record:
    • create new dbt_is_deleted column set to false for every record
    • set dbt_valid_to to current timestamp for latest record of every unique_key that no longer exists in the source
    • insert new record with dbt_valid_from as current timestamp, dbt_valid_to null, and dbt_is_deleted true, for every unique_key that no longer exists in the source
  • from invalidate to new_record:
    • create new dbt_is_deleted column set to false for every record
    • insert new record with dbt_valid_from as latest dbt_valid_to, dbt_valid_to null, and dbt_is_deleted true, for every unique_key that has 0 records in the snapshot with dbt_valid_to as null
  • from new_record to invalidate:
    • delete every record where dbt_is_deleted is true
    • delete dbt_is_deleted column
  • from invalidate to ignore:
    • update dbt_valid_to to null for the latest record of every unique_id that has 0 records in the snapshot with dbt_valid_to as null
  • from new_record to ignore:
    • delete every record where dbt_is_deleted is true
    • delete dbt_is_deleted column
    • update dbt_valid_to to null for the latest record of every unique_id that has 0 records in the snapshot with dbt_valid_to as null

@peterallenwebb peterallenwebb self-assigned this Sep 25, 2024
@aditya96166
Copy link

can this above config be used and is available for general usage?

@peterallenwebb
Copy link
Contributor

@aditya96166 Not yet, but we are planning this for dbt-core 1.9, for which a beta will be available soon.

@peterallenwebb peterallenwebb linked a pull request Nov 4, 2024 that will close this issue
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
snapshots Issues related to dbt's snapshot functionality user docs [docs.getdbt.com] Needs better documentation
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants