Support incremental loading for scd2 #1789

Closed · gnilrets opened this issue Sep 6, 2024 · 5 comments · Fixed by #1818

gnilrets commented Sep 6, 2024

Feature description

Currently, the scd2 write disposition requires running a full extract of the source data; in other words, it does not support incremental loads. The main justification for this seems to be that a full extract is the only way to detect hard deletes. While true, there are plenty of situations where we would like to perform incremental scd2 loading and hard deletes are not a concern. At the moment, our workaround is to do a regular incremental merge load followed by a dbt snapshot, but that effectively doubles the storage needed in our warehouse and introduces an extra computational step.
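
For illustration, the merge half of that workaround might look something like the sketch below (the cursor column, primary key, and fetch_rows stub are assumptions; the dbt snapshot would run downstream):

import dlt

def fetch_rows(since):
    # Stand-in for a rate-limited SaaS API client that returns only
    # rows changed after the given cursor value.
    return [{"id": 1, "updated_at": "2024-09-06T00:00:00+00:00", "name": "example"}]

@dlt.resource(primary_key="id", write_disposition="merge")
def my_dim_data(updated_at=dlt.sources.incremental("updated_at")):
    # Incremental merge load: only changed rows are extracted and
    # upserted by primary key; history must then be built separately.
    yield from fetch_rows(since=updated_at.last_value)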

Are you a dlt user?

Yes, I run dlt in production.

Use case

We have a SaaS product containing large, mutable datasets, and the vendor enforces rate limiting. Doing regular full extracts daily (or more frequently) is neither feasible nor cost-effective. Having an incremental scd2 solution, even one that didn't manage hard deletes, would be very valuable to us.

Proposed solution

The scd2 write disposition should support incremental merge, with the caveat that hard deletes are not detected when loading incrementally.
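
For contrast, scd2 as it stands assumes every load carries the complete dataset; a minimal sketch of current usage (the resource name and fetch_all_rows stub are illustrative):

import dlt

def fetch_all_rows():
    # Stand-in for a full extract of the source table.
    return [{"id": 1, "name": "example"}]

@dlt.resource(write_disposition={"disposition": "merge", "strategy": "scd2"})
def dim_data():
    # Any natural key absent from this load has its validity window
    # closed (retired), which is why a full extract is required today.
    yield from fetch_all_rows()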


anuunchin added the community label on Sep 6, 2024
rudolfix (Collaborator) commented Sep 8, 2024

@gnilrets thanks for the feedback! Could you tell us how those merges should work? Currently, all primary keys that are not present in the input dataset are retired. How would you limit that? We have merge_key, which could be used to do that, i.e., you'd be able to run scd2 for a given day or other partition. Is this what you mean? Or should we not do hard deletes at all (so your dataset can only be inserted or updated)?
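
A hypothetical sketch of that merge_key idea, in which only the partitions present in the load would be eligible for retirement (the date column and single-day extract are assumptions, not an existing API guarantee at the time of this discussion):

import dlt

def fetch_rows_for_day(day):
    # Stand-in for re-extracting a single day's partition.
    return [{"id": 1, "date": day, "name": "example"}]

@dlt.resource(
    merge_key="date",  # assumption: retirement scoped to the "date" partitions in the load
    write_disposition={"disposition": "merge", "strategy": "scd2"},
)
def dim_data():
    yield from fetch_rows_for_day("2024-09-08")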

gnilrets (Author) commented Sep 9, 2024

Yeah, in this case, just don't do hard deletes.

rudolfix (Collaborator) commented

@jorritsandbrink do you have a good idea on how to implement this in terms of the user interface? Add an append strategy? Or use merge_key to limit which records are retired (empty key - no records are retired), etc.?

jorritsandbrink (Collaborator) commented

@rudolfix

> Add an append strategy?

Not a good fit, because updates are involved.

> Use merge_key to limit which records are retired (empty key - no records are retired), etc.?

Not very intuitive.

My suggestion:

import dlt

@dlt.resource(
    # "retire_if_absent" is the proposed new option for skipping
    # retirement of records absent from the load.
    write_disposition={"disposition": "merge", "strategy": "scd2", "retire_if_absent": False}
)
def my_incremental_dim_data():
    ...

where retire_if_absent defaults to True.

This is consistent with the other scd2 config options (active_record_timestamp, validity_column_names, etc.).
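
For reference, those existing options live in the same write_disposition dict that retire_if_absent would extend; a sketch with illustrative values:

import dlt

@dlt.resource(
    write_disposition={
        "disposition": "merge",
        "strategy": "scd2",
        # Rename the generated validity columns (defaults are _dlt_valid_from/_dlt_valid_to).
        "validity_column_names": ["valid_from", "valid_to"],
        # Use a high date instead of NULL to mark active records.
        "active_record_timestamp": "9999-12-31",
    }
)
def dim_data():
    yield {"id": 1, "name": "example"}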

gnilrets (Author) commented

Yep, @jorritsandbrink's suggestion is what I was thinking too.
