Support incremental loading for scd2 #1789

Closed · gnilrets opened this issue Sep 6, 2024 · 5 comments · Fixed by #1818

gnilrets commented Sep 6, 2024

Feature description

Currently, the scd2 write disposition requires running a full extract of the source data; in other words, it does not support incremental loads. The main justification for this seems to be that a full extract is the only way to detect hard deletes. While true, there are plenty of situations where we would like to perform incremental scd2 loading and hard deletes are not a concern. At the moment, our workaround is to do a regular incremental merge load followed by a dbt snapshot, but that effectively doubles the storage needed in our warehouse and introduces an extra computational step.
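
For illustration, the merge half of that workaround might look something like the sketch below (the cursor column, primary key, and fetch_rows stub are assumptions; the dbt snapshot would run downstream):

import dlt

def fetch_rows(since):
    # Stand-in for a rate-limited SaaS API client that returns only
    # rows changed after the given cursor value.
    return [{"id": 1, "updated_at": "2024-09-06T00:00:00+00:00", "name": "example"}]

@dlt.resource(primary_key="id", write_disposition="merge")
def my_dim_data(updated_at=dlt.sources.incremental("updated_at")):
    # Incremental merge load: only changed rows are extracted and
    # upserted by primary key; history must then be built separately.
    yield from fetch_rows(since=updated_at.last_value)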

Are you a dlt user?

Yes, I run dlt in production.

Use case

We have a SaaS product containing large, mutable datasets, and the vendor enforces rate limiting. Doing regular full extracts daily (or more frequently) is neither feasible nor cost-effective. Having an incremental scd2 solution, even one that didn't manage hard deletes, would be very valuable to us.

Proposed solution

The scd2 write disposition should support incremental merge, with the caveat that hard deletes are not detected when loading incrementally.
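
For contrast, scd2 as it stands assumes every load carries the complete dataset; a minimal sketch of current usage (the resource name and fetch_all_rows stub are illustrative):

import dlt

def fetch_all_rows():
    # Stand-in for a full extract of the source table.
    return [{"id": 1, "name": "example"}]

@dlt.resource(write_disposition={"disposition": "merge", "strategy": "scd2"})
def dim_data():
    # Any natural key absent from this load has its validity window
    # closed (retired), which is why a full extract is required today.
    yield from fetch_all_rows()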


anuunchin added the community label on Sep 6, 2024
rudolfix (Collaborator) commented Sep 8, 2024

@gnilrets thanks for the feedback! Could you tell us how those merges should work? Currently, all primary keys that are not present in the input dataset are retired. How would you limit that? We have merge_key, which could be used to do that, i.e., you'd be able to run scd2 for a given day or other partition. Is this what you mean? Or should we not do hard deletes at all (so your dataset can only be inserted or updated)?
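
A hypothetical sketch of that merge_key idea, in which only the partitions present in the load would be eligible for retirement (the date column and single-day extract are assumptions, not an existing API guarantee at the time of this discussion):

import dlt

def fetch_rows_for_day(day):
    # Stand-in for re-extracting a single day's partition.
    return [{"id": 1, "date": day, "name": "example"}]

@dlt.resource(
    merge_key="date",  # assumption: retirement scoped to the "date" partitions in the load
    write_disposition={"disposition": "merge", "strategy": "scd2"},
)
def dim_data():
    yield from fetch_rows_for_day("2024-09-08")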

gnilrets (Author) commented Sep 9, 2024

Yeah, in this case, just don't do hard deletes.

rudolfix (Collaborator) commented

@jorritsandbrink do you have a good idea on how to implement this in terms of the user interface? Add an append strategy? Or use merge_key to limit which records are retired (empty key - no records are retired), etc.?

jorritsandbrink (Collaborator) commented

@rudolfix

> Add an append strategy?

Not a good fit, because updates are involved.

> Use merge_key to limit which records are retired (empty key - no records are retired), etc.?

Not very intuitive.

My suggestion:

import dlt

@dlt.resource(
    # "retire_if_absent" is the proposed new option for skipping
    # retirement of records absent from the load.
    write_disposition={"disposition": "merge", "strategy": "scd2", "retire_if_absent": False}
)
def my_incremental_dim_data():
    ...

where retire_if_absent defaults to True.

This is consistent with the other scd2 config options (active_record_timestamp, validity_column_names, etc.).
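
For reference, those existing options live in the same write_disposition dict that retire_if_absent would extend; a sketch with illustrative values:

import dlt

@dlt.resource(
    write_disposition={
        "disposition": "merge",
        "strategy": "scd2",
        # Rename the generated validity columns (defaults are _dlt_valid_from/_dlt_valid_to).
        "validity_column_names": ["valid_from", "valid_to"],
        # Use a high date instead of NULL to mark active records.
        "active_record_timestamp": "9999-12-31",
    }
)
def dim_data():
    yield {"id": 1, "name": "example"}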

gnilrets (Author) commented

Yep, @jorritsandbrink's suggestion is what I was thinking too.
