Merge Records Based on Individual Record's Last Modified Timestamp #2145

Open
trin94 opened this issue Dec 13, 2024 · 2 comments
Assignees
Labels
question Further information is requested

Comments

@trin94

trin94 commented Dec 13, 2024

Feature description

We would like to incrementally update individual rows based on a LastModified timestamp at row level.

Are you a dlt user?

Yes, I run dlt in production.

Use case

We use the filesystem module to incrementally load data into a database, determined by each file's modification_date.
Now, we want to add another condition to filter out outdated data.

Our data is stored in jsonl files, each containing 1-n individual records. Each record has a unique primary key and an individual LastModified timestamp. We would like to update each row in the database only if a new record has a more recent LastModified timestamp for the same primary key.
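The requested per-record rule can be sketched in plain Python, independent of dlt. This is only an illustration of the desired semantics: the helper `merge_latest`, the `id` key, and the `value` field are hypothetical names, and timestamps are assumed to be ISO 8601 strings.

```python
from datetime import datetime


def merge_latest(current, incoming, key="id", ts="LastModified"):
    """Return a new dict keyed by primary key, keeping the newest record per key.

    `current` maps primary key -> record; `incoming` is an iterable of records.
    The field names `id` and `LastModified` are assumptions for this sketch.
    """
    merged = dict(current)  # copy so the caller's state is not mutated
    for record in incoming:
        existing = merged.get(record[key])
        is_newer = existing is None or (
            datetime.fromisoformat(record[ts]) > datetime.fromisoformat(existing[ts])
        )
        if is_newer:
            merged[record[key]] = record
    return merged


# State already in the database: key 1 was last modified on Dec 10.
db = {1: {"id": 1, "LastModified": "2024-12-10T00:00:00", "value": "new"}}

# A later batch containing an older version of key 1 and a first version of key 2.
batch = [
    {"id": 1, "LastModified": "2024-12-01T00:00:00", "value": "stale"},
    {"id": 2, "LastModified": "2024-12-05T00:00:00", "value": "first"},
]

result = merge_latest(db, batch)
```

Here the stale version of key 1 is ignored while key 2 is inserted, which is the behaviour described above.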

We tried implementing this via the dlt.sources.incremental functionality, but as far as we understand, it tracks a single LastModified value for the entire table rather than one per record, which is what we need.

As we receive data in batches and cannot control the order of updates to individual rows, this is not sufficient.
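To illustrate why a single table-wide cursor falls short with out-of-order batches, here is a deliberately simplified model. It is not dlt's actual implementation; `table_level_filter` and the record fields are hypothetical, and timestamps are assumed to be ISO 8601 strings in one format so that string comparison matches chronological order.

```python
def table_level_filter(batches, ts="LastModified"):
    """Simplified model: keep only rows newer than one table-wide high-water mark."""
    high_water = ""  # empty string sorts before any ISO 8601 timestamp
    accepted = []
    for batch in batches:
        # Rows at or below the mark left by earlier batches are dropped.
        accepted.extend(r for r in batch if r[ts] > high_water)
        if batch:
            high_water = max(high_water, max(r[ts] for r in batch))
    return accepted


batches = [
    # First batch moves the table-wide mark to Dec 10.
    [{"id": 1, "LastModified": "2024-12-10T00:00:00"}],
    # Second batch carries key 2, modified Dec 5: possibly the newest
    # version of key 2, yet older than the table-wide mark.
    [{"id": 2, "LastModified": "2024-12-05T00:00:00"}],
]

accepted = table_level_filter(batches)
accepted_ids = [r["id"] for r in accepted]
```

In this model the row for key 2 never reaches the destination, even though nothing newer exists for that key.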

Proposed solution

No response

Related issues

No response

@sh-rp
Collaborator

sh-rp commented Dec 16, 2024

Hey @trin94, if you just merge all incoming data on the primary key, would this not work? Or do batches sometimes arrive with rows whose last_modified timestamps are older than the ones already in the db?

@sh-rp sh-rp added the question Further information is requested label Dec 16, 2024
@sh-rp sh-rp self-assigned this Dec 16, 2024
@sh-rp sh-rp moved this from Todo to Planned in dlt core library Dec 16, 2024
@frank-engelen

Hi @sh-rp, unfortunately, yes, that is possible. In our use case (@trin94's and mine), we sometimes receive a JSONL file whose modification_date is newer than that of all previous files. Some records in the file may have a LastModified timestamp that is newer than what is currently in the database for their primary key (PK), and these need to be loaded. However, other records in the same file might not be the most recent version for their PK, because a later version (with a newer LastModified timestamp) is already in the database. Those records need to be ignored.
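At the SQL level, the scenario above corresponds to a conditional upsert. The following is a minimal sketch using Python's built-in `sqlite3` module, not dlt itself; the table `records` and its columns are hypothetical, and it assumes SQLite 3.24 or later and ISO 8601 timestamp strings in one format, so that string comparison matches chronological order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records ("
    "id INTEGER PRIMARY KEY, last_modified TEXT NOT NULL, value TEXT)"
)
# Existing state: key 1 is recent, key 2 is old.
conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?)",
    [
        (1, "2024-12-10T00:00:00", "current"),
        (2, "2024-12-01T00:00:00", "old"),
    ],
)

# Insert new keys; overwrite an existing key only if the incoming row is newer.
UPSERT = """
INSERT INTO records (id, last_modified, value)
VALUES (?, ?, ?)
ON CONFLICT(id) DO UPDATE SET
    last_modified = excluded.last_modified,
    value = excluded.value
WHERE excluded.last_modified > records.last_modified
"""

# One file mixing a stale version of key 1 with a newer version of key 2.
batch = [
    (1, "2024-12-05T00:00:00", "stale"),
    (2, "2024-12-08T00:00:00", "updated"),
]
conn.executemany(UPSERT, batch)

rows = conn.execute("SELECT id, value FROM records ORDER BY id").fetchall()
```

After the batch, key 1 keeps its existing row and key 2 is replaced by the newer one.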

Projects
Status: Planned
Development

No branches or pull requests

3 participants