[CT-2563] [Bug] Incremental updates using unique_key result in duplicates if fields in the unique_key are null #159
Comments
Thanks for opening this @amardatar! Your proposal for a null-safe `unique_key` comparison makes sense. After all, "there are only two hard things in Computer Science: cache invalidation, naming things, and three-valued logic."

Prior art

I believe this aligns with the opposite of the logic that we see in snapshots for discovering when rows are different ("is distinct from").

Related feature request

We have an open feature request for an ergonomic implementation of `is_not_distinct_from`. If we were to implement it, then your update might* be as simple as:

`{{ dbt.is_not_distinct_from(source_key, target_key) }}`

* depending on the final macro name, and assuming you already have Jinja variables for `source_key` and `target_key`
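As a quick illustration (a sketch, not part of the comment above): under standard SQL's three-valued logic a plain equality never evaluates to true when either side is null, whereas `is not distinct from` (supported by Postgres, among others) treats two nulls as equal, which is the behaviour such a macro would wrap:

```sql
-- Illustrative only: equality vs. null-safe comparison under three-valued logic.
select
    null = null                     as plain_equality,  -- null (treated as false in a where clause)
    null is not distinct from null  as null_safe_equal, -- true
    1    is distinct from null      as distinct_check   -- true
;
```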
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.
Commenting to keep this open - I believe it's still relevant, and there's a PR for it.
@dbeatty10 @martynydbt, is there any update on this?
Is this a new bug in dbt-core?
Current Behavior
When dbt runs an incremental update which uses a `unique_key`, if that `unique_key` has fields which are null, dbt will insert "duplicate" rows.

Expected Behavior
Null fields should not result in duplicate rows; the existing row should be overwritten when an equivalent row (treating a null key field as equal to another null) is available in the update.
Steps To Reproduce
Build an incremental model whose `unique_key` includes a nullable field, then run it twice; an example is sketched below with SQL and YAML.
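A minimal sketch of such a setup, assuming a hypothetical `orders_incremental` model whose two-column `unique_key` includes the nullable `coupon_code` field:

```sql
-- models/orders_incremental.sql (hypothetical model and column names)
select
    order_id,
    coupon_code,   -- nullable, but part of the unique_key
    amount,
    updated_at
from {{ ref('stg_orders') }}

{% if is_incremental() %}
  -- only pick up rows that changed since the last successful run
  where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```

```yaml
# dbt_project.yml excerpt (hypothetical project and model names)
models:
  my_project:
    orders_incremental:
      +materialized: incremental
      +unique_key: ['order_id', 'coupon_code']
```

With this setup, the incremental update matches existing rows to incoming rows by plain equality on `order_id` and `coupon_code`, so existing rows where `coupon_code` is null are never matched.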
A first run will produce a table with an initial set of rows, and a second run will update that table; however, the stale row whose key field is null is left in place when it should have been deleted.
Relevant log output
No response
Environment
Which database adapter are you using with dbt?
redshift
Additional Context
I've chosen the Bug template for this, but I don't know if it counts as a bug or a feature (or something else).
Broader context:
In most databases, null comparison is done via the `is null` operator, and `value = null` will return `null` (which will be treated as `false`). Databases typically don't allow nullable fields in their primary key, and from a bit of research that appears to be the main reason why.

In some senses, the `unique_key` option in dbt is a corollary to a primary key on a database table. In practice, dbt tends to be used in data warehouse scenarios, which often don't enforce (or even allow) primary keys, and it's often the case that, by necessity, the source tables being transformed will include nullable fields in what would be considered their unique key. Application designers can usually handle this by adding an extra field that acts as an actual primary key; analytics engineers typically don't have this freedom and are required to use the data as it exists upstream. As such, I suspect this behaviour would be preferred by a number of other analytics teams, as it would allow them to use incremental models in scenarios like this.

Issue details:
The issue itself boils down to how the deletion is handled during an incremental merge. At present, the deletion uses the condition `{{ target }}.{{ key }} = {{ source }}.{{ key }}`. The change I'm proposing is essentially to change this to `({{ target }}.{{ key }} = {{ source }}.{{ key }} or {{ target }}.{{ key }} is null and {{ source }}.{{ key }} is null)`.
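To make that concrete, here is a minimal sketch (not the actual dbt macro source) of how the match predicate could be rendered null-safely for a multi-column `unique_key`; `unique_key_columns`, `target`, and `source` are assumed Jinja variables:

```sql
{# Sketch only: render a null-safe match predicate over each unique_key column.
   Two nulls count as a match, so stale rows with null key fields are deleted
   before the replacement rows are inserted. #}
{% for key in unique_key_columns %}
    (
        {{ target }}.{{ key }} = {{ source }}.{{ key }}
        or ({{ target }}.{{ key }} is null and {{ source }}.{{ key }} is null)
    )
    {%- if not loop.last %} and {% endif %}
{% endfor %}
```

On adapters that support it, the same per-column predicate can be written more compactly as `{{ target }}.{{ key }} is not distinct from {{ source }}.{{ key }}`.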
Suggestions:

Assuming this is an issue others face as well, and that there's a desire to implement a change for it, I'd imagine this could be done either by:

- adding an opt-in config for models that use a `unique_key`, so users can indicate that this is the behaviour they're expecting (a sketch of such a config follows this list); or
- allowing `null` values to be compared using the above method if any fields in the `unique_key` are nullable.
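If the opt-in route were taken, the model config could look something like the following sketch; the `null_safe_unique_key` option name is purely hypothetical and does not exist in dbt today:

```yaml
# dbt_project.yml excerpt -- hypothetical option, shown for illustration only
models:
  my_project:
    orders_incremental:
      +materialized: incremental
      +unique_key: ['order_id', 'coupon_code']
      +null_safe_unique_key: true  # opt in to treating null = null as a match
```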