[CT-2563] [Bug] Incremental updates using unique_key result in duplicates if fields in the unique_key are null #159
Comments
Thanks for opening this @amardatar! Your proposal for a null-safe `unique_key` comparison makes sense. After all, "there are only two hard things in Computer Science: cache invalidation, naming things, and three-valued logic."

Prior art

I believe this aligns with the opposite of the logic that we see in snapshots for discovering when rows are different ("is distinct from").

Related feature request

We have an open feature request for an ergonomic implementation of `is_not_distinct_from`. If we were to implement it, then your update might* be as simple as:

`{{ dbt.is_not_distinct_from(source_key, target_key) }}`

* depending on the final macro name, and assuming you already have Jinja variables for `source_key` and `target_key`
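As a quick illustration (a sketch, not part of the comment above): under standard SQL's three-valued logic a plain equality never evaluates to true when either side is null, whereas `is not distinct from` (supported by Postgres, among others) treats two nulls as equal, which is the behaviour such a macro would wrap:

```sql
-- Illustrative only: equality vs. null-safe comparison under three-valued logic.
select
    null = null                     as plain_equality,  -- null (treated as false in a where clause)
    null is not distinct from null  as null_safe_equal, -- true
    1    is distinct from null      as distinct_check   -- true
;
```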
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.
Commenting to keep this open - I believe it's still relevant, and there's a PR for it.
@dbeatty10 @martynydbt, is there any update on this?
Is this a new bug in dbt-core?
Current Behavior
When dbt runs an incremental update which uses a `unique_key`, if that `unique_key` has fields which are null, dbt will insert "duplicate" rows.

Expected Behavior
Null fields should not result in duplicate rows; the existing row should be overwritten when an equivalent row (treating a null key field as equal to another null) is available in the update.
Steps To Reproduce
Build an incremental model whose `unique_key` includes a nullable field, then run it twice; an example is sketched below with SQL and YAML.
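A minimal sketch of such a setup, assuming a hypothetical `orders_incremental` model whose two-column `unique_key` includes the nullable `coupon_code` field:

```sql
-- models/orders_incremental.sql (hypothetical model and column names)
select
    order_id,
    coupon_code,   -- nullable, but part of the unique_key
    amount,
    updated_at
from {{ ref('stg_orders') }}

{% if is_incremental() %}
  -- only pick up rows that changed since the last successful run
  where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```

```yaml
# dbt_project.yml excerpt (hypothetical project and model names)
models:
  my_project:
    orders_incremental:
      +materialized: incremental
      +unique_key: ['order_id', 'coupon_code']
```

With this setup, the incremental update matches existing rows to incoming rows by plain equality on `order_id` and `coupon_code`, so existing rows where `coupon_code` is null are never matched.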
A first run will produce a table with an initial set of rows, and a second run will update that table; however, the stale row whose key field is null is left in place when it should have been deleted.
Relevant log output
No response
Environment
Which database adapter are you using with dbt?
redshift
Additional Context
I've chosen the Bug template for this, but I don't know if it counts as a bug or a feature (or something else).
Broader context:
In most databases, null comparison is done via the `is null` operator, and `value = null` will return `null` (which will be treated as `false`). Databases typically don't allow nullable fields in their primary key, and from a bit of research that appears to be the main reason why.

In some senses, the `unique_key` option in dbt is a corollary to a primary key on a database table. In practice, dbt tends to be used in data warehouse scenarios, which often don't enforce (or even allow) primary keys, and it's often the case that, by necessity, the source tables being transformed will include nullable fields in what would be considered their unique key. Application designers can usually handle this by adding an extra field that acts as an actual primary key; analytics engineers typically don't have this freedom and are required to use the data as it exists upstream. As such, I suspect this behaviour would be preferred by a number of other analytics teams, as it would allow them to use incremental models in scenarios like this.

Issue details:
The issue itself boils down to how the deletion is handled during an incremental merge. At present, the deletion uses the condition `{{ target }}.{{ key }} = {{ source }}.{{ key }}`. The change I'm proposing is essentially to change this to `({{ target }}.{{ key }} = {{ source }}.{{ key }} or {{ target }}.{{ key }} is null and {{ source }}.{{ key }} is null)`.
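To make that concrete, here is a minimal sketch (not the actual dbt macro source) of how the match predicate could be rendered null-safely for a multi-column `unique_key`; `unique_key_columns`, `target`, and `source` are assumed Jinja variables:

```sql
{# Sketch only: render a null-safe match predicate over each unique_key column.
   Two nulls count as a match, so stale rows with null key fields are deleted
   before the replacement rows are inserted. #}
{% for key in unique_key_columns %}
    (
        {{ target }}.{{ key }} = {{ source }}.{{ key }}
        or ({{ target }}.{{ key }} is null and {{ source }}.{{ key }} is null)
    )
    {%- if not loop.last %} and {% endif %}
{% endfor %}
```

On adapters that support it, the same per-column predicate can be written more compactly as `{{ target }}.{{ key }} is not distinct from {{ source }}.{{ key }}`.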
Suggestions:

Assuming this is an issue others face as well, and that there's a desire to implement a change for it, I'd imagine this could be done either by:

- adding an opt-in config for models that use a `unique_key`, so users can indicate that this is the behaviour they're expecting (a sketch of such a config follows this list); or
- allowing `null` values to be compared using the above method if any fields in the `unique_key` are nullable.
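If the opt-in route were taken, the model config could look something like the following sketch; the `null_safe_unique_key` option name is purely hypothetical and does not exist in dbt today:

```yaml
# dbt_project.yml excerpt -- hypothetical option, shown for illustration only
models:
  my_project:
    orders_incremental:
      +materialized: incremental
      +unique_key: ['order_id', 'coupon_code']
      +null_safe_unique_key: true  # opt in to treating null = null as a match
```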