Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] Unit tests not working for incremental models #10686

Closed
2 tasks done
CraigWilson-ZOE opened this issue Sep 10, 2024 · 29 comments
Closed
2 tasks done

[Bug] Unit tests not working for incremental models #10686

CraigWilson-ZOE opened this issue Sep 10, 2024 · 29 comments
Labels
bug Something isn't working duplicate This issue or pull request already exists unit tests Issues related to built-in dbt unit testing functionality

Comments

@CraigWilson-ZOE
Copy link

Is this a new bug in dbt-core?

  • I believe this is a new bug in dbt-core
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

When running a unit tests for an incremental model I am getting the following error:

14:41:49    Database Error in unit_test unit_test__int_barcode_scanned_events_incremental_mode (models/silver/intermediate/food_logging/tests.yml)
  Syntax error: Unexpected "." at [18:44]

Expected Behavior

The unit tests to pass

Steps To Reproduce

  1. Environment is:
  • dbt 1.8.6
  • dbt bigquery 1.8.2
  • dbt-adapters 1.5.0
  • dbt-utils 1.3.0
  1. incremental model configuration:
{{
    config(
        materialized='incremental',
        incremental_strategy='insert_overwrite',
        unique_key='event_id',
        partition_by={
            "field": "date_day",
            "data_type": "date"
        }
    )
}}
  1. Unit test config:
unit_tests:
  - name: unit_test__int_barcode_scanned_events_incremental_mode
    description: Unit test for int_barcode_scanned_events in incremental mode
    model: int_barcode_scanned_events
    overrides:
      macros:
        is_incremental: true
    given:
        - input: ref('stg_insights_mixpanel_event')
          format: csv
          fixture: raw_mixpanel_events
        - input: this
          format: csv
          fixture: incremental_int_barcode_scanned_events
    expect:
        format: csv
        fixture: expected__int_barcode_scanned_events

Relevant log output

...
15:06:35  3 of 3 ERROR int_barcode_scanned_events::unit_test__int_barcode_scanned_events_incremental_mode  [ERROR in 1.31s]
...
15:06:58    Database Error in unit_test unit_test__int_barcode_scanned_events_incremental_mode (models/silver/intermediate/food_logging/tests.yml)
  Syntax error: Unexpected "." at [18:44]

Environment

- OS: macos 14.3.1
- Python: Python 3.12.2
- dbt: 1.8.6

Which database adapter are you using with dbt?

bigquery

Additional Context

The SQL that seems to be produced and is causing the error is the following:

/* {"app": "dbt", "dbt_version": "1.8.5", "profile_name": "default", "target_name": "development", "node_id": "unit_test.dbt_bq_elt.int_barcode_scanned_events.unit_test__int_barcode_scanned_events_incremental_mode"} */


      
        
            select distinct
                table_schema,
                table_name,
                
            case table_type
                when 'BASE TABLE' then 'table'
                when 'EXTERNAL TABLE' then 'external'
                when 'MATERIALIZED VIEW' then 'materializedview'
                else lower(table_type)
            end as `table_type`


            from `dbt-development-joinzoe`..INFORMATION_SCHEMA.TABLES
            where lower(table_name) like lower ('')
                and lower(table_name) not like lower ('')
@CraigWilson-ZOE CraigWilson-ZOE added bug Something isn't working triage labels Sep 10, 2024
@dbeatty10 dbeatty10 added the unit tests Issues related to built-in dbt unit testing functionality label Sep 10, 2024
@dbeatty10
Copy link
Contributor

Thanks for reporting this @CraigWilson-ZOE !

If you comment out the unit test completely, does it still give you that error?

The SQL that is creating that syntax error looks like it could be coming from bigquery__get_catalog or bigquery__get_catalog_relations. Did you happen to override either of those macros or any others? If so, which ones?

See below for a simplified version of the situation you described. It worked without error for me, so I'm curious if it works for you or not?

Example

models/int_barcode_scanned_events.sql

{{
    config(
        materialized='incremental',
        incremental_strategy='insert_overwrite',
        unique_key='event_id',
        partition_by={
            "field": "date_day",
            "data_type": "date"
        }
    )
}}

select 1 as event_id, cast('2002-02-02' as date) as date_day

models/_unit_tests.yml

unit_tests:
  - name: unit_test__int_barcode_scanned_events_incremental_mode
    description: Unit test for int_barcode_scanned_events in incremental mode
    model: int_barcode_scanned_events
    overrides:
      macros:
        is_incremental: true
    given:
      - input: this
        rows:
          - {event_id: 1, date_day: 2001-01-01}
    expect:
        rows:
          - {event_id: 1, date_day: 2002-02-02}

Commands to run:

dbt build -s int_barcode_scanned_events

Output:

$ dbt build -s int_barcode_scanned_events    
 
13:41:43  Running with dbt=1.8.6
13:42:10  Registered adapter: bigquery=1.8.2
13:42:11  Found 1 model, 473 macros, 1 unit test
13:42:11  
13:42:27  Concurrency: 10 threads (target='blue')
13:42:27  
13:42:27  1 of 2 START unit_test int_barcode_scanned_events::unit_test__int_barcode_scanned_events_incremental_mode  [RUN]
13:42:33  1 of 2 PASS int_barcode_scanned_events::unit_test__int_barcode_scanned_events_incremental_mode  [PASS in 5.48s]
13:42:33  2 of 2 START sql incremental model dbt_dbeatty.int_barcode_scanned_events ...... [RUN]
13:42:40  2 of 2 OK created sql incremental model dbt_dbeatty.int_barcode_scanned_events . [SCRIPT (56.0 Bytes processed) in 7.21s]
13:42:40  
13:42:40  Finished running 1 unit test, 1 incremental model in 0 hours 0 minutes and 29.38 seconds (29.38s).
13:42:40  
13:42:40  Completed successfully
13:42:40  
13:42:40  Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2

@tsafacjo
Copy link

can I pick it ?

@dbeatty10
Copy link
Contributor

@tsafacjo We're not yet sure if this is a bug in dbt-core or if the root cause is something else.

If you're looking for a place to contribute to dbt, there's a list of issues we've labeled as "help wanted" here.

@tsafacjo
Copy link

ok thanks.

@CraigWilson-ZOE
Copy link
Author

Sorry for the late reply, I have been on vacation.

If you comment out the unit test completely, does it still give you that error?

I have commented out the unit test and I do not get the error anymore.

The SQL that is creating that syntax error looks like it could be coming from bigquery__get_catalog or bigquery__get_catalog_relations. Did you happen to override either of those macros or any others? If so, which ones?

No we do not override any of those macros.

See below for a simplified version of the situation you described. It worked without error for me, so I'm curious if it works for you or not?

I have tried with the super simple version and it passed without error.

{{
    config(
        materialized='incremental',
        incremental_strategy='insert_overwrite',
        unique_key='event_id',
        partition_by={
            "field": "date_day",
            "data_type": "date"
        }
    )
}}

select 1 as event_id, cast('2002-02-02' as date) as date_day
  - name: unit_test__int_barcode_test_increment
    description: Unit test for int_barcode_test
    model: int_barcode_test
    overrides:
      macros:
        is_incremental: true
    given:
      - input: this
        rows:
          - {event_id: 1, date_day: 2001-01-01}
    expect:
        rows:
          - {event_id: 1, date_day: 2002-02-02}

NOTE: I created a new model so I could run both at the same time. My original model still fails with the same error.

@CraigWilson-ZOE
Copy link
Author

Also worth posting here is my packages.yml, I have removed all unecessary packages and down to the bare minimum for the who project:

packages:
  - package: dbt-labs/dbt_utils
    version: 1.3.0
  - package: fivetran/mailchimp_source
    version: [">=0.4.0", "<0.6.0"]
  - package: fivetran/mailchimp
    version: [">=0.7.0", "<0.10.0"]
  - package: fivetran/zendesk_source
    version: [">=0.10.0", "<0.14.0"]
  - package: fivetran/zendesk
    version: [">=0.10.0", "<0.14.0"]
  - package: fivetran/google_ads_source
    version: [">=0.9.1", "<0.11.0"]
  - package: fivetran/google_ads
    version: [">=0.9.1", "<0.11.0"]
  - package: fivetran/facebook_ads
    version: [">=0.6.0", "<0.8.0"]
  - package: fivetran/facebook_ads_source
    version: [">=0.6.0", "<0.8.0"]

@dbeatty10
Copy link
Contributor

I have tried with the super simple version and it passed without error.

Okay! The next step is to try to isolate whichever piece(s) of your model that is triggering the syntax error when it is executed via dbt unit tests.

Are you able to tweak that super simple version so that it is closer and closer to your actual model until you find something that breaks the simplified version?

@devmessias
Copy link
Contributor

Hi @CraigWilson-ZOE, I understand that you are sharing the compiled SQL rather than the model, correct? Could you please share it with me? I’d like to see if it’s similar to the other issues I was looking into in this PR: #10849. If the model has any kind of var, env_var, or other thins that cames from context this can cause this kind of error during the unit test phase

@devmessias
Copy link
Contributor

devmessias commented Oct 16, 2024

Why I'm saying that @CraigWilson-ZOE . Your compiled sql is wrong due the double "..", which will be the result of unit test if you have a model like this

select
    count(*) as result
from {{target.database}}.{{env_var('schema')}}.INFORMATION_SCHEMA.TABLES

result

select
    count(*) as result
from dbt.INFORMATION_SCHEMA.TABLES

@kangshung
Copy link

Are there any updates on this? I have the same problem @dbeatty10

@devmessias
Copy link
Contributor

devmessias commented Oct 22, 2024

I think it's a good idea if you post the model.sql file you're having issues , @kangshung

@dbeatty10 dbeatty10 added the duplicate This issue or pull request already exists label Oct 25, 2024
@dbeatty10
Copy link
Contributor

Oops, I closed it too hastily!

@kangshung as @devmessias said, what you reported in #10686 (comment) is a duplicate of #10433

But we still don't have steps to reproduce for what @CraigWilson-ZOE originally reported in this issue.

@CraigWilson-ZOE does #10686 (comment) look similar to your model that is having the issues with unit testing?

@dbeatty10 dbeatty10 reopened this Oct 25, 2024
@dbeatty10 dbeatty10 added awaiting_response and removed duplicate This issue or pull request already exists labels Oct 25, 2024
@CraigWilson-ZOE
Copy link
Author

Hi All,

Sorry I have been away from my computer for a while.

Now that I am back and had some time to dive into this, it seems like a call to an internal dbt-core macro is failing. The SQL that is being executed is:

            select distinct
                table_schema,
                table_name,
                
            case table_type
                when 'BASE TABLE' then 'table'
                when 'EXTERNAL TABLE' then 'external'
                when 'MATERIALIZED VIEW' then 'materializedview'
                else lower(table_type)
            end as `table_type`


            from `development`..INFORMATION_SCHEMA.TABLES
            where lower(table_name) like lower ('')
                and lower(table_name) not like lower ('')

I believe this is coming from get_relations_by_pattern. We use this to generate some SQL for the incremental model, and it seems like the schema section for this macro isn't getting set within dbt within a unit test. So it isn't directly related to the unit tests, but is a side effect of the overall environment that is setup within dbt during unit test executions.

Is there a reason that the schema wouldn't be set in this instance?

I have checked via logging output of the target and this objects, the this object is empty for all parts of the model information, whereas target object does have the information for the most part.

I do think this is a bug and needs to be fixed otherwise we are saying that some macros cannot be used in a incremental model.

@devmessias
Copy link
Contributor

devmessias commented Oct 29, 2024

Hi @CraigWilson-ZOE, this seems similar to #10891 and #10139. This is not the raw code for the model, right? Can you share the raw code? As I discuss in #10908 maybe we should allow overriding dynamic variables

@devmessias
Copy link
Contributor

In the meantime, testing a simpler model that uses the same macro in a non-incremental model might help. I tend to believe that your issue isn't related to the model being incremental or not

@CraigWilson-ZOE
Copy link
Author

Hi @CraigWilson-ZOE, this seems similar to #10891 and #10139. This is not the raw code for the model, right? Can you share the raw code? As I discuss in #10908 maybe we should allow overriding dynamic variables

Yes it does look the same as #10891 , thanks!

I am not sure about overriding dynamic variables vs just having them work as expected by many people or provide an alternative.

In the meantime, testing a simpler model that uses the same macro in a non-incremental model might help. I tend to believe that your issue isn't related to the model being incremental or not

I agree that it isn't related to it being incremental or not, it is due to the above issue.

@devmessias
Copy link
Contributor

I don't think there is an easy way to do that, as pointed in #8499. The CTE approach will solve some limitations but will impact the performance , and will create somo additional problems according to #8499 (comment)
Overriding dynamic variables using the yaml would be easier to implement as a new feature, and we could still keep the performance for unit tests, similar to the mocked behavior in Python's unittest.

But in your case, if you have just one macro call could you try to override the macro?

@dbeatty10
Copy link
Contributor

@CraigWilson-ZOE could you share your line of code that you are using for dbt_utils.get_relations_by_pattern so we can see the schema argument?

I'm trying to reproduce the same error you are getting, but I'm not able when I use target.schema for the schema within dbt_utils.get_relations_by_pattern.

@devmessias
Copy link
Contributor

There are so many issues whose root cause is this limitation, and users are incorrectly associating it with other things. Wouldn't it be helpful to improve the unit test doc to emphasize this limitation, @dbeatty10?

@CraigWilson-ZOE
Copy link
Author

CraigWilson-ZOE commented Oct 31, 2024

@dbeatty10

Here is a breakdown of the SQL that I have and what is causing the issue. This is a change I made to ensure that getting the latest partition data in BigQuery is much more efficient than doing a query on the target table. This has saved a lot of processing on our large tables.

Macro to get the latest partition info:

{% macro get_latest_partition_value(table, partition_data_type=none) %}
    {% if not execute %}
        {{ return(modules.datetime.date.fromisoformat('2023-01-01')) }}
    {% endif %}
    {% if not dbt_utils.get_relations_by_pattern(table.schema, table.name) %}
        {{ return(modules.datetime.date.fromisoformat('1900-01-01')) }}
    {% endif %}
    {% set node = graph.nodes.get("model." + project_name + "." + table.name) %}
    {% if node is none %}
        {{ return(none) }}
    {% endif %}
    {% if not partition_data_type %}
        {% set raw_partition_by = config.get('partition_by', none) %}
        {% if not raw_partition_by %}
            {% do exceptions.raise_compiler_error("partition_by must be set in the model config to get latest partition value") %}
        {% endif %}
        {% set partition_by = adapter.parse_partition_by(raw_partition_by) %}
        {% set partition_data_type = partition_by.data_type %}
    {% endif %}

    {%- set query -%}
        SELECT MAX(
        {% if partition_data_type == 'date' %}
            PARSE_DATE("%Y%m%d",partition_id)
        {% elif partition_data_type == 'timestamp'%}
            PARSE_TIMESTAMP("%Y%m%d",partition_id)
        {% elif partition_data_type == 'datetime' %}
            PARSE_DATETIME("%Y%m%d",partition_id)
        {% elif partition_data_type == 'int64' %}
            CAST(partition_id AS INT64)
        {% else %}
            {% do exceptions.raise_compiler_error("partition_by data type must be date, timestamp, datetime or int64") %}
        {% endif %}
        )
        FROM {{table.database}}.{{table.schema}}.INFORMATION_SCHEMA.PARTITIONS
        WHERE 
            table_name = '{{table.name}}' 
            AND partition_id IS NOT NULL 
            AND partition_id != '__NULL__' 
            AND partition_id != '__UNPARTITIONED__'
            AND total_rows > 0
    {% endset %}

    {% set results = run_query(query) %}

    {% set partition_max_value = results.columns[0].values()[0] %}
    {{ return(partition_max_value) }}

{% endmacro %}

And it is used like this (example):

{% set partition_date = get_latest_partition_value(this) %}
{{
    config(
        materialized='incremental',
        unique_key='correlation_id',
        partition_by={
            "field": "search_date",
            "data_type": "date"
        },
        cluster_by=["message_type"],
    )
}}

WITH

all_search_logs AS
(
    SELECT DISTINCT
        correlation_id,
        timestamp AS search_time,
        username,
        payload,
        origin_service,
        message_type,
        DATE(timestamp) AS search_date,
        COALESCE(CAST(JSON_EXTRACT_SCALAR(metadata, '$.with_scores') AS BOOL), TRUE) AS with_scores
    FROM
        {{ source('pubsub_dataflow', 'pubsub_to_bigquery_partitioned') }}
    {% if target.name != 'production' %}
        WHERE
            timestamp >= (CAST(DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY) AS TIMESTAMP))
    {% else %}
        {% if is_incremental() and partition_date is not none %}
            WHERE
                timestamp >= TIMESTAMP('{{ partition_date }}')
        {% endif %}
    {% endif %}
)

SELECT * FROM all_search_logs

The table parameter in the macro during the unit test execution is NULL/NONE whereas in normal running or compilation it is correctly set as the target table.

@devmessias
Copy link
Contributor

devmessias commented Oct 31, 2024

This will not work with the current limitations of unit tests. However, you can override the macro since it is being called only once. In that case, having a feature that allows to override a dynamic variable is not necessary.

@dbeatty10
Copy link
Contributor

Thanks for providing those details @CraigWilson-ZOE ! I was able to get the same error as you with a simplified version of your scenario. And I was also able to resolve it by providing an override for the get_latest_partition_value macro in the unit test definition.

Explanation

It looks like these lines are the ones that won't work the way you want when you execute them in a unit test:

        FROM {{table.database}}.{{table.schema}}.INFORMATION_SCHEMA.PARTITIONS
        WHERE 
            table_name = '{{table.name}}' 

There's an explanation in #10891 (comment).

The TLDR is that within the context of unit tests, this is a str rather than a Relation, so things like {{table.schema}} will be empty.

Workaround

Could you try providing an override for your get_latest_partition_value macro like suggested by @devmessias ?

It would look something like this:

  - name: unit_test__int_barcode_scanned_events_incremental_mode
    description: Unit test for int_barcode_scanned_events in incremental mode
    model: int_barcode_scanned_events
    overrides:
      macros:
        is_incremental: true
        get_latest_partition_value: SOMETHING_HERE

@CraigWilson-ZOE
Copy link
Author

Ah that makes a lot of sense, and thanks for the example and explanation. Happy for this to be closed as I think a lot of it is covered in other issues as linked above.

@dbeatty10
Copy link
Contributor

Thanks again for raising this and helping home in on the root causes @CraigWilson-ZOE. Going to close this in favor of #10891 and other relevant issues linked above.

@dbeatty10 dbeatty10 closed this as not planned Won't fix, can't repro, duplicate, stale Nov 4, 2024
@dbeatty10 dbeatty10 added duplicate This issue or pull request already exists and removed triage labels Nov 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working duplicate This issue or pull request already exists unit tests Issues related to built-in dbt unit testing functionality
Projects
None yet
Development

No branches or pull requests

5 participants