[Bug] Unexpected schema caching on transitive column updates resulting in execution failure #1331

github-christophe-oudar · 2024-09-03T22:54:04Z

Is this a new bug in dbt-bigquery?

I believe this is a new bug in dbt-bigquery
I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

Let's consider a model A:

{{
    config(
        materialized='table'
}}
SELECT "foo" as foo

and a model B:

{{
    config(
        materialized='incremental',
        incremental_strategy = 'insert_overwrite',
        on_schema_change='append_new_columns',
        partition_by= {
          'field': 'hour',
          'data_type': 'timestamp',
          'granularity': 'hour',
          "time_ingestion_partitioning": true,
          'copy_partitions': true
        }
    )
}}
SELECT TIMESTAMP(CURRENT_TIMESTAMP()) as hour, *
FROM {{ ref('A') }}

Then let's update

{{
    config(
        materialized='table'
}}
SELECT "foo" as foo, "bar" as bar

Running again both models soon (< 1h for standard BQ edition, < 24h for enterprise edition) would fail on the model B execution saying that the insertion failed with:
Inserted row has wrong column count; Has 3, expected 2 at ...

Expected Behavior

You would expect the model B to be successful.

Now my gut feeling is that: we have this column scanning helper request that is requesting model B with a WHERE false LIMIT 0 which works great... but is apparently cached by BigQuery.
And usually BigQuery manages to figure out it should bust the cache... but practically, it isn't apparently.
BigQuery has an open issue for the problem: https://issuetracker.google.com/issues/343009002

So the fix would be to enforce NOT using the caching on that request which is pretty much free anyway.
I guess it should be reported to Google BigQuery team as well as it's a cache busting issue but I'm not sure how easy or hard it is to fix.

For company with BQ Enterprise editions, it's actually even more frustrating as the cache is 24h long.

So far, the workaround is either:

Not use a SELECT * so that downstream models are updated and the request wouldn't be cached
Rerun the query with WHERE false LIMIT 0 manually without the cache

It is at least affecting the time_ingestion_partitioning but maybe it could affect other modes for when we perform schema comparison for on_schema_change needs.

Steps To Reproduce

See "Current behavior"

It's also doable in pure SQL:

CREATE OR REPLACE TABLE `my_project.my_dataset.model_a`
AS select "foo" as foo

✅

CREATE OR REPLACE TABLE `my_project.my_dataset.model_b`
AS select * from `my_project.my_dataset.model_a`

✅

select * from `my_project.my_dataset.model_a` WHERE FALSE LIMIT 0

✅

CREATE OR REPLACE TABLE `my_project.my_dataset.model_a`
AS select "foo" as foo, "bar" as bar

✅

select * from `my_project.my_dataset.model_a` WHERE FALSE LIMIT 0

🔴 the result of the query is cached with the previous schema

Relevant log output

No response

Environment

- OS: MacOS
- Python: 3.11.9
- dbt-core: 1.8.2
- dbt-bigquery: 1.8.3

Additional Context

No response

The text was updated successfully, but these errors were encountered:

amychen1776 · 2024-09-05T20:49:18Z

@github-christophe-oudar have you reported this to the BQ team/opened up a support ticket with GCP? I'm not opposed to trying to resolve this on the adapter but it seems like the source is BQ and ideally it is resolved there.

github-christophe-oudar · 2024-09-06T08:22:24Z

It's already opened on BQ side (for few months), you can see it on https://issuetracker.google.com/issues/343009002

amychen1776 · 2024-09-09T12:43:58Z

@github-christophe-oudar yes - I also recommend opening up an actual ticket directly with your account team to push on this. Tickets are how we can push with our GCP contacts. Since this is directly related to a GCP issue, I'm going to close this ticket for now (appreciate you opening up this issue to flag it to us! We are also going to share it with our GCP contacts too)

github-christophe-oudar · 2024-09-09T13:33:23Z

I'll see what I can do but then we don't have any control on the ETA for a fix.

github-christophe-oudar added type:bug Something isn't working triage:product labels Sep 3, 2024

amychen1776 added triage:awaiting-response and removed triage:product labels Sep 5, 2024

github-actions bot added triage:product and removed triage:awaiting-response labels Sep 6, 2024

amychen1776 closed this as not planned Won't fix, can't repro, duplicate, stale Sep 9, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Bug] Unexpected schema caching on transitive column updates resulting in execution failure #1331

[Bug] Unexpected schema caching on transitive column updates resulting in execution failure #1331

github-christophe-oudar commented Sep 3, 2024 •

edited

Loading

amychen1776 commented Sep 5, 2024

github-christophe-oudar commented Sep 6, 2024

amychen1776 commented Sep 9, 2024

github-christophe-oudar commented Sep 9, 2024

[Bug] Unexpected schema caching on transitive column updates resulting in execution failure #1331

[Bug] Unexpected schema caching on transitive column updates resulting in execution failure #1331

Comments

github-christophe-oudar commented Sep 3, 2024 • edited Loading

Is this a new bug in dbt-bigquery?

Current Behavior

Expected Behavior

Steps To Reproduce

Relevant log output

Environment

Additional Context

amychen1776 commented Sep 5, 2024

github-christophe-oudar commented Sep 6, 2024

amychen1776 commented Sep 9, 2024

github-christophe-oudar commented Sep 9, 2024

github-christophe-oudar commented Sep 3, 2024 •

edited

Loading