Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] Unexpected schema caching on transitive column updates resulting in execution failure #1331

Closed
2 tasks done
github-christophe-oudar opened this issue Sep 3, 2024 · 4 comments
Labels
bug Something isn't working triage

Comments

@github-christophe-oudar
Copy link
Contributor

github-christophe-oudar commented Sep 3, 2024

Is this a new bug in dbt-bigquery?

  • I believe this is a new bug in dbt-bigquery
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

Let's consider a model A:

{{
    config(
        materialized='table'
}}
SELECT "foo" as foo

and a model B:

{{
    config(
        materialized='incremental',
        incremental_strategy = 'insert_overwrite',
        on_schema_change='append_new_columns',
        partition_by= {
          'field': 'hour',
          'data_type': 'timestamp',
          'granularity': 'hour',
          "time_ingestion_partitioning": true,
          'copy_partitions': true
        }
    )
}}
SELECT TIMESTAMP(CURRENT_TIMESTAMP()) as hour, *
FROM {{ ref('A') }}

Then let's update

{{
    config(
        materialized='table'
}}
SELECT "foo" as foo, "bar" as bar

Running again both models soon (< 1h for standard BQ edition, < 24h for enterprise edition) would fail on the model B execution saying that the insertion failed with:
Inserted row has wrong column count; Has 3, expected 2 at ...

Expected Behavior

You would expect the model B to be successful.

Now my gut feeling is that: we have this column scanning helper request that is requesting model B with a WHERE false LIMIT 0 which works great... but is apparently cached by BigQuery.
And usually BigQuery manages to figure out it should bust the cache... but practically, it isn't apparently.
BigQuery has an open issue for the problem: https://issuetracker.google.com/issues/343009002

So the fix would be to enforce NOT using the caching on that request which is pretty much free anyway.
I guess it should be reported to Google BigQuery team as well as it's a cache busting issue but I'm not sure how easy or hard it is to fix.

For company with BQ Enterprise editions, it's actually even more frustrating as the cache is 24h long.

So far, the workaround is either:

  • Not use a SELECT * so that downstream models are updated and the request wouldn't be cached
  • Rerun the query with WHERE false LIMIT 0 manually without the cache

It is at least affecting the time_ingestion_partitioning but maybe it could affect other modes for when we perform schema comparison for on_schema_change needs.

Steps To Reproduce

See "Current behavior"

It's also doable in pure SQL:

CREATE OR REPLACE TABLE `my_project.my_dataset.model_a`
AS select "foo" as foo

CREATE OR REPLACE TABLE `my_project.my_dataset.model_b`
AS select * from `my_project.my_dataset.model_a`

select * from `my_project.my_dataset.model_a` WHERE FALSE LIMIT 0

CREATE OR REPLACE TABLE `my_project.my_dataset.model_a`
AS select "foo" as foo, "bar" as bar

select * from `my_project.my_dataset.model_a` WHERE FALSE LIMIT 0

🔴 the result of the query is cached with the previous schema

Relevant log output

No response

Environment

- OS: MacOS
- Python: 3.11.9
- dbt-core: 1.8.2
- dbt-bigquery: 1.8.3

Additional Context

No response

@github-christophe-oudar github-christophe-oudar added bug Something isn't working triage labels Sep 3, 2024
@amychen1776
Copy link

@github-christophe-oudar have you reported this to the BQ team/opened up a support ticket with GCP? I'm not opposed to trying to resolve this on the adapter but it seems like the source is BQ and ideally it is resolved there.

@github-christophe-oudar
Copy link
Contributor Author

It's already opened on BQ side (for few months), you can see it on https://issuetracker.google.com/issues/343009002

@amychen1776
Copy link

@github-christophe-oudar yes - I also recommend opening up an actual ticket directly with your account team to push on this. Tickets are how we can push with our GCP contacts. Since this is directly related to a GCP issue, I'm going to close this ticket for now (appreciate you opening up this issue to flag it to us! We are also going to share it with our GCP contacts too)

@amychen1776 amychen1776 closed this as not planned Won't fix, can't repro, duplicate, stale Sep 9, 2024
@github-christophe-oudar
Copy link
Contributor Author

I'll see what I can do but then we don't have any control on the ETA for a fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working triage
Projects
None yet
Development

No branches or pull requests

2 participants