
[Feature] Implement batch metadata freshness using INFORMATION_SCHEMA.TABLE_STORAGE instead of client.get_table #1239

Open
jtcohen6 opened this issue May 15, 2024 · 3 comments
Labels: enhancement (New feature or request), help_wanted (Extra attention is needed)

Comments


jtcohen6 commented May 15, 2024

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt-bigquery functionality, rather than a Big Idea better suited to a discussion

Describe the feature

Move from running a query per table to performing a batch operation.

Running a single query against INFORMATION_SCHEMA.TABLE_STORAGE appears to be the most efficient way to calculate freshness for all sources in a project that do not define a custom loaded_at_field. Evaluate whether there is a Python client API for this, or whether we need to run the query ourselves.
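
For illustration, here is a minimal sketch of the kind of query this could run. The project, region, and source names below are placeholders, and the exact columns and filters used in the spike may differ:

```sql
-- Hypothetical batch freshness query: one statement covers every selected
-- source that has no custom loaded_at_field. TABLE_STORAGE is a
-- region-scoped view, so the project and region are qualified explicitly.
select
  table_schema,
  table_name,
  storage_last_modified_time as last_modified,
  current_timestamp() as snapshotted_at
from `my-gcp-project.region-us.INFORMATION_SCHEMA.TABLE_STORAGE`
where
  (table_schema = 'raw_jaffle_shop' and table_name = 'orders')
  or (table_schema = 'raw_jaffle_shop' and table_name = 'customers')
  or (table_schema = 'raw_stripe' and table_name = 'payments')
```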

Copying from #938 (comment):

I did some further spiking, and after some tweaking I think we have a promising path forward in dbt-bigquery, without requiring any changes to the base dbt-adapters spec or dbt-core.

An initial naive implementation of bigquery__get_relation_last_modified (3b28625), following the same approach we take in dbt-snowflake and dbt-redshift, was an improvement on the current SDK-based implementation; however, the query time still scaled linearly with the number of selected sources. From some simple testing with a project containing 50 sources, the batch implementation took ~9s, while the non-batch SDK approach took ~20s. With 100 sources, the batch implementation took ~12.5s, while the non-batch SDK approach took ~28s.

However, filtering within the metadata macro by relation schema name, in addition to filtering for exact schema & table name matches (51243e5), does the trick in making the query time constant: both a project with 50 sources and one with 100 sources took just under 5s to complete dbt source freshness.
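
Concretely, the change amounts to adding a coarse schema-level predicate in front of the exact pair matches, something like the following (again with placeholder names; this is a hedged sketch, not the commit's code):

```sql
where table_schema in ('raw_jaffle_shop', 'raw_stripe')  -- coarse pre-filter on schema
  and (
    (table_schema = 'raw_jaffle_shop' and table_name = 'orders')
    or (table_schema = 'raw_jaffle_shop' and table_name = 'customers')
    or (table_schema = 'raw_stripe' and table_name = 'payments')
  )
```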

And from #938 (comment):

Other considerations for this approach:

  • BigQuery has a character length limit for queries, so templating many thousands of source table names into the query risks hitting that limit. (This is still better than loading lots of metadata into memory about tables we don't care about, as a result of not pushing the predicates down into BQ.) We don't have an existing pattern for running parameterized queries.
  • Querying INFORMATION_SCHEMA.TABLE_STORAGE is not free. In our spike, with a few thousand sources, each query scanned 2GB — not a lot, but potentially more in very large datasets. By comparison, the current approach in main (Python SDK) is free.
  • I believe this would require one information schema query per BQ project configured with sources (not once per region). We have a (slightly janky) existing pattern for interpolating the region into the query, which we can pursue if we're looking to put this code into production; it isn't necessary for the spike to answer the question, "Is this substantially better?" A hypothetical sketch of what such a macro might look like follows this list.
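
To make the templating and region-interpolation points concrete, here is a hypothetical Jinja sketch of such a macro. The macro name and dispatch signature follow the pattern the base adapter uses for relation metadata, but the body is illustrative only and assumes all selected sources live in a single project and region:

```sql
{% macro bigquery__get_relation_last_modified(information_schema, relations) %}
  {#- Illustrative sketch: one TABLE_STORAGE query per BigQuery project.
      Assumes a single project and a hardcoded US region; a production
      implementation would need to interpolate the region per project. -#}
  {%- call statement('last_modified', fetch_result=True) -%}
    select
      table_schema as schema,
      table_name as identifier,
      storage_last_modified_time as last_modified,
      current_timestamp() as snapshotted_at
    from `{{ relations[0].database }}.region-us.INFORMATION_SCHEMA.TABLE_STORAGE`
    where table_schema in (
      {%- for relation in relations -%}
        '{{ relation.schema }}'{{ ", " if not loop.last }}
      {%- endfor -%}
    )
    and (
      {%- for relation in relations -%}
        (table_schema = '{{ relation.schema }}' and table_name = '{{ relation.identifier }}')
        {{ "or " if not loop.last }}
      {%- endfor -%}
    )
  {%- endcall -%}
  {{ return(load_result('last_modified')) }}
{% endmacro %}
```

Because every schema and table name is templated literally into the statement, the query text grows with the number of selected sources, which is where the character-limit concern in the first bullet comes from.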

Describe alternatives you've considered

Existing implementation: non-batch, leveraging the BigQuery API (client.get_table per source table). This is likely preferable in cases where a project contains only a few sources.

Who will this benefit?

BigQuery users with lots of sources, who want to calculate freshness for them all at once

Are you interested in contributing this feature?

No response

Anything else?

Spike:

jtcohen6 added the enhancement label May 15, 2024
@adamcunnington-mlg

We tested the branch (pip install git+https://github.com/dbt-labs/dbt-bigquery.git@batch-metadata-benchmarking#egg=dbt-bigquery) and can confirm a fantastic improvement!

1156 sources completed in 35 seconds. I ran it twice more and the time was very similar. Our current run time spans between 4 and 12 minutes due to query volatility, but the new solution reduces this to a single query, which is not only a massive speed-up but also a huge reduction in volatility. This is a big win.

The only thing I've not yet fully validated is the accuracy of the reported time from the INFORMATION_SCHEMA table. The BQ docs state that "The data in the INFORMATION_SCHEMA.TABLE_STORAGE view is not kept in real time, and updates are typically delayed by a few seconds to a few minutes". We use the source_status:fresher+ selector in our build jobs, so the impact is that we might not catch things that have changed immediately, but they'd be picked up by our subsequent run - a small latency impact, and a reasonable price for the improvement.

@adamcunnington

@jtcohen6 @MichelleArk can you remind me what this feature is waiting on before it can be merged?


tanukifk commented Nov 2, 2024

Hi, I've found that the existing client.get_table method does not work correctly when a source name contains a wildcard, as mentioned in #1363. I believe this enhancement can fix the problem and should get more attention. 👍
