-
-
Notifications
You must be signed in to change notification settings - Fork 120
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Dbt setup #4011
base: main
Are you sure you want to change the base?
Conversation
environments/conda-linux-64.lock.yml
Outdated
- importlib-metadata=8.5.0=pyha770c72_1 | ||
- importlib-metadata=6.10.0=pyha770c72_0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For some reason dbt-semantic-interfaces
is stuck between 6 and 7.
- isodate=0.7.2=pyhd8ed1ab_1 | ||
- isodate=0.6.1=pyhd8ed1ab_0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dbt-common
is stuck between 0.6 and 0.7.
src/pudl/dbt/models/schema.yml
Outdated
tables: | ||
- name: out_eia923__boiler_fuel | ||
- name: out_eia923__monthly_boiler_fuel | ||
- name: out_ferc1__yearly_steam_plants_fuel_by_plant_sched402 | ||
- name: out_vcerare__hourly_available_capacity_factor |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a data warehouse with hundreds of tables, would this file be created and managed by hand? Or would there be some rule-based way to generate it, or parts of it, along the lines of what we're doing with the Pandera schema checks right now? For example, the not_null
tests here are a 2nd place that that restriction is being specified -- it's already present in our table metadata, which seems like recipe for them getting out of sync.
Or in the case of row counts, is there a clean, non-manual way to update the row counts to reflect whatever the currently observed counts are? Especially if we're trying to regenerate expected row counts for each individual year, filling it all in manually could be pretty tedious and error prone. We've moved toward specifying per-year row counts on the newer assets so that they work transparently in either the fast or full ETL cases, and the asset checks don't need to be aware of which kind of job they're being run in, which seems both more specific and more robust.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like the "X column is not null" checks are currently defined in fields.py
under the field constraints, is that what you're thinking about?
I think it would be nice to have auto-generated tests like the non-null tests & row counts defined alongside manually added tests. Then all the tests will be defined in one place, except for the tests that we need to write custom Python code for.
That seems pretty doable - YAML is easy to work with, and dbt lets us tag tests, so we could easily tag all the auto-generated tests so our generation scripts know to replace them but leave the manually-added tests alone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In addition to the field specific constraints I think we automatically add NOT NULL
check constraints to the PK fields when we construct the SQLite database -- but more generally I'm just saying that we need to get all of these generated tests integrated non-duplicatively into the dbt
tests somehow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems totally possible to auto-generate tests, but I think there's also probably many ways to do accomplish this, so we should figure out what we want from it. For example, when we talk about auto-generating row count/not null tests, will these be generated once and committed into the repo, or will some/all of them be dynamically generated at runtime?
It definitely seems tricky to minimize duplication between dbt
/our existing python schema info. I also wonder how this plays into any refactoring of our metadata system?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It feels like we may need to clearly define the data tests that are ready to be migrated in a straightforward way, and the things that still need design work, so we can point Margay folks at the stuff that's ready to go and keep thinking about the things that still need some scaffolding?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we do end up needing to define these intermediate tables it seems like we would want to have some kind of clear naming convention for them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think that seems like a good idea. Maybe just use a validation_
prefix and otherwise follow existing naming conventions?
src/pudl/dbt/models/schema.yml
Outdated
- dbt_expectations.expect_compound_columns_to_be_unique: | ||
column_list: ["county_id_fips", "datetime_utc"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could be generated based on the PK that's defined for every table?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be possible. We can also probably come up with a way to generate foreign key checks so we can actually verify foreign keys for tables only in parquet
src/pudl/dbt/models/schema.yml
Outdated
- dbt_expectations.expect_table_row_count_to_equal: | ||
value: | | ||
{%- if target.name == "etl-fast" -%} 27287400 | ||
{%- else -%} 136437000 | ||
{%- endif -%} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a clean way to specify the expected row counts for each year of data (or some other meaningful subset) within a table, as we've started doing for the newer assets in Dagster asset checks, so we don't have to differentiate between fast and full validations, and can identify where the changes are?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'd probably need to create a custom macro for this, but that seems totally doable. Big question is how we want to generate/store all of those tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The row count tests have functionally become regression tests -- we want to know when they change, and verify that the magnitude and nature of the change is expected based on the code or data that we've changed. Given that there are hundreds of tables (and thousands of table-years) it doesn't seem practical to hand-code all of the expected row counts.
It would be nice to have the per table-year row counts stored in (say) YAML somewhere, and be able to generate a new version of that file based on current ETL outputs. Then we could look at the diffs between the old and the new versions of the file when trying to assess changes in the lengths of the outputs.
src/pudl/dbt/models/schema.yml
Outdated
- dbt_expectations.expect_column_quantile_values_to_be_between: | ||
quantile: 0.05 | ||
min_value: 1.5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm guessing these are not using the weighted quantiles?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah this are just basic quantiles. It's not too hard to get a sql
query that can do a version of weighted quantiles, but the existing vs_historical
tests are hard because they're computing a bunch of quantiles, then comparing them all
dbt/README.md
Outdated
### Adding new tables | ||
To add a new table to the project, you must add it as a | ||
[dbt source](https://docs.getdbt.com/docs/build/sources). The standard way to do | ||
this is to create a new file `models/{data_source}/{table_name}.yml`. If the the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zschira I am confused by the difference between this instruction that tells users to add the table schema as an individual yml
file with the table name and the instructions below for adding tests, where you mention a schema.yml
file that would presumably contain all models.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we're close to being able to convert all the existing tests using the validate.py
framework. It seems like there are currently 3 categories of tests in this framework:
vs_historical
/vs_self
tests, which we've decided are out-of-scope and could be replaced with simpler tests. @zaneselvans we should decide what to do with these tests- Bounds checks on un-weighted quantiles. These are immediately convertable using
dbt_expectations
- Bounds checks on weighted quantiles. These can be converted if we get the
weighted_quantile.sql
macro working
Beyond this, the script I've added to generate scaffolding for new tables will add row count checks per partition. If there's any other checks we want to autogenerate from our metadata we should decide on that now.
@@ -0,0 +1,267 @@ | |||
"""A basic CLI to autogenerate dbt yml.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's the script to add a new table to the project. It will generate a file called dbt/models/{data_source}/{table_name}/schema.yml
that defines the basic schema for a table for dbt and adds a test for row counts per partition. I've added some documentation for usage in dbt/README.md
.
We should decide if there's any other tests we want to automatically add based on our metadata at this point.
@@ -0,0 +1,26 @@ | |||
{% macro weighted_quantile(model, column_name, weight_col, quantile) %} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@marianneke this is the macro that's currently broken. It gets called in expect_column_weighted_quantile_values_to_be_between.sql
. The idea is that this will compute a weighted quantile on a table, then that test will check the quantile vs bounds. I think the issue is with the syntax using WITH
here. The macro dbt/macros/check_row_counts_per_partition.sql
has an example where I fixed this to use the proper duckdb syntax.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you clarify what the syntax issue was and how you solved it in the row count macro?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm I'm actually not sure that it was the issue after all. I was getting a similar error message error message with the row count macro and I did have a syntax error there, but this looks ok to me. It might also be something to do with the expect_column_weighted_quantile_values_to_be_between
macro which is calling this under the hood.
@@ -0,0 +1,18 @@ | |||
{% test check_row_counts_per_partition(model, table_name, partition_column) %} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Compares row counts in the table to expected counts which are stored in seeds/row_counts.csv
@@ -0,0 +1,28 @@ | |||
version: 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zaneselvans do you know why out_eia923__boiler_fuel
isn't using the monthly
naming convention?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are 3 out_eia923__*_boiler_fuel
tables -- this one hasn't been aggregated by month, so there are probably some cases in which there are multiple records for a single month that get combined in the __monthly
version. But I'd need to compare the monthly and unaggregated tables to be sure. I know there's at least one case in the generation tables where the monthly and unaggragated tables are literally identical.
Overview
This PR sets up a
dbt
project within the PUDL repo that will be used for data testing. Details on setup and usage can all be found in theReadme.md
. This PR also includes several data validations that have been converted todbt
tests. The tests currently converted are allvcerare
asset_check
s and FERC fuel by plant cost per mmbtu range checks.In scope
validate.py
for any other obvious candidatesreadme
with directions for migration (how to add new asset, types of test in scope for immediate migration)Out of scope
vs_historical
closes #3997 #3971