merge original dbt-dry-run updated code into migo (#3)
* Add support for INTERVAL column type

* Add support for JSON column type

* Add integration test for column types

* Improve schema parsing from BigQuery client (autotraderuk#29)

* Update release process

* Release v0.6.3

* Support for dbt v1.4 (autotraderuk#30)

* Upgrade to dbt 1.4

* Update CHANGES.md

* Release v0.6.4

* Add --skip-not-compiled flag (autotraderuk#32)

* Add --skip-not-compiled flag

* Release v0.6.5

* Add --version parameter to print dry run version

* Support python 3.10

* Update poetry.lock

* Ignore more deprecation warnings

* Ignore invalid escape sequences

* Integration tests for incremental models

* Support Python 3.11 (autotraderuk#37)

Had to ignore the deprecation warning for the "cgi" core module
being used by google-cloud-storage (transitive dependency)

Integration and unit tests ran and are passing

* Add extra-check-columns-metadata-key option (autotraderuk#36)

* Release 0.6.6

Minor improvements and new CLI options

* Support dbt 1.5 (autotraderuk#40)

* Upgrade dbt to 1.5 and fix failing tests

* target-path in project has been deprecated

* Add --threads override option

* Release v0.6.7

* Add compatibility with dbt 1.6rc1

* Update to 1.6.0

* Release 0.6.8

* --full-refresh and --target-path CLI flags support (autotraderuk#44)

* add support for cli flag --full-refresh
expose it as a global flag
get predicted/model schema for full-refresh nodes

* wire dbt --target-path cli flag
allows integration tests to have multiple project contexts running at the same time without conflicting targets

* add full_refresh support derived from dbt model spec as well

* test full refresh precedence between cli flag and model config

* verify and update readme

* rename integration tests to make it clearer

* refactor full refresh precedence to match dbt docs definition

* update lock file and changes.md

* Release 0.7.0

* Refactor model runner to split by materialization

* Check incremental data types are compatible (autotraderuk#45)

* Extra dry run to verify type compatibility
* Refactor incremental runner unit tests
* Struct integration test

* Add changelog

* Release v0.7.1

* Fix run-integration.sh writing to wrong target

* Use column_types config for seeds (autotraderuk#46)

* Use adapter to convert agate types for seeds

* Print schema if a node succeeds when failure was expected

* Load `column_types` when dry running seeds

* Add changelog

* Release v0.7.2

* Remove `columns` schema redundancy for external sources (autotraderuk#47)

* Respect existing column ordering for incremental models (autotraderuk#50)

* Don't run merge if incremental has recursive CTES (autotraderuk#51)

* Collate changes for 0.7.3

* Release v0.7.3

* Fix false failure when require_partition_filter is set (autotraderuk#56)

fix filtered_partition_date

Co-authored-by: Maliek Borwin <[email protected]>

* Changes for v0.7.4

* Release v0.7.4

* Fix problem where sql_header interacts with merge

* Release v0.7.5

* Merge original dbt-dry-run updated code into migo dbt-dry-run

* modify pyproject.toml pydantic dependency version to at least 1.10.8

---------

Co-authored-by: Philippa Main <[email protected]>
Co-authored-by: connor-charles <[email protected]>
Co-authored-by: Connor Charles <[email protected]>
Co-authored-by: zachary-povey <[email protected]>
Co-authored-by: Connor Charles <[email protected]>
Co-authored-by: Angelos Georgiadis <[email protected]>
Co-authored-by: Angelos Georgiadis <[email protected]>
Co-authored-by: bokhi <[email protected]>
Co-authored-by: malik016 <[email protected]>
Co-authored-by: Maliek Borwin <[email protected]>
Co-authored-by: bruce_huang <[email protected]>
12 people authored Feb 29, 2024
1 parent 7c1ed1d commit 7e0ee0e
Showing 94 changed files with 4,493 additions and 1,745 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -139,3 +139,4 @@ integration-svc.json
# dbt
logs
integration/profiles/.user.yml
target-full-refresh/
88 changes: 88 additions & 0 deletions CHANGES.md
@@ -1,5 +1,93 @@
## Changelog

# dbt-dry-run v0.7.5

## Bugfixes

- Fix issue with incremental models where `sql_header` is set

# dbt-dry-run v0.7.4

## Bugfixes

- Fix false failure when incremental models use `require_partition_filter=True`

# dbt-dry-run v0.7.3

## Bugfixes

- Incremental models now correctly predict the column order if the table already exists in the target environment
- External tables no longer always require defining the schema twice in the YAML if the table source allows it
- Incremental models no longer cause a syntax error when they use `with recursive` CTEs

# dbt-dry-run v0.7.2

## Bugfixes

- Seed files now get their schema using type inference from the adapter, so they always line up with what dbt produces
- Seed file `column_types` configuration is respected

# dbt-dry-run v0.7.1

## Bugfixes

- Fix the dry runner falsely reporting success if an incremental model has an incompatible type change for an existing column

# dbt-dry-run v0.7.0

## Improvements

- Adds `--full-refresh` support. Dry running with full refresh will make use of the predicted schema. This option aligns with the dbt CLI
- Adds `--target-path` support. This option aligns with the dbt CLI

# dbt-dry-run v0.6.8

- Compatibility with dbt v1.6

# dbt-dry-run v0.6.7

- Compatibility with dbt v1.5

- Adds `--threads` option as an override

# dbt-dry-run v0.6.6

## Bugfixes & Improvements

- Added `--extra-check-columns-metadata-key` CLI option. Specifying this will mean that you can use another metadata
key instead of just `dry_run.check_columns`. `dry_run.check_columns` will always take priority over the extra key.
This is useful if you have an existing metadata key such as `governance.is_odp` that you want to enable metadata
checking for

- Added `--version` CLI option to print the installed version of `dbt-dry-run`

- Added support for Python 3.11 ([zachary-povey](https://github.com/zachary-povey))

# dbt-dry-run v0.6.5

## Bugfixes & Improvements

- Added command line flag `--skip-not-compiled` which will override the default behaviour of raising a `NotCompiledException`
if a node in the manifest should have been compiled but was not. This should only be used in certain circumstances where you want
to skip an entire section of your dbt project from the dry run, or if you don't want to dry run tests

- Added `status` to the report artefact, which can be `SUCCESS`, `FAILED` or `SKIPPED`

# dbt-dry-run v0.6.4

## Bugfixes & Improvements

- Add support for dbt 1.4

# dbt-dry-run v0.6.3

## Bugfixes & Improvements

- Add support for INTERVAL and JSON types.

- Improved error handling when parsing the predicted schema of the dry run queries. An `UnknownSchemaException` is now raised
detailing the field type returned by BigQuery that is not recognised

# dbt-dry-run v0.6.2

## Bugfixes
46 changes: 37 additions & 9 deletions CONTRIBUTING.md
@@ -1,3 +1,27 @@
# Contributing/Running locally

To set up a dev environment you need [poetry][get-poetry]. First run `poetry install` to install all dependencies; the
`Makefile` then contains all the commands needed to run the test suite and linting.

- verify: Formats code with `black`, type checks with `mypy` and then runs the unit tests with coverage.
- integration: Runs the integration tests against BigQuery (See Integration Tests)
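
For example, a first run of the dev tooling might look like this (just a sketch of the targets described above):

```
poetry install   # install all dependencies into the poetry-managed environment
make verify      # format with black, type check with mypy, run the unit tests with coverage
```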

There is also a shell script `./run-integration.sh <PROJECT_DIR>` which will run one of the integration tests locally.
Where `<PROJECT_DIR>` is one of the directory names in `/integration/projects/`. (See Integration Tests)

## Running Integration Tests

In order to run integration tests locally you will need access to a BigQuery project/instance in which your gcloud
application default credentials have the role `BigQuery Data Owner`. The BigQuery instance should have an empty dataset
called `dry_run`.

Setting the environment variable `DBT_PROJECT=<YOUR GCP PROJECT HERE>` will tell the integration tests which GCP project
to run the test suite against. The test suite does not currently materialize any data into the project.
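
As a sketch (the GCP project ID and the integration project directory name below are placeholders), running the suite
locally looks something like:

```
export DBT_PROJECT=my-gcp-project        # placeholder: a project containing an empty `dry_run` dataset
make integration                         # run the whole integration suite against BigQuery
./run-integration.sh test_column_types   # or run a single project from /integration/projects/ (name illustrative)
```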

The integration tests will run on any push to `main` to ensure the package's core functionality is still in place.

__Auto Trader employees can request authorisation to access the `at-dry-run-integration-dev` project for this purpose__

# Preparing for a Release

## Bump Version
Expand Down Expand Up @@ -31,15 +55,19 @@ Currently, we are using the below format for each release:

## Releasing to PyPi

Currently, the github action setup in `.github/workflows/main.yml` automatically releases to PyPI on tag. The procedure
for releasing should be:

1. Check the last github action on `main` was green

2. Check the `CHANGES.md` has been updated with all the commits/PRs since the last tagged release. Decide what version
the new release should be; this project roughly follows SemVer

3. Make and push a version bump commit which increases the version in `pyproject.toml` (ensure this is consistent with the latest
version in `CHANGES.md`)

4. Once the version bump commit's GH action is green, tag that commit with the same version prefixed with `v`, so the tag
name is `vX.X.X` (see the example below). Pushing this tag starts another GH action which releases the package version
to production PyPI

5. Verify the package is released
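
For example, the tagging step in 4 might look like this (substitute the real version number):

```
git checkout main && git pull   # make sure you are on the green version bump commit
git tag vX.X.X                  # same version as pyproject.toml, prefixed with `v`
git push origin vX.X.X          # pushing the tag triggers the release GitHub action
```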
1 change: 1 addition & 0 deletions Makefile
@@ -18,6 +18,7 @@ mypy:
.PHONY: format
format:
black dbt_dry_run
black integration
isort dbt_dry_run

.PHONY: verify
122 changes: 64 additions & 58 deletions README.md
@@ -22,7 +22,16 @@ pip install dbt-dry-run
### Running

The dry runner has a single command called `dbt-dry-run`. In order for it to run you must first compile a dbt manifest
using `dbt compile` as you normally would.

<details>
<summary>How much of the project should I compile?</summary>
It is best practice to compile the entire dbt project when supplying a manifest for dry run. The
dry run loops through your project in the DAG order (staging -> intermediate -> mart) based on `ref` and predicts the
schema of each model as it progresses. If you dry run `marts` but have not compiled `staging` then it cannot
determine if `marts` will run as it does not know the predicted schema of the upstream models and you will see
`NotCompiledException` in the dry run output.
</details>

Then, on the same machine (so that the dry runner has access to your dbt project source and the compiled manifest), you
can run the dry runner in the same directory as your `dbt_project.yml`:
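
For illustration, a minimal end-to-end invocation might look like this (assuming `profiles.yml` is in its default
location):

```
dbt compile    # compile the whole project so every model is in the manifest
dbt-dry-run    # run from the directory containing dbt_project.yml
```
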
@@ -42,26 +51,44 @@ The full CLI help is shown below; anything prefixed with [dbt] can be used in the same way as in the dbt CLI:

```
❯ dbt-dry-run --help
Usage: dbt-dry-run [OPTIONS]
Options:
--profiles-dir TEXT [dbt] Where to search for `profiles.yml`
[default: /Users/connor.charles/.dbt]
--project-dir TEXT [dbt] Where to search for `dbt_project.yml`
[default: /Users/connor.charles/Code/dbt-
dry-run]
--vars TEXT [dbt] CLI Variables to pass to dbt
[default: {}]
--target TEXT [dbt] Target profile
--target-path TEXT [dbt] Target path
--verbose / --no-verbose Output verbose error messages [default: no-
verbose]
--report-path TEXT Json path to dump report to
--skip-not-compiled Whether or not the dry run should ignore
models that are not compiled. This has
several caveats that make this not a
recommended option. The dbt manifest should
generally be compiled with `--select *` to
ensure good coverage
--full-refresh [dbt] Full refresh
--extra-check-columns-metadata-key TEXT
An extra metadata key that can be used in
place of `dry_run.check_columns` for
verifying column metadata has been specified
correctly. `dry_run.check_columns` will
always take precedence. The metadata key
should be of boolean type or it will be cast
to a boolean to be 'True/Falsey`
--version
--install-completion [bash|zsh|fish|powershell|pwsh]
Install completion for the specified shell.
--show-completion [bash|zsh|fish|powershell|pwsh]
Show completion for the specified shell, to
copy it or customize the installation.
--help Show this message and exit.
--tags TEXT [dbt] tags need to dry run, eg: tag1,tag2
```
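
As an illustration (not part of the help output above), several of the newer flags can be combined in one invocation;
the report file name and metadata key here are examples only:

```
dbt-dry-run --full-refresh \
  --report-path dry_run_report.json \
  --extra-check-columns-metadata-key governance.is_odp
```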

## Reporting Results & Failures
@@ -92,12 +119,13 @@ DRY RUN FAILURE!

The process will also return exit code 1

### Column and Metadata Linting

The dry runner can also be configured to inspect your metadata YAML and assert that the predicted schema of your dbt
project's data warehouse matches what is documented in the metadata. To enable this for your models specify the key
`dry_run.check_columns: true`. The dry runner will then fail if the model's documentation does not match. You can also
specify a custom extra key to enable `check_columns` by setting the CLI argument `--extra-check-columns-metadata-key`.
For example, the full metadata for this model:

```yaml
models:
@@ -140,15 +168,16 @@ Currently, these rules can cause linting failures:
2. EXTRA_DOCUMENTED_COLUMNS: The predicted schema of the model does not have this column that was specified in the
metadata

This could be extended to verify that the datatype has been set correctly as well, or to add other linting rules such as
naming conventions based on datatype.

### Usage with dbt-external-tables

The dbt package [dbt-external-tables][dbt-external-tables] gives dbt support for staging and managing
[external tables][bq-external-tables]. These sources do not produce any compiled sql in the manifest, so it is not
possible for the dry runner to predict their schema. Therefore, you must specify the resulting schema manually in the
metadata of the source. For example if you were import data from a gcs bucket:
metadata of the source.

However, if the `columns` schema is already defined under the `name` in the YAML config, you do not need to specify `dry_run_columns` under `external`. The dry runner will use the `columns` schema if `dry_run_columns` is not specified. This avoids duplicated schema definitions.

For example, if you were importing data from a GCS bucket:

```yaml
version: 2
@@ -196,7 +225,8 @@ information of each node's predicted schema or error message if it has failed:
"nodes": [
{
"unique_id": "seed.test_models_with_invalid_sql.my_seed",
"success": true,
"success": true,
"status": "SUCCESS",
"error_message": null,
"table": {
"fields": [
@@ -207,6 +237,7 @@
{
"unique_id": "model.test_models_with_invalid_sql.first_layer",
"success": true,
"status": "SUCCESS",
"error_message": null,
"table": {
"fields": [
@@ -217,37 +248,14 @@
{
"unique_id": "model.test_models_with_invalid_sql.second_layer",
"success": false,
"status": "FAILURE",
"error_message": "BadRequest",
"table": null
}
]
}
```
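
If you have `jq` installed, one way to pull the failing nodes out of this report might be (assuming it was written with
`--report-path report.json`):

```
jq -r '.nodes[] | select(.success == false) | .unique_id' report.json
```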

## Capabilities and Limitations

### Things this can catch
@@ -276,12 +284,10 @@ There are certain cases where a syntactically valid query can fail due to the data

### Things still to do...

Implementing the dry runner required re-implementing some areas of dbt, mainly how the adapter sets up connections and
credentials with the BigQuery client. We have only implemented the methods we use to connect to our own warehouse, so if
you don't use OAUTH or service account JSON files then this won't be able to read `profiles.yml` correctly.

The implementation of seeds is also incomplete, as we don't use them very much in our own dbt projects. The dry runner
will just use the datatypes that `agate` infers from the CSV files.
If you see anything else that you think it should catch don't hesitate to raise an issue!

[dbt-home]: https://www.getdbt.com/
