merge original dbt-dry-run updated code into migo (#3)
* Add support for INTERVAL column type

* Add support for JSON column type

* Add integration test for column types

* Improve schema parsing from BigQuery client (autotraderuk#29)

* Update release process

* Release v0.6.3

* Support for dbt v1.4 (autotraderuk#30)

* Upgrade to dbt 1.4

* Update CHANGES.md

* Release v0.6.4

* Add --skip-not-compiled flag (autotraderuk#32)

* Add --skip-not-compiled flag

* Release v0.6.5

* Add --version parameter to print dry run version

* Support python 3.10

* Update poetry.lock

* Ignore more deprecation warnings

* Ignore invalid escape sequences

* Integration tests for incremental models

* Support Python 3.11 (autotraderuk#37)

Had to ignore the deprecation warning for the "cgi" core module
being used by google-cloud-storage (transitive dependency)

Integration and unit tests ran and are passing

* Add extra-check-columns-metadata-key option (autotraderuk#36)

* Release 0.6.6

Minor improvements and new CLI options

* Support dbt 1.5 (autotraderuk#40)

* Upgrade dbt to 1.5 and fix failing tests

* target-path in project has been deprecated

* Add --threads override option

* Release v0.6.7

* Add compatibility with dbt 1.6rc1

* Update to 1.6.0

* Release 0.6.8

* --full-refresh and --target-path CLI flags support (autotraderuk#44)

* add support for cli flag --full-refresh
expose it as a global flag
get predicted/model schema for full-refresh nodes

* wire dbt --target-path cli flag
allows integration tests to have multiple project contexts running at the same time without conflicting targets

* add full_refresh support derived from dbt model spec as well

* test full refresh precedence between cli flag and model config

* verify and update readme

* rename integration tests to make it clearer

* refactor full refresh precedence to match dbt docs definition

* update lock file and changes.md

* Release 0.7.0

* Refactor model runner to split by materialization

* Check incremental data types are compatible (autotraderuk#45)

* Extra dry run to verify type compatibility
* Refactor incremental runner unit tests
* Struct integration test

* Add changelog

* Release v0.7.1

* Fix run-integration.sh writing to wrong target

* Use column_types config for seeds (autotraderuk#46)

* Use adapter to convert agate types for seeds

* Print schema if a node succeeds when failure was expected

* Load `column_types` when dry running seeds

* Add changelog

* Release v0.7.2

* Remove `columns` schema redundancy for external sources (autotraderuk#47)

* Respect existing column ordering for incremental models (autotraderuk#50)

* Don't run merge if incremental has recursive CTES (autotraderuk#51)

* Collate changes for 0.7.3

* Release v0.7.3

* Fix false failure when require_partition_filter is set (autotraderuk#56)

fix filtered_partition_date

Co-authored-by: Maliek Borwin <[email protected]>

* Changes for v0.7.4

* Release v0.7.4

* Fix problem where sql_header interacts with merge

* Release v0.7.5

* Merge original dbt-dry-run updated code into migo dbt-dry-run

* modify pyproject.toml pydantic dependency version to at least 1.10.8

---------

Co-authored-by: Philippa Main <[email protected]>
Co-authored-by: connor-charles <[email protected]>
Co-authored-by: Connor Charles <[email protected]>
Co-authored-by: zachary-povey <[email protected]>
Co-authored-by: Connor Charles <[email protected]>
Co-authored-by: Angelos Georgiadis <[email protected]>
Co-authored-by: Angelos Georgiadis <[email protected]>
Co-authored-by: bokhi <[email protected]>
Co-authored-by: malik016 <[email protected]>
Co-authored-by: Maliek Borwin <[email protected]>
Co-authored-by: bruce_huang <[email protected]>
12 people authored Feb 29, 2024
1 parent 7c1ed1d commit 7e0ee0e
Showing 94 changed files with 4,493 additions and 1,745 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -139,3 +139,4 @@ integration-svc.json
# dbt
logs
integration/profiles/.user.yml
target-full-refresh/
88 changes: 88 additions & 0 deletions CHANGES.md
@@ -1,5 +1,93 @@
## Changelog

# dbt-dry-run v0.7.5

## Bugfixes

- Fix issue with incremental models where `sql_header` is set

# dbt-dry-run v0.7.4

## Bugfixes

- Fix false failure when incremental models use `require_partition_filter=True`

# dbt-dry-run v0.7.3

## Bugfixes

- Incremental models now correctly predict the column order if the table already exists in the target environment
- External tables no longer always require defining the schema twice in the YAML if the table source allows it
- Incremental models no longer cause a syntax error when they use `with recursive` CTEs

# dbt-dry-run v0.7.2

## Bugfixes

- Seed files now get their schema using type inference from the adapter, so they always line up with what dbt produces
- Seed file `column_types` configuration is respected

# dbt-dry-run v0.7.1

## Bugfixes

- Fix the dry runner falsely reporting success if an incremental model has an incompatible type change for an existing column

# dbt-dry-run v0.7.0

## Improvements

- Adds `--full-refresh` support. Dry running with full refresh will make use of the predicted schema. This option aligns with the dbt CLI
- Adds `--target-path` support. This option aligns with the dbt CLI

# dbt-dry-run v0.6.8

- Compatibility with dbt v1.6

# dbt-dry-run v0.6.7

- Compatibility with dbt v1.5

- Adds `--threads` option as an override

# dbt-dry-run v0.6.6

## Bugfixes & Improvements

- Added `--extra-check-columns-metadata-key` CLI option. Specifying this will mean that you can use another metadata
key instead of just `dry_run.check_columns`. `dry_run.check_columns` will always take priority over the extra key.
This is useful if you have an existing metadata key such as `governance.is_odp` that you want to enable metadata
checking for

- Added `--version` CLI option to print the installed version of `dbt-dry-run`

- Added support for Python 3.11 ([zachary-povey](https://github.com/zachary-povey))

# dbt-dry-run v0.6.5

## Bugfixes & Improvements

- Added command line flag `--skip-not-compiled` which will override the default behaviour of raising a `NotCompiledException`
if a node in the manifest should have been compiled but was not. This should only be used in certain circumstances where you want
to skip an entire section of your dbt project from the dry run, or if you don't want to dry run tests

- Added `status` to the report artefact, which can be `SUCCESS`, `FAILED` or `SKIPPED`

# dbt-dry-run v0.6.4

## Bugfixes & Improvements

- Add support for dbt 1.4

# dbt-dry-run v0.6.3

## Bugfixes & Improvements

- Add support for INTERVAL and JSON types.

- Improved error handling when parsing the predicted schema of the dry run queries. An `UnknownSchemaException` is now raised
detailing the field type returned by BigQuery that is not recognised

# dbt-dry-run v0.6.2

## Bugfixes
46 changes: 37 additions & 9 deletions CONTRIBUTING.md
@@ -1,3 +1,27 @@
# Contributing/Running locally

To set up a dev environment you need [poetry][get-poetry]. First run `poetry install` to install all dependencies; the
`Makefile` then contains all the commands needed to run the test suite and linting.

- verify: Formats code with `black`, type checks with `mypy` and then runs the unit tests with coverage.
- integration: Runs the integration tests against BigQuery (See Integration Tests)
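
For example, a first run of the dev tooling might look like this (just a sketch of the targets described above):

```
poetry install   # install all dependencies into the poetry-managed environment
make verify      # format with black, type check with mypy, run the unit tests with coverage
```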

There is also a shell script `./run-integration.sh <PROJECT_DIR>` which will run one of the integration tests locally.
Where `<PROJECT_DIR>` is one of the directory names in `/integration/projects/`. (See Integration Tests)

## Running Integration Tests

In order to run integration tests locally you will need access to a BigQuery project/instance in which your gcloud
application default credentials have the role `BigQuery Data Owner`. The BigQuery instance should have an empty dataset
called `dry_run`.

Setting the environment variable `DBT_PROJECT=<YOUR GCP PROJECT HERE>` will tell the integration tests which GCP project
to run the test suite against. The test suite does not currently materialize any data into the project.
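
As a sketch (the GCP project ID and the integration project directory name below are placeholders), running the suite
locally looks something like:

```
export DBT_PROJECT=my-gcp-project        # placeholder: a project containing an empty `dry_run` dataset
make integration                         # run the whole integration suite against BigQuery
./run-integration.sh test_column_types   # or run a single project from /integration/projects/ (name illustrative)
```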

The integration tests will run on any push to `main` to ensure the package's core functionality is still in place.

__Auto Trader employees can request authorisation to access the `at-dry-run-integration-dev` project for this purpose__

# Preparing for a Release

## Bump Version
Expand Down Expand Up @@ -31,15 +55,19 @@ Currently, we are using the below format for each release:

## Releasing to PyPi

Currently, the github action setup in `.github/workflows/main.yml` automatically releases to PyPI on tag. The procedure
for releasing should be:

1. Check the last github action on `main` was green

2. Check the `CHANGES.md` has been updated with all the commits/PRs since the last tagged release. Decide what version
the new release should be; this project roughly follows SemVer

3. Make and push a version bump commit which increases the version in `pyproject.toml` (ensure this is consistent with the latest
version in `CHANGES.md`)

4. Once the version bump commit's GH action is green, tag that commit with the same version prefixed with `v`, so the tag
name is `vX.X.X` (see the example below). Pushing this tag starts another GH action which releases the package version
to production PyPI

5. Verify the package is released
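
For example, the tagging step in 4 might look like this (substitute the real version number):

```
git checkout main && git pull   # make sure you are on the green version bump commit
git tag vX.X.X                  # same version as pyproject.toml, prefixed with `v`
git push origin vX.X.X          # pushing the tag triggers the release GitHub action
```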
1 change: 1 addition & 0 deletions Makefile
@@ -18,6 +18,7 @@ mypy:
.PHONY: format
format:
black dbt_dry_run
black integration
isort dbt_dry_run

.PHONY: verify
122 changes: 64 additions & 58 deletions README.md
@@ -22,7 +22,16 @@ pip install dbt-dry-run
### Running

The dry runner has a single command called `dbt-dry-run`. In order for it to run you must first compile a dbt manifest
using `dbt compile` as you normally would.

<details>
<summary>How much of the project should I compile?</summary>
It is best practice to compile the entire dbt project when supplying a manifest for dry run. The
dry run loops through your project in the DAG order (staging -> intermediate -> mart) based on `ref` and predicts the
schema of each model as it progresses. If you dry run `marts` but have not compiled `staging` then it cannot
determine if `marts` will run as it does not know the predicted schema of the upstream models and you will see
`NotCompiledException` in the dry run output.
</details>

Then, on the same machine (so that the dry runner has access to your dbt project source and the compiled manifest), you
can run the dry runner in the same directory as your `dbt_project.yml`:
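
For illustration, a minimal end-to-end invocation might look like this (assuming `profiles.yml` is in its default
location):

```
dbt compile    # compile the whole project so every model is in the manifest
dbt-dry-run    # run from the directory containing dbt_project.yml
```
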
@@ -42,26 +51,44 @@ The full CLI help is shown below; anything prefixed with [dbt] can be used in the same way as in the dbt CLI:

```
❯ dbt-dry-run --help
Usage: dbt-dry-run [OPTIONS]
Options:
--profiles-dir TEXT [dbt] Where to search for `profiles.yml`
[default: /Users/connor.charles/.dbt]
--project-dir TEXT [dbt] Where to search for `dbt_project.yml`
[default: /Users/connor.charles/Code/dbt-
dry-run]
--vars TEXT [dbt] CLI Variables to pass to dbt
[default: {}]
--target TEXT [dbt] Target profile
--target-path TEXT [dbt] Target path
--verbose / --no-verbose Output verbose error messages [default: no-
verbose]
--report-path TEXT Json path to dump report to
--skip-not-compiled Whether or not the dry run should ignore
models that are not compiled. This has
several caveats that make this not a
recommended option. The dbt manifest should
generally be compiled with `--select *` to
ensure good coverage
--full-refresh [dbt] Full refresh
--extra-check-columns-metadata-key TEXT
An extra metadata key that can be used in
place of `dry_run.check_columns` for
verifying column metadata has been specified
correctly. `dry_run.check_columns` will
always take precedence. The metadata key
should be of boolean type or it will be cast
to a boolean to be 'True/Falsey`
--version
--install-completion [bash|zsh|fish|powershell|pwsh]
Install completion for the specified shell.
--show-completion [bash|zsh|fish|powershell|pwsh]
Show completion for the specified shell, to
copy it or customize the installation.
--help Show this message and exit.
--tags TEXT [dbt] tags need to dry run, eg: tag1,tag2
```
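
As an illustration (not part of the help output above), several of the newer flags can be combined in one invocation;
the report file name and metadata key here are examples only:

```
dbt-dry-run --full-refresh \
  --report-path dry_run_report.json \
  --extra-check-columns-metadata-key governance.is_odp
```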

## Reporting Results & Failures
@@ -92,12 +119,13 @@ DRY RUN FAILURE!

The process will also return exit code 1

### Column and Metadata Linting

The dry runner can also be configured to inspect your metadata YAML and assert that the predicted schema of your dbt
project's data warehouse matches what is documented in the metadata. To enable this for your models specify the key
`dry_run.check_columns: true`. The dry runner will then fail if the model's documentation does not match. You can also
specify a custom extra key to enable `check_columns` by setting the CLI argument `--extra-check-columns-metadata-key`.
For example, the full metadata for this model:

```yaml
models:
@@ -140,15 +168,16 @@ Currently, these rules can cause linting failures:
2. EXTRA_DOCUMENTED_COLUMNS: The predicted schema of the model does not have this column that was specified in the
metadata

This could be extended to verify that the datatype has been set correctly as well, or to add other linting rules such as
naming conventions based on datatype.

### Usage with dbt-external-tables

The dbt package [dbt-external-tables][dbt-external-tables] gives dbt support for staging and managing
[external tables][bq-external-tables]. These sources do not produce any compiled sql in the manifest, so it is not
possible for the dry runner to predict their schema. Therefore, you must specify the resulting schema manually in the
metadata of the source. For example if you were import data from a gcs bucket:
metadata of the source.

However, if the `columns` schema is already defined under the `name` in the YAML config, you do not need to specify `dry_run_columns` under `external`. The dry runner will use the `columns` schema if `dry_run_columns` is not specified. This avoids duplicated schema definitions.

For example, if you were importing data from a GCS bucket:

```yaml
version: 2
@@ -196,7 +225,8 @@ information of each node's predicted schema or error message if it has failed:
"nodes": [
{
"unique_id": "seed.test_models_with_invalid_sql.my_seed",
"success": true,
"success": true,
"status": "SUCCESS",
"error_message": null,
"table": {
"fields": [
@@ -207,6 +237,7 @@
{
"unique_id": "model.test_models_with_invalid_sql.first_layer",
"success": true,
"status": "SUCCESS",
"error_message": null,
"table": {
"fields": [
@@ -217,37 +248,14 @@
{
"unique_id": "model.test_models_with_invalid_sql.second_layer",
"success": false,
"status": "FAILURE",
"error_message": "BadRequest",
"table": null
}
]
}
```
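
If you have `jq` installed, one way to pull the failing nodes out of this report might be (assuming it was written with
`--report-path report.json`):

```
jq -r '.nodes[] | select(.success == false) | .unique_id' report.json
```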

## Capabilities and Limitations

### Things this can catch
@@ -276,12 +284,10 @@ There are certain cases where a syntactically valid query can fail due to the data

### Things still to do...

Implementing the dry runner required re-implementing some areas of dbt, mainly how the adapter sets up connections and
credentials with the BigQuery client. We have only implemented the methods we use to connect to our own warehouse, so if
you don't use OAUTH or service account JSON files then this won't be able to read `profiles.yml` correctly.

The implementation of seeds is also incomplete, as we don't use them very much in our own dbt projects. The dry runner
will just use the datatypes that `agate` infers from the CSV files.
If you see anything else that you think it should catch don't hesitate to raise an issue!

[dbt-home]: https://www.getdbt.com/
