Commit 988b07e

docs(dbt): start guidelines

vmttn committed Sep 27, 2024 (1 parent 4560db2)

Showing 2 changed files with 70 additions and 23 deletions.
41 changes: 18 additions & 23 deletions pipeline/CONTRIBUTING.md
@@ -12,19 +12,12 @@ pip install -U pip setuptools wheel

# Install the dev dependencies
pip install -r requirements/dev/requirements.txt

# Install dbt
pip install -r requirements/tasks/dbt/requirements.txt
```

## Running `dbt`

* dbt is configured to target the `target-db` postgres container (see the root `docker-compose.yml`).
* all dbt commands must be run in the `pipeline/dbt` directory.
@@ -44,28 +37,20 @@ dbt run-operation create_udfs
# run commands
dbt ls

# staging, basic processing/mapping:
# - retrieve data from the datalake table
# - retrieve data from raw dedicated source tables
# - retrieve data from the Soliguide S3
dbt build --select models/staging

# intermediate, specific transformations
dbt build --select models/intermediate

# marts, last touch
dbt build --select models/marts
```
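dbt's node selection syntax can also rebuild a single model together with everything it depends on and everything that depends on it, which is handy while iterating. A minimal sketch using the standard graph operators (the model name below is hypothetical):

```bash
# rebuild one model plus all of its upstream and downstream dependencies
dbt build --select +stg_example__structures+

# preview which nodes a selector matches before running it
dbt ls --select +stg_example__structures+
```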

## Updating schema in dbt seeds

* Required when the schema changes.

```bash
python scripts/update_schema_seeds.py
```

## Managing the pipeline requirements

In order to prevent conflicts:

@@ -84,3 +69,13 @@ make all
# to upgrade dependencies
make upgrade all
```

## Running the test suite

```bash
# Copy (and optionally edit) the template .env
cp .template.env .env

# simply use tox (for reproducible environments, packaging errors, etc.)
tox
```
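If you need a tighter loop than the full suite, tox can list its environments and run just one of them. A small sketch, assuming the standard tox CLI (the environment name is hypothetical and depends on `tox.ini`):

```bash
# list the environments defined in tox.ini
tox -l

# run a single environment (replace with a name from the list above)
tox -e py311
```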
52 changes: 52 additions & 0 deletions pipeline/dbt/CONTRIBUTING.md
@@ -0,0 +1,52 @@
# `dbt` guidelines

## testing models

#### `data_tests` vs `unit_tests` vs `contract`:

* with `dbt build`, `data_tests` are run **after** model execution **on the actual data**. A failing test will not prevent the faulty data from being propagated downstream, unless properly handled by the orchestration.
* with `dbt build`, `unit_tests` are run **before** model execution **on mock-up data**. This is great for testing logic, but requires making assumptions about the input data.
* `contract`s are enforced using actual DB constraints, **on the actual data**. A failing constraint will stop the model execution and prevent faulty data from being propagated downstream. Unlike `data_tests`, there is no severity level and no middle ground, and the faulty data cannot easily be queried.

✅ use `unit_tests` to test **complex logic** on well-defined data (e.g. converting opening hours).

❌ avoid `unit_tests` for simple transformations. They are costly to maintain and will very often just duplicate the implementation.

✅ always add a few `data_tests`.

✅ use `contract`s on `marts`. Marts data can be consumed by clients.
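As a rough illustration of how the three mechanisms sit side by side in a model's YAML, here is a minimal sketch; the model, column names and values are hypothetical, not taken from this repo:

```yaml
version: 2

models:
  - name: marts_example__services
    config:
      contract:
        enforced: true            # enforced as real DB constraints at build time
    columns:
      - name: id
        data_type: text
        constraints:
          - type: not_null        # violation stops the model build
        data_tests:
          - unique                # runs after the model is built, on actual data
      - name: horaires
        data_type: text

unit_tests:
  - name: test_horaires_conversion
    model: marts_example__services
    given:
      - input: ref('int_example__services')  # hypothetical upstream model
        rows:
          - {id: "1", horaires: "du lundi au vendredi, 9h-17h"}
    expect:
      rows:
        - {id: "1", horaires: "Mo-Fr 09:00-17:00"}
```

Note that an enforced contract requires a `data_type` on every column, while the unit test's `given` block replaces the upstream `ref` with mocked rows, so only the model's logic is exercised.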

#### which layer (`source`, `staging`, `intermediate`, `marts`) should I test?

It's better to test data early, so we can make assumptions on which we can later build.

Our `source` layer is essentially tables containing the raw data in jsonb `data` columns. While this is very handy for loading data, it is impractical to test with `data_tests`.

Therefore our tests start at the `staging` layer.

`staging`: use `data_tests` extensively. Assumptions about the data that downstream models rely on should be tested.

`intermediate`: use `data_tests` for primary keys and foreign keys. Use the generic tests `check_structure`, `check_service` and `check_address`.

`marts`: use `contracts` + generic tests `check_structure`, `check_service` and `check_address`.
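To exercise one layer at a time, dbt's selector methods can be intersected with a path. A small sketch (standard dbt selector syntax; the paths assume the layout above, and `test_type:unit` requires a dbt version with native unit tests):

```bash
# run only the generic data tests attached to staging models
dbt test --select "models/staging,test_type:generic"

# run only the unit tests across the project
dbt test --select "test_type:unit"
```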

#### which type of `data_tests` should I use?

* to stay manageable, our tests should be more or less uniform across the codebase.

✅ always use native `unique` and `not_null` for primary keys.

✅ always use `relationships` for foreign keys.

✅ use `not_null`, `dbt_utils.not_empty_string` and `dbt_utils.not_constant` when possible.

✅ use `accepted_values` for categorical columns from well-defined data.

❌ avoid `accepted_values` for categorical columns backed by lower-quality data, or downgrade the test severity to `warn`. Otherwise the test could fail too often.

✅ for simple cases, prefer predefined generic data tests over custom data tests (in `tests/`): they usually require less code and are easier to read, *unless* you want to test complex logic.
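Put together, the column-level conventions above could look roughly like the following sketch; the model, columns and accepted values are hypothetical:

```yaml
version: 2

models:
  - name: stg_example__services
    columns:
      - name: id
        data_tests:
          - unique
          - not_null
      - name: structure_id
        data_tests:
          - relationships:
              to: ref('stg_example__structures')
              field: id
      - name: nom
        data_tests:
          - not_null
          - dbt_utils.not_empty_string
          - dbt_utils.not_constant
      - name: thematique
        data_tests:
          - accepted_values:
              values: ["mobilite", "numerique"]
              config:
                severity: warn    # noisy source data: warn rather than fail
```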

## references

* https://www.datafold.com/blog/7-dbt-testing-best-practices
* https://docs.getdbt.com/best-practices
