Modify dbt test workflow and script to allow control over test selection and artifact output #603

Conversation

@jeancochrane jeancochrane commented Sep 17, 2024

Background

This PR adjusts a few behaviors of the test-dbt-models workflow and its underlying script run_iasworld_data_tests that I've come to find frustrating in the process of developing data tests:

  • The script and the workflow always run all data tests, even when I only want to run a subset of them
  • The workflow doesn't implement deferral, meaning it can't run on a branch unless you manually create all of the missing nodes
  • The script and the workflow always parse run metadata to produce output artifacts, even when I don't need them; as a result, script/workflow runs can take ~20 minutes longer than they otherwise would
  • The workflow always uploads test result artifacts to S3, even when the output is not significant (that is, basically any run except for the sqoop-bot's runs); as a result, we use S3 storage space that we don't need

Changes

We implement the following changes to improve the developer experience of using the script and the workflow (a rough sketch of the resulting script interface follows the list):

  • Allow the caller to select which tests will run
    • Adds --select and --selector arguments to the run_iasworld_data_tests.py script that override the default --selector select_data_test_iasworld behavior
    • Adds a select input variable to the test-dbt-models workflow that passes through to run_iasworld_data_tests.py
  • Support deferral so that the workflow can run on branches with no extra intervention
    • Adds --defer and --state arguments to the run_iasworld_data_tests.py script that pass through to the underlying dbt test call
    • Updates the test-dbt-models workflow to pull state from the remote cache and pass it through via --defer and --state on cache hit (similar to how the build-and-run-dbt workflow works)
  • Allow the caller to limit artifact creation and upload
    • Adds a --skip-artifacts/--no-skip-artifacts boolean option to run_iasworld_data_tests.py with the following behavior:
      • --skip-artifacts: Exits once tests have completed
      • --no-skip-artifacts (default): Parses test metadata and produces output artifacts (workbook and Parquet files)
    • Adds an upload_test_results input variable to the test-dbt-models workflow that controls whether --skip-artifacts or --no-skip-artifacts gets passed to the underlying script
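
To make the combined interface concrete, here is a rough sketch of how the new options could be wired up with argparse. This is not the actual script: the defaults are assumptions, and the dbt invocation and artifact parsing are stubbed out with prints:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--target", default="dev")
parser.add_argument("--output-dir", default="./qc_test_results/")
parser.add_argument("--select", nargs="*", help="dbt select statements")
parser.add_argument("--selector", help="Selector name from selectors.yml")
parser.add_argument("--defer", action="store_true")
parser.add_argument("--state", help="Path to dbt state artifacts for deferral")
parser.add_argument(
    "--skip-artifacts",
    action=argparse.BooleanOptionalAction,  # also generates --no-skip-artifacts
    default=False,
    help="Exit once tests have completed, skipping artifact output",
)
args = parser.parse_args()

print("Running dbt tests...")  # placeholder for the actual `dbt test` invocation
if args.skip_artifacts:
    raise SystemExit(0)  # --skip-artifacts: stop after tests complete
print("Parsing metadata and writing workbook/Parquet artifacts...")  # placeholder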

Testing

See these three workflow runs for examples of each set of behaviors:

@jeancochrane changed the title from "Modify test-dbt-models workflow to only upload test results when a var is set" to "Modify test-dbt-models workflow to allow control over upload behavior" on Sep 17, 2024
@jeancochrane force-pushed the jeancochrane/gate-test-result-s3-upload-behind-workflow-variable branch from 7027ae6 to a63ae63 on September 17, 2024 21:33
@jeancochrane changed the title from "Modify test-dbt-models workflow to allow control over upload behavior" to "Modify dbt test workflow and script to allow control over test selection and artifact output" on Sep 19, 2024
select:
  description: dbt select statement for tests to run
  required: false
  default: --selector select_data_test_iasworld
Contributor Author (@jeancochrane):

It's annoying that this design requires us to include the option name, but I think it's simpler to merge the --select and --selector options at this level of abstraction rather than requiring the workflow caller to handle two separate variables. I'm open to other design suggestions if you have them, though!

Member:

I think it's better to have two different args and then throw an early error if they're both specified. To me that's less confusing, a bit more explicit, and less error prone.

Contributor Author (@jeancochrane):

Good point! I realized that the --selector argument is unnecessary anyway, since we only have one selector (select_data_test_iasworld) and we can just default to it if the select argument is empty. If we ever add more selectors it should be easy to add support for a --selector argument here, but in the meantime, one input variable that overrides the default selector with a custom --select statement feels a lot clearer to me.
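
For illustration, a minimal sketch of that fallback, assuming exactly the behavior described above (the function name is hypothetical):

def build_selection_args(select=None):
    # Use the caller's --select statements when provided; otherwise fall back
    # to the single selector defined in selectors.yml.
    if select:
        return ["--select", *select]
    return ["--selector", "select_data_test_iasworld"]

print(build_selection_args(["iasworld_pardat_class_equals_luc"]))
# ['--select', 'iasworld_pardat_class_equals_luc']
print(build_selection_args(None))
# ['--selector', 'select_data_test_iasworld']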

upload_test_results:
  description: Upload test results to S3
  required: false
  default: true
Contributor Author (@jeancochrane):

Setting this to true by default means that we don't have to update service-spark-iasworld to handle this change.

Member:

suggestion (non-blocking): I actually think it's better that we update service-spark-iasworld and have this off by default. Otherwise, won't we constantly be uploading new results every time we trigger this workflow? That might make things difficult if our goal is to check volume over time.

Contributor Author (@jeancochrane):

> Otherwise, won't we constantly be uploading new results every time we trigger this workflow?

Not necessarily, unless workflow callers are leaving it checked. But I hear your point that it's a better nudge to leave it unchecked, and it's an easy change to update service-spark-iasworld to provide the argument anyway. I made that change and will put up a corresponding service-spark-iasworld PR.

Comment on lines +37 to +42
- name: Restore dbt state cache
  id: cache
  uses: ./.github/actions/restore_dbt_cache
  with:
    path: ${{ env.PROJECT_DIR }}/${{ env.STATE_DIR }}
    key: ${{ env.CACHE_KEY }}
Contributor Author (@jeancochrane):

I adapted this logic from the build-and-test-dbt workflow:

- name: Restore dbt state cache
  id: cache
  uses: ./.github/actions/restore_dbt_cache
  with:
    path: ${{ env.PROJECT_DIR }}/${{ env.STATE_DIR }}
    key: ${{ env.CACHE_KEY }}

Comment on lines 51 to 66
if [[ $CACHE_HIT == 'true' ]]; then
  # shellcheck disable=SC2086
  python3 scripts/run_iasworld_data_tests.py \
    --target "$TARGET" \
    --output-dir ./qc_test_results/ \
    $SELECT_OPTION \
    $SKIP_ARTIFACTS_OPTION \
    --defer --state "$STATE_DIR"
else
  # shellcheck disable=SC2086
  python3 scripts/run_iasworld_data_tests.py \
    --target "$TARGET" \
    --output-dir ./qc_test_results/ \
    $SELECT_OPTION \
    $SKIP_ARTIFACTS_OPTION
fi
Contributor Author (@jeancochrane):

It's annoying that these commands are identical, except that we add the --defer and --state flags in the case when CACHE_HIT is true. Is there a cleaner way to do this in bash that's still readable?

Member:

Maybe just do an inline conditional var that contains those options and is an empty string otherwise? Something like:

DEFER="$(if [[ $CACHE_HIT == 'true' ]]; then echo "--defer --state $STATE_DIR"; else echo ""; fi)"
python3 scripts/run_iasworld_data_tests.py \
  --target "$TARGET" \
  --output-dir ./qc_test_results/ \
  $SELECT_OPTION \
  $SKIP_ARTIFACTS_OPTION \
  $DEFER

Contributor Author (@jeancochrane):

Great idea, done!

Comment on lines 51 to -67
parser.add_argument(
    "--select",
    required=False,
    nargs="*",
    help="One or more dbt select statements to use for filtering models",
    *constants.SELECT_ARGUMENT_ARGS, **constants.SELECT_ARGUMENT_KWARGS
)
parser.add_argument(
    "--selector",
    required=False,
    help=(
        "A selector name to use for filtering models, as defined in "
        "selectors.yml. One of --select or --selector must be set, "
        "but they can't both be set"
    ),
Contributor Author (@jeancochrane):

Since I'm reusing these arguments in another script, I factor them out into the constants module.
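
For reference, the shared definitions in the constants module presumably look something like the following; the values are inferred from the original argparse call above rather than copied from the module:

SELECT_ARGUMENT_ARGS = ["--select"]
SELECT_ARGUMENT_KWARGS = {
    "required": False,
    "nargs": "*",
    "help": "One or more dbt select statements to use for filtering models",
}

# Any script can then reuse them:
# parser.add_argument(*constants.SELECT_ARGUMENT_ARGS, **constants.SELECT_ARGUMENT_KWARGS)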

Comment on lines +908 to +910
if not metadata_list:
    print(f"{tablename} is empty, skipping metadata output")
    continue
Contributor Author (@jeancochrane):

This fixes a bug that I hadn't noticed until implementing the --select and --selector options: If a test run passes, pyarrow will fail to create a Parquet file from test_run_failing_row, since there's no run_year to partition on.
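
To illustrate the failure mode and the guard, here is a standalone sketch (not the script's actual code; the output path is hypothetical):

import pyarrow as pa
import pyarrow.parquet as pq

metadata_list = []  # e.g. no failing rows because every selected test passed

if not metadata_list:
    # An empty list yields a table with no columns, so there is no run_year
    # column to partition on; skip the write entirely.
    print("test_run_failing_row is empty, skipping metadata output")
else:
    table = pa.Table.from_pylist(metadata_list)
    pq.write_to_dataset(
        table,
        "qc_test_results/test_run_failing_row",
        partition_cols=["run_year"],
    )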

Comment on lines +48 to +50
select_args = (
    ["--selector", selector] if selector else ["--select", *select]  # type: ignore
)
Contributor Author (@jeancochrane):

The dbt CLI seems to give selector precedence over select, so I've flipped our logic here to do the same.

Comment on lines +84 to +88
dirpath="${local_prefix}/${dir}"
if [ -e "$dirpath" ]; then
  echo "Copying ${dirpath} metadata to S3"
  aws s3 sync "$dirpath" "${s3_prefix}/${dir}"
fi
Contributor Author (@jeancochrane):

See below for a description of a bug I fixed by sometimes not outputting test_run_failing_row.

Contributor Author (@jeancochrane):

This file has been deprecated for a while and replaced by run_iasworld_data_tests.py, but it looks like I neglected to delete it.

@jeancochrane marked this pull request as ready for review on September 19, 2024 21:40
@jeancochrane requested a review from a team as a code owner on September 19, 2024 21:40

python3 scripts/run_iasworld_data_tests.py \
  --target "$TARGET" \
  --output-dir ./qc_test_results/ \
  $SELECT_OPTION \
Member:

issue: Since we're just passing --select blah_blah to this, couldn't someone pass any arbitrary command here? Seems potentially bad even ignoring malicious reasons.

Contributor Author (@jeancochrane):

Hmm, that's a great question! It certainly opens us up to workflow callers providing arbitrary options, although I don't actually think callers could short circuit the command itself and run an entirely different one. I made a few different attempts to hijack the command, which appeared to confirm that GitHub escapes everything correctly as a string:

  • iasworld_pardat_class_equals_luc; echo oops; exit 1; echo (logs)
  • "iasworld_pardat_class_equals_luc"; echo "oops"; exit 1; (logs)
  • "iasworld_pardat_class_equals_luc" && echo "oops" && exit 1 && (logs)

In each case, you can see that the workflow correctly interpreted each space-separated token as an argument to --select and not as its equivalent bash command or keyword. Can you think of another approach that might work for a bad actor who gained access to our workflows?

Either way, I refactored the input variable so that we always add a --select prefix, which should help prevent a caller from passing another option that isn't supported.

Contributor Author (@jeancochrane):

This should be ready for one more look once you're back next week @dfsnow!

@dfsnow (Member) left a comment:

The update looks fine to me @jeancochrane!

.github/workflows/test_dbt_models.yaml (review thread resolved)
@jeancochrane merged commit eb5bdf3 into master on Nov 7, 2024
9 of 10 checks passed
@jeancochrane deleted the jeancochrane/gate-test-result-s3-upload-behind-workflow-variable branch on November 7, 2024 21:05