Modify dbt test workflow and script to allow control over test selection and artifact output #603

Conversation

@jeancochrane jeancochrane commented Sep 17, 2024

Background

This PR adjusts a few behaviors of the test-dbt-models workflow and its underlying script run_iasworld_data_tests that I've come to find frustrating in the process of developing data tests:

  • The script and the workflow always run all data tests, even when I only want to run a subset of them
  • The workflow doesn't implement deferral, meaning it can't run on a branch unless you manually create all of the missing nodes
  • The script and the workflow always parse run metadata to produce output artifacts, even when I don't need them; as a result, script/workflow runs can take ~20 minutes longer than they otherwise would
  • The workflow always uploads test result artifacts to S3, even when the output is not significant (that is, basically any run except for the sqoop-bot's runs); as a result, we use S3 storage space that we don't need

Changes

We implement the following changes to improve the developer experience of using the script and the workflow (a rough sketch of the resulting script interface follows the list):

  • Allow the caller to select which tests will run
    • Adds --select and --selector arguments to the run_iasworld_data_tests.py script that override the default --selector select_data_test_iasworld behavior
    • Adds a select input variable to the test-dbt-models workflow that passes through to run_iasworld_data_tests.py
  • Support deferral so that the workflow can run on branches with no extra intervention
    • Adds --defer and --state arguments to the run_iasworld_data_tests.py script that pass through to the underlying dbt test call
    • Updates the test-dbt-models workflow to pull state from the remote cache and pass it through via --defer and --state on cache hit (similar to how the build-and-run-dbt workflow works)
  • Allow the caller to limit artifact creation and upload
    • Adds a --skip-artifacts/--no-skip-artifacts boolean option to run_iasworld_data_tests.py with the following behavior:
      • --skip-artifacts: Exits once tests have completed
      • --no-skip-artifacts (default): Parses test metadata and produces output artifacts (workbook and Parquet files)
    • Adds an upload_test_results input variable to the test-dbt-models workflow that controls whether --skip-artifacts or --no-skip-artifacts gets passed to the underlying script
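
To make the combined interface concrete, here is a rough sketch of how the new options could be wired up with argparse. This is not the actual script: the defaults are assumptions, and the dbt invocation and artifact parsing are stubbed out with prints:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--target", default="dev")
parser.add_argument("--output-dir", default="./qc_test_results/")
parser.add_argument("--select", nargs="*", help="dbt select statements")
parser.add_argument("--selector", help="Selector name from selectors.yml")
parser.add_argument("--defer", action="store_true")
parser.add_argument("--state", help="Path to dbt state artifacts for deferral")
parser.add_argument(
    "--skip-artifacts",
    action=argparse.BooleanOptionalAction,  # also generates --no-skip-artifacts
    default=False,
    help="Exit once tests have completed, skipping artifact output",
)
args = parser.parse_args()

print("Running dbt tests...")  # placeholder for the actual `dbt test` invocation
if args.skip_artifacts:
    raise SystemExit(0)  # --skip-artifacts: stop after tests complete
print("Parsing metadata and writing workbook/Parquet artifacts...")  # placeholder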

Testing

See these three workflow runs for examples of each set of behaviors:

@jeancochrane changed the title from "Modify test-dbt-models workflow to only upload test results when a var is set" to "Modify test-dbt-models workflow to allow control over upload behavior" on Sep 17, 2024
@jeancochrane force-pushed the jeancochrane/gate-test-result-s3-upload-behind-workflow-variable branch from 7027ae6 to a63ae63 on September 17, 2024 21:33
@jeancochrane changed the title from "Modify test-dbt-models workflow to allow control over upload behavior" to "Modify dbt test workflow and script to allow control over test selection and artifact output" on Sep 19, 2024
select:
  description: dbt select statement for tests to run
  required: false
  default: --selector select_data_test_iasworld
Contributor Author (@jeancochrane):

It's annoying that this design requires us to include the option name, but I think it's simpler to merge the --select and --selector options at this level of abstraction rather than requiring the workflow caller to handle two separate variables. I'm open to other design suggestions if you have them, though!

Member:

I think it's better to have two different args and then throw an early error if they're both specified. To me that's less confusing, a bit more explicit, and less error prone.

Contributor Author (@jeancochrane):

Good point! I realized that the --selector argument is unnecessary anyway, since we only have one selector (select_data_test_iasworld) and we can just default to it if the select argument is empty. If we ever add more selectors it should be easy to add support for a --selector argument here, but in the meantime, one input variable that overrides the default selector with a custom --select statement feels a lot clearer to me.
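
For illustration, a minimal sketch of that fallback, assuming exactly the behavior described above (the function name is hypothetical):

def build_selection_args(select=None):
    # Use the caller's --select statements when provided; otherwise fall back
    # to the single selector defined in selectors.yml.
    if select:
        return ["--select", *select]
    return ["--selector", "select_data_test_iasworld"]

print(build_selection_args(["iasworld_pardat_class_equals_luc"]))
# ['--select', 'iasworld_pardat_class_equals_luc']
print(build_selection_args(None))
# ['--selector', 'select_data_test_iasworld']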

upload_test_results:
  description: Upload test results to S3
  required: false
  default: true
Contributor Author (@jeancochrane):

Setting this to true by default means that we don't have to update service-spark-iasworld to handle this change.

Member:

suggestion (non-blocking): I actually think it's better that we update service-spark-iasworld and have this off by default. Otherwise, won't we constantly be uploading new results every time we trigger this workflow? That might make things difficult if our goal is to check volume over time.

Contributor Author (@jeancochrane):

> Otherwise, won't we constantly be uploading new results every time we trigger this workflow?

Not necessarily, unless workflow callers are leaving it checked. But I hear your point that it's a better nudge to leave it unchecked, and it's an easy change to update service-spark-iasworld to provide the argument anyway. I made that change and will put up a corresponding service-spark-iasworld PR.

Comment on lines +37 to +42
- name: Restore dbt state cache
  id: cache
  uses: ./.github/actions/restore_dbt_cache
  with:
    path: ${{ env.PROJECT_DIR }}/${{ env.STATE_DIR }}
    key: ${{ env.CACHE_KEY }}
Contributor Author (@jeancochrane):

I adapted this logic from the build-and-test-dbt workflow:

- name: Restore dbt state cache
  id: cache
  uses: ./.github/actions/restore_dbt_cache
  with:
    path: ${{ env.PROJECT_DIR }}/${{ env.STATE_DIR }}
    key: ${{ env.CACHE_KEY }}

Comment on lines 51 to 66
if [[ $CACHE_HIT == 'true' ]]; then
  # shellcheck disable=SC2086
  python3 scripts/run_iasworld_data_tests.py \
    --target "$TARGET" \
    --output-dir ./qc_test_results/ \
    $SELECT_OPTION \
    $SKIP_ARTIFACTS_OPTION \
    --defer --state "$STATE_DIR"
else
  # shellcheck disable=SC2086
  python3 scripts/run_iasworld_data_tests.py \
    --target "$TARGET" \
    --output-dir ./qc_test_results/ \
    $SELECT_OPTION \
    $SKIP_ARTIFACTS_OPTION
fi
Contributor Author (@jeancochrane):

It's annoying that these commands are identical, except that we add the --defer and --state flags in the case when CACHE_HIT is true. Is there a cleaner way to do this in bash that's still readable?

Member:

Maybe just do an inline conditional var that contains those options and is an empty string otherwise? Something like:

DEFER="$(if [[ $CACHE_HIT == 'true' ]]; then echo "--defer --state $STATE_DIR"; else echo ""; fi)"
python3 scripts/run_iasworld_data_tests.py \
  --target "$TARGET" \
  --output-dir ./qc_test_results/ \
  $SELECT_OPTION \
  $SKIP_ARTIFACTS_OPTION \
  $DEFER

Contributor Author (@jeancochrane):

Great idea, done!

Comment on lines 51 to -67
parser.add_argument(
    "--select",
    required=False,
    nargs="*",
    help="One or more dbt select statements to use for filtering models",
    *constants.SELECT_ARGUMENT_ARGS, **constants.SELECT_ARGUMENT_KWARGS
)
parser.add_argument(
    "--selector",
    required=False,
    help=(
        "A selector name to use for filtering models, as defined in "
        "selectors.yml. One of --select or --selector must be set, "
        "but they can't both be set"
    ),
Contributor Author (@jeancochrane):

Since I'm reusing these arguments in another script, I factor them out into the constants module.
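
For reference, the shared definitions in the constants module presumably look something like the following; the values are inferred from the original argparse call above rather than copied from the module:

SELECT_ARGUMENT_ARGS = ["--select"]
SELECT_ARGUMENT_KWARGS = {
    "required": False,
    "nargs": "*",
    "help": "One or more dbt select statements to use for filtering models",
}

# Any script can then reuse them:
# parser.add_argument(*constants.SELECT_ARGUMENT_ARGS, **constants.SELECT_ARGUMENT_KWARGS)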

Comment on lines +908 to +910
if not metadata_list:
    print(f"{tablename} is empty, skipping metadata output")
    continue
Contributor Author (@jeancochrane):

This fixes a bug that I hadn't noticed until implementing the --select and --selector options: If a test run passes, pyarrow will fail to create a Parquet file from test_run_failing_row, since there's no run_year to partition on.
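
To illustrate the failure mode and the guard, here is a standalone sketch (not the script's actual code; the output path is hypothetical):

import pyarrow as pa
import pyarrow.parquet as pq

metadata_list = []  # e.g. no failing rows because every selected test passed

if not metadata_list:
    # An empty list yields a table with no columns, so there is no run_year
    # column to partition on; skip the write entirely.
    print("test_run_failing_row is empty, skipping metadata output")
else:
    table = pa.Table.from_pylist(metadata_list)
    pq.write_to_dataset(
        table,
        "qc_test_results/test_run_failing_row",
        partition_cols=["run_year"],
    )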

Comment on lines +48 to +50
select_args = (
    ["--selector", selector] if selector else ["--select", *select]  # type: ignore
)
Contributor Author (@jeancochrane):

The dbt CLI seems to give selector precedence over select, so I've flipped our logic here to do the same.

Comment on lines +84 to +88
dirpath="${local_prefix}/${dir}"
if [ -e "$dirpath" ]; then
  echo "Copying ${dirpath} metadata to S3"
  aws s3 sync "$dirpath" "${s3_prefix}/${dir}"
fi
Contributor Author (@jeancochrane):

See below for a description of a bug I fixed by sometimes not outputting test_run_failing_row.

Contributor Author (@jeancochrane):

This file has been deprecated for a while and replaced by run_iasworld_data_tests.py, but it looks like I neglected to delete it.

@jeancochrane marked this pull request as ready for review on September 19, 2024 21:40
@jeancochrane requested a review from a team as a code owner on September 19, 2024 21:40

python3 scripts/run_iasworld_data_tests.py \
  --target "$TARGET" \
  --output-dir ./qc_test_results/ \
  $SELECT_OPTION \
Member:

issue: Since we're just passing --select blah_blah to this, couldn't someone pass any arbitrary command here? Seems potentially bad even ignoring malicious reasons.

Contributor Author (@jeancochrane):

Hmm, that's a great question! It certainly opens us up to workflow callers providing arbitrary options, although I don't actually think callers could short circuit the command itself and run an entirely different one. I made a few different attempts to hijack the command, which appeared to confirm that GitHub escapes everything correctly as a string:

  • iasworld_pardat_class_equals_luc; echo oops; exit 1; echo (logs)
  • "iasworld_pardat_class_equals_luc"; echo "oops"; exit 1; (logs)
  • "iasworld_pardat_class_equals_luc" && echo "oops" && exit 1 && (logs)

In each case, you can see that the workflow correctly interpreted each space-separated token as an argument to --select and not as its equivalent bash command or keyword. Can you think of another approach that might work for a bad actor who gained access to our workflows?

Either way, I refactored the input variable so that we always add a --select prefix, which should help prevent a caller from passing another option that isn't supported.

Contributor Author (@jeancochrane):

This should be ready for one more look once you're back next week @dfsnow!

@dfsnow (Member) left a comment:

The update looks fine to me @jeancochrane!

.github/workflows/test_dbt_models.yaml (review thread resolved)
@jeancochrane merged commit eb5bdf3 into master on Nov 7, 2024
9 of 10 checks passed
@jeancochrane deleted the jeancochrane/gate-test-result-s3-upload-behind-workflow-variable branch on November 7, 2024 21:05