Modify dbt test workflow and script to allow control over test selection and artifact output #603
Conversation
Force-push: 7027ae6 to a63ae63, rewording the `test-dbt-models` workflow commit from "…only upload test results when a var is set" to "…allow control over upload behavior".
```yaml
select:
  description: dbt select statement for tests to run
  required: false
  default: --selector select_data_test_iasworld
```
It's annoying that this design requires us to include the option name, but I think it's simpler to merge the `--select` and `--selector` options at this level of abstraction rather than requiring the workflow caller to handle two separate variables. I'm open to other design suggestions if you have them, though!
I think it's better to have two different args and then throw an early error if they're both specified. To me that's less confusing, a bit more explicit, and less error-prone.
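As an illustration, a minimal argparse sketch of this early-error approach, matching the mutual-exclusion behavior described in the script's help text further down; the surrounding setup is assumed and may differ from the script's actual implementation:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--select",
    required=False,
    nargs="*",
    help="One or more dbt select statements to use for filtering models",
)
parser.add_argument(
    "--selector",
    required=False,
    help="A selector name to use for filtering models",
)
args = parser.parse_args()

# Fail fast if the caller sets both filtering options, or neither
if args.select and args.selector:
    parser.error("only one of --select or --selector may be set")
if not (args.select or args.selector):
    parser.error("one of --select or --selector must be set")
```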
Good point! I realized that the `--selector` argument is unnecessary anyway, since we only have one selector (`select_data_test_iasworld`) and we can just default to it if the `select` argument is empty. If we ever add more selectors it should be easy to add support for a `--selector` argument here, but in the meantime, one input variable that overrides the default selector with a custom `--select` statement feels a lot clearer to me.
```yaml
upload_test_results:
  description: Upload test results to S3
  required: false
  default: true
```
Setting this to `true` by default means that we don't have to update `service-spark-iasworld` to handle this change.
suggestion (non-blocking): I actually think it's better that we update `service-spark-iasworld` and have this off by default. Otherwise, won't we constantly be uploading new results every time we trigger this workflow? That might make things difficult if our goal is to check volume over time.
> Otherwise, won't we constantly be uploading new results every time we trigger this workflow?

Not necessarily, unless workflow callers are leaving it checked. But I hear your point that it's a better nudge to leave it unchecked, and it's an easy change to update `service-spark-iasworld` to provide the argument anyway. I made that change and will put up a corresponding `service-spark-iasworld` PR.
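A sketch of how the workflow might map this input to the script's artifact flag; the actual step isn't shown in this thread, so the variable names are assumptions. This is one plausible source of the `$SKIP_ARTIFACTS_OPTION` used in the run step below:

```bash
# Hypothetical mapping from the upload_test_results workflow input to the
# script's boolean artifact option
if [[ "$UPLOAD_TEST_RESULTS" == "true" ]]; then
  SKIP_ARTIFACTS_OPTION="--no-skip-artifacts"
else
  SKIP_ARTIFACTS_OPTION="--skip-artifacts"
fi
```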
```yaml
- name: Restore dbt state cache
  id: cache
  uses: ./.github/actions/restore_dbt_cache
  with:
    path: ${{ env.PROJECT_DIR }}/${{ env.STATE_DIR }}
    key: ${{ env.CACHE_KEY }}
```
I adapted this logic from the `build-and-test-dbt` workflow (`data-architecture/.github/workflows/build_and_test_dbt.yaml`, lines 35 to 40 at ac90562):

```yaml
- name: Restore dbt state cache
  id: cache
  uses: ./.github/actions/restore_dbt_cache
  with:
    path: ${{ env.PROJECT_DIR }}/${{ env.STATE_DIR }}
    key: ${{ env.CACHE_KEY }}
```
```bash
if [[ $CACHE_HIT == 'true' ]]; then
  # shellcheck disable=SC2086
  python3 scripts/run_iasworld_data_tests.py \
    --target "$TARGET" \
    --output-dir ./qc_test_results/ \
    $SELECT_OPTION \
    $SKIP_ARTIFACTS_OPTION \
    --defer --state "$STATE_DIR"
else
  # shellcheck disable=SC2086
  python3 scripts/run_iasworld_data_tests.py \
    --target "$TARGET" \
    --output-dir ./qc_test_results/ \
    $SELECT_OPTION \
    $SKIP_ARTIFACTS_OPTION
fi
```
It's annoying that these commands are identical, except that we add the `--defer` and `--state` flags in the case when `CACHE_HIT` is `true`. Is there a cleaner way to do this in bash that's still readable?
Maybe just do an inline conditional var that contains those options and is an empty string otherwise? Something like:

```bash
DEFER="$(if [[ $CACHE_HIT == 'true' ]]; then echo "--defer --state $STATE_DIR"; else echo ""; fi)"
python3 scripts/run_iasworld_data_tests.py \
  --target "$TARGET" \
  --output-dir ./qc_test_results/ \
  $SELECT_OPTION \
  $SKIP_ARTIFACTS_OPTION \
  $DEFER
```
Great idea, done!
```diff
 parser.add_argument(
-    "--select",
-    required=False,
-    nargs="*",
-    help="One or more dbt select statements to use for filtering models",
+    *constants.SELECT_ARGUMENT_ARGS, **constants.SELECT_ARGUMENT_KWARGS
 )
 parser.add_argument(
     "--selector",
     required=False,
     help=(
         "A selector name to use for filtering models, as defined in "
         "selectors.yml. One of --select or --selector must be set, "
         "but they can't both be set"
     ),
 )
```
Since I'm reusing these arguments in another script, I factored them out into the `constants` module.
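For reference, a sketch of what the factored-out constants might look like; the actual contents of the `constants` module aren't shown in this diff, so the values here are assumptions based on the removed literals:

```python
# constants.py (hypothetical sketch; the real module isn't shown in the PR)
# Positional and keyword arguments for the shared --select argparse argument,
# so multiple scripts can register it identically via:
#     parser.add_argument(*SELECT_ARGUMENT_ARGS, **SELECT_ARGUMENT_KWARGS)
SELECT_ARGUMENT_ARGS = ["--select"]
SELECT_ARGUMENT_KWARGS = {
    "required": False,
    "nargs": "*",
    "help": "One or more dbt select statements to use for filtering models",
}
```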
```python
if not metadata_list:
    print(f"{tablename} is empty, skipping metadata output")
    continue
```
This fixes a bug that I hadn't noticed until implementing the `--select` and `--selector` options: if a test run passes, pyarrow will fail to create a Parquet file from `test_run_failing_row`, since there's no `run_year` to partition on.
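For context, a sketch of the guard inside a hypothetical write loop; the variable names and write call here are illustrative, not the script's actual code:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Illustrative input: per-table lists of failing-row metadata dicts
metadata_by_table = {
    "iasworld_pardat": [{"run_year": 2024, "test_name": "class_equals_luc"}],
    "iasworld_htagnt": [],  # a passing run: no failing rows
}

for tablename, metadata_list in metadata_by_table.items():
    if not metadata_list:
        # A fully passing run produces no failing rows, so there are no
        # run_year values to partition on and the write below would fail
        print(f"{tablename} is empty, skipping metadata output")
        continue

    # Write failing-row metadata as a Parquet dataset partitioned by run_year
    table = pa.Table.from_pylist(metadata_list)
    pq.write_to_dataset(
        table,
        root_path=f"qc_test_results/{tablename}",
        partition_cols=["run_year"],
    )
```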
```python
select_args = (
    ["--selector", selector] if selector else ["--select", *select]  # type: ignore
)
```
The dbt CLI seems to give `selector` precedence over `select`, so I've flipped our logic here to do the same.
```bash
dirpath="${local_prefix}/${dir}"
if [ -e "$dirpath" ]; then
  echo "Copying ${dirpath} metadata to S3"
  aws s3 sync "$dirpath" "${s3_prefix}/${dir}"
fi
```
See the thread above for a description of a bug I fixed by sometimes not outputting `test_run_failing_row`.
This file has been deprecated for a while and replaced by `run_iasworld_data_tests.py`, but it looks like I neglected to delete it.
```bash
python3 scripts/run_iasworld_data_tests.py \
  --target "$TARGET" \
  --output-dir ./qc_test_results/ \
  $SELECT_OPTION \
```
issue: Since we're just passing `--select blah_blah` to this, couldn't someone pass any arbitrary command here? Seems potentially bad even ignoring malicious reasons.
Hmm, that's a great question! It certainly opens us up to workflow callers providing arbitrary options, although I don't actually think callers could short-circuit the command itself and run an entirely different one. I made a few different attempts to hijack the command, which appeared to confirm that GitHub escapes everything correctly as a string:

- `iasworld_pardat_class_equals_luc; echo oops; exit 1; echo` (logs)
- `"iasworld_pardat_class_equals_luc"; echo "oops"; exit 1;` (logs)
- `"iasworld_pardat_class_equals_luc" && echo "oops" && exit 1 &&` (logs)

In each case, you can see that the workflow correctly interpreted each space-separated token as an argument to `--select` and not as its equivalent bash command or keyword. Can you think of another approach that might work for a bad actor who gained access to our workflows?
Either way, I refactored the input variable so that we always add a `--select` prefix, which should help prevent a caller from passing another option that isn't supported.
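A sketch of that pattern, assuming the input reaches the step via `env:` rather than direct `${{ }}` interpolation; the step and variable names are illustrative:

```yaml
- name: Run dbt data tests
  shell: bash
  env:
    # Passing the input as an environment variable (instead of splicing
    # ${{ inputs.select }} into the script text) means GitHub treats the
    # value as data, never as part of the shell command itself
    SELECT: ${{ inputs.select }}
  run: |
    # Unconditional --select prefix helps ensure the input is consumed as
    # select statements rather than a different option, falling back to
    # the lone selector when the input is empty
    if [[ -n "$SELECT" ]]; then
      SELECT_OPTION="--select $SELECT"
    else
      SELECT_OPTION="--selector select_data_test_iasworld"
    fi
    # shellcheck disable=SC2086
    python3 scripts/run_iasworld_data_tests.py $SELECT_OPTION
```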
This should be ready for one more look once you're back next week @dfsnow!
The update looks fine to me @jeancochrane!
Background

This PR adjusts a few behaviors of the `test-dbt-models` workflow and its underlying script `run_iasworld_data_tests` that I've come to find frustrating in the process of developing data tests.

Changes
We implement the following changes to improve the developer experience of using the script and the workflow (see the usage sketch after this list):

- New `--select` and `--selector` arguments to the `run_iasworld_data_tests.py` script that override the default `--selector select_data_test_iasworld` behavior
- A new `select` input variable to the `test-dbt-models` workflow that passes through to `run_iasworld_data_tests.py`
- New `--defer` and `--state` arguments to the `run_iasworld_data_tests.py` script that pass through to the underlying `dbt test` call
- An update to the `test-dbt-models` workflow to pull state from the remote cache and pass it through via `--defer` and `--state` on cache hit (similar to how the `build-and-run-dbt` workflow works)
- A new `--skip-artifacts`/`--no-skip-artifacts` boolean option for `run_iasworld_data_tests.py` with the following behavior:
  - `--skip-artifacts`: Exits once tests have completed
  - `--no-skip-artifacts` (default): Parses test metadata and produces output artifacts (workbook and Parquet files)
- A new `upload_test_results` input variable to the `test-dbt-models` workflow that controls whether `--skip-artifacts` or `--no-skip-artifacts` gets passed to the underlying script
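As an example of how the new script options compose, a sketch of a local invocation; the `dev` target and state path are assumptions, while the flags themselves are the ones introduced above:

```bash
# Run one test against cached state, skipping artifact generation
python3 scripts/run_iasworld_data_tests.py \
  --target dev \
  --select iasworld_htagnt_unique_by_agent \
  --skip-artifacts \
  --defer --state ./state/
```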
Testing

See these three workflow runs for examples of each set of behaviors:

- `select="--select iasworld_htagnt_unique_by_agent"` and `upload_test_results=false`: One test runs, no artifacts generated or uploaded
- `select="--selector data_test_iasworld"` and `upload_test_results=false`: All tests run, no artifacts generated or uploaded
- `select="--selector data_test_iasworld"` and `upload_test_results=true`: All tests run, artifacts generated and uploaded