From 1c33f8902b2a7dc3fdf8923a5c2296ad5aba3c8d Mon Sep 17 00:00:00 2001 From: Kim Andrews <17375001+kimandrews@users.noreply.github.com> Date: Fri, 5 Apr 2024 14:33:35 -0700 Subject: [PATCH 1/8] Start GH Action workflow for automation Currently just runs the ingest workflow and uploads the results to AWS S3. Subsequent commits will add automation for the phylogenetic workflow. Follows Zika PR #52 nextstrain/zika@d44f2ae --- .github/workflows/ingest-to-phylogenetic.yaml | 40 +++++++++++++++++++ 1 file changed, 40 insertions(+) create mode 100644 .github/workflows/ingest-to-phylogenetic.yaml diff --git a/.github/workflows/ingest-to-phylogenetic.yaml b/.github/workflows/ingest-to-phylogenetic.yaml new file mode 100644 index 0000000..1177c25 --- /dev/null +++ b/.github/workflows/ingest-to-phylogenetic.yaml @@ -0,0 +1,40 @@ +name: Ingest to phylogenetic + +defaults: + run: + # This is the same as GitHub Action's `bash` keyword as of 20 June 2023: + # https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#jobsjob_idstepsshell + # + # Completely spelling it out here so that GitHub can't change it out from under us + # and we don't have to refer to the docs to know the expected behavior. + shell: bash --noprofile --norc -eo pipefail {0} + +on: + workflow_dispatch: + +jobs: + ingest: + permissions: + id-token: write + uses: nextstrain/.github/.github/workflows/pathogen-repo-build.yaml@master + secrets: inherit + with: + # Starting with the default docker runtime + # We can migrate to AWS Batch when/if we need to for more resources or if + # the job runs longer than the GH Action limit of 6 hours. 
+ runtime: docker + run: | + nextstrain build \ + --env AWS_ACCESS_KEY_ID \ + --env AWS_SECRET_ACCESS_KEY \ + ingest \ + upload_all \ + --configfile build-configs/nextstrain-automation/config.yaml + # Specifying artifact name to differentiate ingest build outputs from + # the phylogenetic build outputs + artifact-name: ingest-build-output + artifact-paths: | + ingest/results/ + ingest/benchmarks/ + ingest/logs/ + ingest/.snakemake/log/ From cf3922123cb28062b6c0eb37d1c55e467c3d346b Mon Sep 17 00:00:00 2001 From: Kim Andrews <17375001+kimandrews@users.noreply.github.com> Date: Fri, 5 Apr 2024 14:43:32 -0700 Subject: [PATCH 2/8] ingest-to-phylogenetic: Add phylogenetic job The phylogenetic workflow will run after the ingest workflow has completed successfully to use the latest available data. Subsequent commits will check if the ingest results included new data to only run the phylogenetic workflow when there's new data. Following Zika PR #52 nextstrain/zika@2c415e7 --- .github/workflows/ingest-to-phylogenetic.yaml | 31 +++++++++++++++++++ 1 file changed, 31 insertions(+) diff --git a/.github/workflows/ingest-to-phylogenetic.yaml b/.github/workflows/ingest-to-phylogenetic.yaml index 1177c25..9ef89a6 100644 --- a/.github/workflows/ingest-to-phylogenetic.yaml +++ b/.github/workflows/ingest-to-phylogenetic.yaml @@ -38,3 +38,34 @@ jobs: ingest/benchmarks/ ingest/logs/ ingest/.snakemake/log/ + + # TKTK check if ingest results include new data + # potentially use actions/cache to store Metadata.sha256sum of S3 files + + phylogenetic: + needs: [ingest] + permissions: + id-token: write + uses: nextstrain/.github/.github/workflows/pathogen-repo-build.yaml@master + secrets: inherit + with: + # Starting with the default docker runtime + # We can migrate to AWS Batch when/if we need to for more resources or if + # the job runs longer than the GH Action limit of 6 hours. 
+ runtime: docker + run: | + nextstrain build \ + --env AWS_ACCESS_KEY_ID \ + --env AWS_SECRET_ACCESS_KEY \ + phylogenetic \ + deploy_all \ + --configfile build-configs/nextstrain-automation/config.yaml + # Specifying artifact name to differentiate ingest build outputs from + # the phylogenetic build outputs + artifact-name: phylogenetic-build-output + artifact-paths: | + phylogenetic/auspice/ + phylogenetic/results/ + phylogenetic/benchmarks/ + phylogenetic/logs/ + phylogenetic/.snakemake/log/ From 2a28964edcd2d5044f4daae551e5ea34d66b1262 Mon Sep 17 00:00:00 2001 From: Kim Andrews <17375001+kimandrews@users.noreply.github.com> Date: Fri, 5 Apr 2024 14:51:46 -0700 Subject: [PATCH 3/8] ingest-to-phylogenetic: Use cache to check new data Uses GitHub Actions cache to store a file that contains the `Metadata.sha256sum` of the ingest files on S3 and uses the `hashFiles` function to create a unique cache key. MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Then the existence of the cache key is an indicator that the ingest file contents have not been updated since a previous run on GH Actions. This does come with a big caveat that GH will remove any cache entries that have not been accessed in over 7 days.¹ If the workflow is not run automatically at least once every 7 days, then it will always run the phylogenetic job. If this works well, then we may want to consider moving this within the `pathogen-repo-build` reusable workflow to have the same functionality across pathogen automation workflows.
¹ https://docs.github.com/en/actions/using-workflows/caching-dependencies-to-speed-up-workflows#usage-limits-and-eviction-policy Follows Zika PR #52 nextstrain/zika@eb5e76d --- .github/workflows/ingest-to-phylogenetic.yaml | 45 +++++++++++++++++-- 1 file changed, 42 insertions(+), 3 deletions(-) diff --git a/.github/workflows/ingest-to-phylogenetic.yaml b/.github/workflows/ingest-to-phylogenetic.yaml index 9ef89a6..2b4105d 100644 --- a/.github/workflows/ingest-to-phylogenetic.yaml +++ b/.github/workflows/ingest-to-phylogenetic.yaml @@ -39,11 +39,50 @@ jobs: ingest/logs/ ingest/.snakemake/log/ - # TKTK check if ingest results include new data - # potentially use actions/cache to store Metadata.sha256sum of S3 files + # Check if ingest results include new data by checking for the cache + # of the file with the results' Metadata.sha256sum (which should have been added within upload-to-s3) + # GitHub will remove any cache entries that have not been accessed in over 7 days, + # so if the workflow has not been run over 7 days then it will trigger phylogenetic.
+ check-new-data: + needs: [ingest] + runs-on: ubuntu-latest + outputs: + cache-hit: ${{ steps.check-cache.outputs.cache-hit }} + steps: + - name: Get sha256sum + id: get-sha256sum + run: | + s3_urls=( + "s3://nextstrain-data/files/workflows/measles/metadata.tsv.zst" + "s3://nextstrain-data/files/workflows/measles/sequences.fasta.zst" + ) + + # Code below is modified from ingest/upload-to-s3 + # https://github.com/nextstrain/ingest/blob/c0b4c6bb5e6ccbba86374d2c09b42077768aac23/upload-to-s3#L23-L29 + + + no_hash=0000000000000000000000000000000000000000000000000000000000000000 + + for s3_url in "${s3_urls[@]}"; do + s3path="${s3_url#s3://}" + bucket="${s3path%%/*}" + key="${s3path#*/}" + s3_hash="$(aws s3api head-object --no-sign-request --bucket "$bucket" --key "$key" --query Metadata.sha256sum --output text 2>/dev/null || echo "$no_hash")" + echo "${s3_hash}" >> ingest-output-sha256sum + done + + - name: Check cache + id: check-cache + uses: actions/cache@v4 + with: + path: ingest-output-sha256sum + key: ingest-output-sha256sum-${{ hashFiles('ingest-output-sha256sum') }} + lookup-only: true + phylogenetic: - needs: [ingest] + needs: [check-new-data] + if: ${{ needs.check-new-data.outputs.cache-hit != 'true' }} permissions: id-token: write uses: nextstrain/.github/.github/workflows/pathogen-repo-build.yaml@master From bb1abd13f10f925f9e974d2700b6b3086e9cac40 Mon Sep 17 00:00:00 2001 From: Kim Andrews <17375001+kimandrews@users.noreply.github.com> Date: Fri, 5 Apr 2024 15:15:09 -0700 Subject: [PATCH 4/8] ingest-to-phylo: Add inputs for Docker image Add individual inputs per workflow to override the default Docker image used by `nextstrain build`. Having this input has been extremely helpful to continue running pathogen workflows when we run into new bugs that are not present in older nextstrain-base images. There are separate image inputs for the two workflows because they use different tools and may require different versions of images.
Follows Zika PR #52 nextstrain/zika@65a8acc --- .github/workflows/ingest-to-phylogenetic.yaml | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/.github/workflows/ingest-to-phylogenetic.yaml b/.github/workflows/ingest-to-phylogenetic.yaml index 2b4105d..d32cdcd 100644 --- a/.github/workflows/ingest-to-phylogenetic.yaml +++ b/.github/workflows/ingest-to-phylogenetic.yaml @@ -11,6 +11,13 @@ defaults: on: workflow_dispatch: + inputs: + ingest_image: + description: 'Specific container image to use for ingest workflow (will override the default of "nextstrain build")' + required: false + phylogenetic_image: + description: 'Specific container image to use for phylogenetic workflow (will override the default of "nextstrain build")' + required: false jobs: ingest: @@ -23,6 +30,8 @@ jobs: # We can migrate to AWS Batch when/if we need to for more resources or if # the job runs longer than the GH Action limit of 6 hours. runtime: docker + env: | + NEXTSTRAIN_DOCKER_IMAGE: ${{ inputs.ingest_image }} run: | nextstrain build \ --env AWS_ACCESS_KEY_ID \ @@ -92,6 +101,8 @@ jobs: # We can migrate to AWS Batch when/if we need to for more resources or if # the job runs longer than the GH Action limit of 6 hours. runtime: docker + env: | + NEXTSTRAIN_DOCKER_IMAGE: ${{ inputs.phylogenetic_image }} run: | nextstrain build \ --env AWS_ACCESS_KEY_ID \ From ce7f5bc55455ce4f3366d9b7e0910d98795b6fc7 Mon Sep 17 00:00:00 2001 From: Kim Andrews <17375001+kimandrews@users.noreply.github.com> Date: Fri, 5 Apr 2024 15:20:47 -0700 Subject: [PATCH 5/8] ingest-to-phylo: Add schedule Copied daily schedule of mpox ingest https://github.com/nextstrain/mpox/blob/e439235ff1c1d66e7285b774e9536e2896d9cd2f/.github/workflows/fetch-and-ingest.yaml#L4-L21 Daily runs seem fine since the ingest workflow currently takes less than 2 minutes to complete and it will not trigger the phylogenetic workflow if there's no new data. We can bring this down to once a week if it seems like overkill. 
Follows Zika PR #52 nextstrain/zika@77ca1d4 --- .github/workflows/ingest-to-phylogenetic.yaml | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+) diff --git a/.github/workflows/ingest-to-phylogenetic.yaml b/.github/workflows/ingest-to-phylogenetic.yaml index d32cdcd..72f9ae3 100644 --- a/.github/workflows/ingest-to-phylogenetic.yaml +++ b/.github/workflows/ingest-to-phylogenetic.yaml @@ -10,6 +10,25 @@ defaults: shell: bash --noprofile --norc -eo pipefail {0} on: + schedule: + # Note times are in UTC, which is 1 or 2 hours behind CET depending on daylight savings. + # + # Note the actual runs might be late. + # Numerous people were confused, about that, including me: + # - https://github.community/t/scheduled-action-running-consistently-late/138025/11 + # - https://github.com/github/docs/issues/3059 + # + # Note, '*' is a special character in YAML, so you have to quote this string. + # + # Docs: + # - https://docs.github.com/en/actions/learn-github-actions/events-that-trigger-workflows#schedule + # + # Tool that deciphers this particular format of crontab string: + # - https://crontab.guru/ + # + # Runs at 4pm UTC (12pm EDT) since curation by NCBI happens on the East Coast. 
+ - cron: '0 16 * * *' + workflow_dispatch: inputs: ingest_image: From ec566f934ca0fcb29a4bf57b9b735192ee4702e5 Mon Sep 17 00:00:00 2001 From: Kim Andrews <17375001+kimandrews@users.noreply.github.com> Date: Tue, 9 Apr 2024 15:53:24 -0700 Subject: [PATCH 6/8] ingest-to-phylogenetic: tee hash for easy debugging Follows Zika PR #52 nextstrain/zika@cdd071e --- .github/workflows/ingest-to-phylogenetic.yaml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.github/workflows/ingest-to-phylogenetic.yaml b/.github/workflows/ingest-to-phylogenetic.yaml index 72f9ae3..b4f2e32 100644 --- a/.github/workflows/ingest-to-phylogenetic.yaml +++ b/.github/workflows/ingest-to-phylogenetic.yaml @@ -97,7 +97,7 @@ jobs: key="${s3path#*/}" s3_hash="$(aws s3api head-object --no-sign-request --bucket "$bucket" --key "$key" --query Metadata.sha256sum --output text 2>/dev/null || echo "$no_hash")" - echo "${s3_hash}" >> ingest-output-sha256sum + echo "${s3_hash}" | tee -a ingest-output-sha256sum done - name: Check cache From 671df7719816c404f8b373ef47a3b92ea35a573b Mon Sep 17 00:00:00 2001 From: Kim Andrews <17375001+kimandrews@users.noreply.github.com> Date: Tue, 9 Apr 2024 16:03:58 -0700 Subject: [PATCH 7/8] ingest-to-phylogenetic: Add AWS_DEFAULT_REGION MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Follows Zika PR #52 nextstrain/zika@f615170 Uses the variable `AWS_DEFAULT_REGION` that was added to the Nextstrain GitHub organization variables.¹ ¹ https://github.com/organizations/nextstrain/settings/variables/actions --- .github/workflows/ingest-to-phylogenetic.yaml | 2 ++ 1 file changed, 2 insertions(+) diff --git a/.github/workflows/ingest-to-phylogenetic.yaml b/.github/workflows/ingest-to-phylogenetic.yaml index b4f2e32..87dc075 100644 --- a/.github/workflows/ingest-to-phylogenetic.yaml +++ b/.github/workflows/ingest-to-phylogenetic.yaml @@ -79,6 +79,8 @@ jobs: steps: - name: Get sha256sum id: get-sha256sum + env: + 
AWS_DEFAULT_REGION: ${{ vars.AWS_DEFAULT_REGION }} run: | s3_urls=( "s3://nextstrain-data/files/workflows/measles/metadata.tsv.zst" From 0d42723c623b72c862edd39d39cdba2cdb9eaf99 Mon Sep 17 00:00:00 2001 From: Kim Andrews <17375001+kimandrews@users.noreply.github.com> Date: Wed, 10 Apr 2024 13:42:07 -0700 Subject: [PATCH 8/8] update Changelog --- CHANGES.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/CHANGES.md b/CHANGES.md index f65d6d8..4760a6c 100644 --- a/CHANGES.md +++ b/CHANGES.md @@ -1,6 +1,7 @@ # CHANGELOG -* 2 April 2024: Add nextstrain-automation build-configs for deploying the final Auspice dataset of the phylogenetic workflow -* 1 April 2024: Create a tree using the 450 nucleotides encoding the carboxyl-terminal 150 amino acids of the nucleoprotein (N450), which is highly represented on NCBI for measles. [PR #20](https://github.com/nextstrain/measles/pull/20) +* 10 April 2024: Add a single GH Action workflow to automate the ingest and phylogenetic workflows [PR #22](https://github.com/nextstrain/measles/pull/22) +* 2 April 2024: Add nextstrain-automation build-configs for deploying the final Auspice dataset of the phylogenetic workflow [PR #21](https://github.com/nextstrain/measles/pull/21) +* 1 April 2024: Create a "N450" tree using the 450 nucleotides encoding the carboxyl-terminal 150 amino acids of the nucleoprotein, which is highly represented on NCBI for measles. [PR #20](https://github.com/nextstrain/measles/pull/20) * 15 March 2024: Connect ingest and phylogenetic workflows to follow the pathogen-repo-guide by uploading ingest output to S3, downloading ingest output from S3 to phylogenetic directory, using "accession" column as the ID column, and using a color scheme that matches the new region name format. [PR #19](https://github.com/nextstrain/measles/pull/19) * 1 March 2024: Add phylogenetic directory to follow the pathogen-repo-guide, and update the CI workflow to match the new file structure. 
[PR #18](https://github.com/nextstrain/measles/pull/18) * 14 February 2024: Add ingest directory from pathogen-repo-guide and make measles-specific modifications. [PR #10](https://github.com/nextstrain/measles/pull/10)
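The new-data check added in patch 3/8 can be tried out locally with a rough sketch like the one below. This is not part of the workflow: `cache_key` and `should_run_phylogenetic` are hypothetical helper names, and a plain directory of marker files stands in for the GitHub Actions cache. The sketch also assumes that a cache miss results in the key being stored for the next run, and it uses `sha256sum` as a loose substitute for `hashFiles`, whose exact digest differs.

```shell
# Hypothetical local stand-in for the check-new-data job (illustration only).
# A directory of empty marker files plays the role of the GitHub Actions cache.

# Derive a key from the contents of the hash file, loosely mirroring
#   key: ingest-output-sha256sum-${{ hashFiles('ingest-output-sha256sum') }}
# Assumption: sha256sum (GNU coreutils) is available on the machine.
cache_key() {
  printf 'ingest-output-sha256sum-%s' "$(sha256sum < "$1" | cut -d' ' -f1)"
}

# Print "skip" when the key was already recorded (cache hit: no new data).
# Otherwise record the key and print "run" (cache miss: new data or evicted entry).
should_run_phylogenetic() {
  hash_file="$1"   # file holding one S3 Metadata.sha256sum per line
  cache_dir="$2"   # hypothetical directory standing in for the cache
  key="$(cache_key "$hash_file")"
  if [ -e "$cache_dir/$key" ]; then
    echo "skip"
  else
    mkdir -p "$cache_dir"
    : > "$cache_dir/$key"
    echo "run"
  fi
}
```

Calling `should_run_phylogenetic ingest-output-sha256sum ./fake-cache` twice with an unchanged hash file should print `run` and then `skip`. Deleting the marker file imitates the 7-day eviction described in the commit message, after which the next call prints `run` again.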