Update the deploy.yaml workflow to run the model on CI #26

Merged
62 commits
c565cd1
Add Dockerfile and workflow for building and pushing Docker image to …
jeancochrane Oct 23, 2023
b93c3a6
Cache R and pip packages
jeancochrane Oct 24, 2023
d74fb4b
Temporarily simplify Dockerfile to test caching
jeancochrane Oct 24, 2023
efcf7e6
Attempt to fix up cache paths for GitHub workflow
jeancochrane Oct 24, 2023
92c2b1e
Temporarily try a simpler R package for testing cache
jeancochrane Oct 24, 2023
644e916
Try using cache dir env vars to pass mount paths to Dockerfile RUN co…
jeancochrane Oct 24, 2023
afa0c8d
Revert pip/renv caching for now
jeancochrane Oct 24, 2023
7223f4c
Add CML runner to deploy workflow and use it to run the model
jeancochrane Oct 25, 2023
4e3ab96
Remove explicit permissions from publish-docker-image job in deploy w…
jeancochrane Oct 25, 2023
4699510
Change SHA Docker tagging scheme to follow GitHub tags instead
jeancochrane Oct 25, 2023
35b15ed
Switch to PPM for R package installs
jeancochrane Oct 25, 2023
c41c3a7
Install more system dependencies into Dockerfile
jeancochrane Oct 25, 2023
3edd026
Fixup renv install in Dockerfile
jeancochrane Oct 25, 2023
62c144a
Simplify Pipfile to only keep dvc dependency
jeancochrane Oct 25, 2023
d1e9ff7
Try explicitly copying renv path to make sure it gets added in Docker…
jeancochrane Oct 25, 2023
9f93b29
Merge Dockerfile branch into model running branch
jeancochrane Oct 25, 2023
4be0bbd
Make CML --cloud-type and --cloud-gpu configurable
jeancochrane Oct 25, 2023
b851aa6
Don't require `deploy` environment in publish-docker-image workflow job
jeancochrane Oct 25, 2023
381b7e2
Merge branch 'master' into 'jeancochrane/23-infra-updates-add-a-workf…
jeancochrane Oct 25, 2023
55ee7fe
Try another format for handling optional workflow inputs in deploy.yaml
jeancochrane Oct 25, 2023
04f2246
Trigger CI run
jeancochrane Oct 25, 2023
6573c92
Fix enable_gpu input in deploy.yaml workflow
jeancochrane Oct 25, 2023
57d1c8d
Try using equals signs for `cml runner launch` invokation in deploy.y…
jeancochrane Oct 25, 2023
0dea848
Try noRetry instead of no-retry for cml runner launch flag
jeancochrane Oct 25, 2023
d1df1f1
Fix renv path in Dockerfile
jeancochrane Oct 26, 2023
7e2f44f
Fix typo in on.push.branches in deploy.yaml workflow
jeancochrane Oct 26, 2023
076bd9e
Strip out legacy CML implementation from deploy workflow
jeancochrane Oct 26, 2023
c6e0c9c
Add deploy.yaml workflow job and Terraform config for running the mod…
jeancochrane Oct 27, 2023
09defd0
Fix bash syntax in set-vars step of run-model job in deploy.yaml
jeancochrane Oct 30, 2023
4cb0461
Specify filename for Terraform variables in deploy.yaml
jeancochrane Oct 30, 2023
3753c46
Set outputs.workspace in set-vars step of run-model job in deploy.yaml
jeancochrane Oct 30, 2023
867b562
Add masks for sensitive AWS IDs to deploy.yaml
jeancochrane Oct 30, 2023
bb9b55b
Fix typo in "Submit job" step of deploy.yaml
jeancochrane Oct 30, 2023
f086c3e
Set proper working directory for "Submit job" step of deploy.yaml
jeancochrane Oct 30, 2023
c0c6bf9
Strip quotes from `terraform output` calls in batch submit-job in dep…
jeancochrane Oct 30, 2023
25e2d0d
Temporarily set -x in submit-job step of deploy.yaml
jeancochrane Oct 30, 2023
c25920a
Echo values of Terraform outputs in submit-job task in deploy.yaml
jeancochrane Oct 30, 2023
e66522c
Use terraform-bin when saving output to var to avoid wrapper script i…
jeancochrane Oct 30, 2023
836db64
Print job ID when retrieving Batch job details in deploy.yaml
jeancochrane Oct 30, 2023
1a79c45
Fix LOOP_COUNTER print statement in deploy.yaml
jeancochrane Oct 30, 2023
b758986
Temporarily print terraform.tfvars content in deploy.yaml
jeancochrane Oct 30, 2023
f11ec9b
Revert "Temporarily print terraform.tfvars content in deploy.yaml"
jeancochrane Oct 30, 2023
5cbeb15
Temporarily print build-and-push step output in deploy.yaml
jeancochrane Oct 30, 2023
ee91e3a
Attempt to extract image.name from image metadata in deploy.yaml
jeancochrane Oct 30, 2023
e324016
Change container image ID -> container image name in main.tf and depl…
jeancochrane Oct 30, 2023
a54355c
Attempt to fix loop counter increment logic in deploy.yaml
jeancochrane Oct 30, 2023
9ec240a
Remove extraneous quotes from Dockerfile CMD
jeancochrane Oct 31, 2023
78493a1
Fix Batch job log URL for printing in GitHub workflow logs
jeancochrane Oct 31, 2023
8377dde
Separate polling times for startup vs. running jobs in deploy.yaml
jeancochrane Oct 31, 2023
a1db4ea
Refactor deploy.yaml to move Terraform setup into dedicated setup-ter…
jeancochrane Oct 31, 2023
6069af2
Add cleanup-terraform workflow for deleting AWS resources when PRs ar…
jeancochrane Oct 31, 2023
ec6e9fb
Temporarily run cleanup-terraform and not deploy on PRs, for testing …
jeancochrane Oct 31, 2023
e73bc6e
Revert "Temporarily run cleanup-terraform and not deploy on PRs, for …
jeancochrane Oct 31, 2023
44beedb
Trigger full model run in Dockerfile CMD
jeancochrane Oct 31, 2023
302ce0b
Clean up docstrings and variable names ahead of review
jeancochrane Oct 31, 2023
6602dd7
Switch to ccao-ecs-model-runner role for Batch task execution
jeancochrane Nov 2, 2023
202f941
Temporarily refactor image to just install awscli
jeancochrane Nov 2, 2023
146b3b9
Add aws.ec2metadata package to enable ECS to use task role
jeancochrane Nov 2, 2023
ca7a306
Revert "Temporarily refactor image to just install awscli"
jeancochrane Nov 2, 2023
2a85c14
Tweak deploy.yaml and factor out Batch polling logic into shell script
jeancochrane Nov 6, 2023
45bc5e2
Make poll_batch_job_status.sh executable
jeancochrane Nov 6, 2023
f4384f9
Use parameter expansion to check for unset variables in poll_batch_jo…
jeancochrane Nov 6, 2023
87 changes: 87 additions & 0 deletions .github/actions/setup-terraform/action.yaml
@@ -0,0 +1,87 @@
name: Setup Terraform
Contributor Author:
Note that this action has a lot of logic in common with the equivalent action in model-sales-val. When I copy the workflows from this repo over to the condo model in ccao-data/model-condo-avm#5 I may decide to factor this logic out into an action that can be reused by workflows across repos; in the meantime, however, we accept a little bit of cross-repo duplication in order to iterate more quickly.

Member:
I think factoring this out would be wise, considering it's a pattern we're very likely to re-use. How about we make a ccao-data/actions repo, since we're starting to replicate lots of actions in general.

Contributor Author:
That sounds reasonable! My one concern is that a monorepo would make it harder to version individual actions, since new releases would have to apply all the actions in the repo. The trade-off is that a monorepo would make it faster to iterate on actions, particularly for actions that depend on one another (e.g. I can imagine that eventually there may be a reusable run-model workflow that depends on a setup-terraform action). An alternative would be to use a dedicated prefix like action-* for repos that represent actions, and keep them separate for versioning purposes. I'm somewhat torn and curious which pattern you prefer!

Member:
I think the pattern in r-lib/actions works pretty well. Minor versions can change the code underlying the tag and only breaking changes would warrant a major version bump. I think it makes sense to basically factor out stuff into composite actions in their own sub-directories. Curious what you think

Contributor Author:
I'm down to try it! Conceptually it seems kind of strange to me that a new major version would be applied to all packages in the repo even if breaking changes were only introduced in a subset of those packages, but if it's a common pattern in practice then I think it makes sense for us to follow it.
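
For illustration, a consumer workflow under that layout might reference the action roughly as follows; the ccao-data/actions repo, the setup-terraform/ subdirectory, the v1 tag, and the placeholder image name are all hypothetical at this point:

```yaml
# Hypothetical usage sketch -- repo name, action path, and version tag are assumptions
steps:
  - name: Checkout
    uses: actions/checkout@v4

  - name: Setup Terraform
    # Composite action kept in its own subdirectory of a shared actions repo,
    # pinned to a major version tag in the style of r-lib/actions
    uses: ccao-data/actions/setup-terraform@v1
    with:
      role-to-assume: ${{ secrets.AWS_IAM_ROLE_TO_ASSUME_ARN }}
      aws-account-id: ${{ secrets.AWS_ACCOUNT_ID }}
      batch-container-image-name: ghcr.io/example/model:master  # placeholder
```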

description: Install and configure Terraform and AWS for the correct environment
inputs:
role-to-assume:
description: AWS IAM role to assume when running Terraform operations.
required: true
aws-account-id:
description: AWS account ID to use to create resources.
required: true
batch-container-image-name:
description: The name of the container image to use for the Batch job.
required: true
role-duration-seconds:
description: How long the role specified by role-to-assume should be valid.
required: false
default: 3600
tfvars-file:
description: File to store Terraform variables.
required: false
default: terraform.tfvars
runs:
using: composite
steps:
- name: Mask sensitive AWS IDs from Terraform logs
run: |
echo "::add-mask::${{ inputs.role-to-assume }}"
echo "::add-mask::${{ inputs.aws-account-id }}"
shell: bash

- name: Configure AWS credentials
uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: ${{ inputs.role-to-assume }}
aws-region: us-east-1
role-duration-seconds: ${{ inputs.role-duration-seconds }}

- name: Set up Terraform
uses: hashicorp/setup-terraform@v2

- name: Initialize Terraform
run: terraform init
shell: bash
working-directory: terraform

- name: Set Terraform variables
id: set-vars
run: |
# GITHUB_HEAD_REF is only set on pull_request events, so if it's
# present, we must be in a PR context
if [ -n "$GITHUB_HEAD_REF" ]; then
echo "On pull request branch, setting terraform workspace to CI"
# Replace slashes and underscores with hyphens in the workspace name
# and force it to lowercase, since we use it to name resources and
# we want to follow a consistent naming scheme
WORKSPACE="$(echo $GITHUB_HEAD_REF | \
sed -e 's/\//-/g' -e 's/_/-/g' | \
tr '[:upper:]' '[:lower:]')"
BATCH_JOB_NAME="ci_${WORKSPACE}_${GITHUB_REPOSITORY//\//-}"

elif [[ $GITHUB_REF_NAME == 'master' ]]; then
echo "On master branch, setting terraform workspace to prod"
WORKSPACE="prod"
BATCH_JOB_NAME="${GITHUB_REPOSITORY//\//-}"

else
echo "CI context did not match any of the expected environments"
exit 1
fi

{
echo "batch_job_name = \"$BATCH_JOB_NAME\"";
echo "batch_container_image_name = \"$BATCH_CONTAINER_IMAGE_NAME\"";
} > "$TFVARS_FILE"

echo "workspace=$WORKSPACE" >> "$GITHUB_OUTPUT"
shell: bash
working-directory: terraform
env:
TFVARS_FILE: ${{ inputs.tfvars-file }}
BATCH_CONTAINER_IMAGE_NAME: ${{ inputs.batch-container-image-name }}

- name: Select Terraform workspace
run: terraform workspace select -or-create "$WORKSPACE"
shell: bash
working-directory: terraform
env:
WORKSPACE: ${{ steps.set-vars.outputs.workspace }}
119 changes: 119 additions & 0 deletions .github/scripts/poll_batch_job_status.sh
@@ -0,0 +1,119 @@
#!/usr/bin/env bash
Contributor Author:
Since this logic is shared between the steps that check for job startup and job completion, I factored it out into a shared script.
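
For reference, the two call sites added to deploy.yaml further down in this diff exercise both modes:

```bash
# Wait for the job to reach RUNNING (treats RUNNING as a terminal success status)
./.github/scripts/poll_batch_job_status.sh "$BATCH_JOB_ID" startup

# Wait for the job to finish (only SUCCEEDED counts as a terminal success status)
./.github/scripts/poll_batch_job_status.sh "$BATCH_JOB_ID"
```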

# Poll an AWS Batch job to check its status.
#
# Takes two positional arguments:
#
# 1. (required) The ID of the Batch job to poll
# 2. (optional) An enum indicating which type of poll to run: `startup` or
# `completion`. Defaults to `completion`. If the value of the argument is
# `startup`, the script will treat the `RUNNING` status as a terminal
# success status. Otherwise, only `SUCCEEDED` will count as a terminal
# success status, and a status of `RUNNING` will cause the script to
# continue polling.
#
# Example usage:
#
# ./.github/scripts/poll_batch_job_status.sh 12345 startup
set -euo pipefail

BATCH_JOB_LOG_URL_PREFIX="https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logsV2:log-groups/log-group/%2Faws%2Fbatch%2Fjob/log-events"
# How many times to poll AWS Batch job status while it's starting up before
# deciding to raise an error. Multiply by BATCH_JOB_POLL_INTERVAL_SECONDS to
# derive a timeout in second units. There is no equivalent timeout for running
# jobs, because those timeouts can be set on the Batch level, whereas startup
# timeouts are not controllable by Batch
BATCH_JOB_POLL_STARTUP_MAX_RETRIES=60
Contributor Author:
It could make sense to eventually make these attributes configurable, particularly once we factor this logic out into a shared action or workflow, but for the sake of faster iteration I decided to hardcode them for now since bash argument parsing is so annoying.
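
As a rough sketch of one way to do that later without full flag parsing, the hardcoded values could fall back to environment variables via parameter expansion (variable names mirror the existing ones; this is illustrative, not part of this PR):

```bash
# Illustrative only: let callers override the polling knobs through the
# environment, keeping the current hardcoded values as defaults. Works under
# `set -u` because the :- expansion tolerates unset variables.
BATCH_JOB_POLL_STARTUP_MAX_RETRIES="${BATCH_JOB_POLL_STARTUP_MAX_RETRIES:-60}"
BATCH_JOB_POLL_INTERVAL_SECONDS="${BATCH_JOB_POLL_INTERVAL_SECONDS:-300}"
```

A workflow step could then set these in its `env:` block without the script needing any new positional arguments.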


if [ -z ${1+x} ]; then
echo "Missing Batch job ID"
exit 1
fi

BATCH_JOB_ID="$1"

POLL_TYPE="completion"
# How long to wait between queries when polling
BATCH_JOB_POLL_INTERVAL_SECONDS=300 # 5 minutes

if [[ -n ${2+x} ]]; then
case "$2" in
"startup")
POLL_TYPE="startup"
BATCH_JOB_POLL_INTERVAL_SECONDS=60
;;

"completion")
;;

*)
echo "Positional argument must be one of 'startup' or 'completion', "
echo "got: '$2'"
exit 1
;;
esac
fi

echo "Polling for status of Batch job $BATCH_JOB_ID, waiting for $POLL_TYPE"


LOOP_COUNTER=0
while true; do
echo "Getting status of job $BATCH_JOB_ID"
JOB_DESCRIPTIONS=$(aws batch describe-jobs --jobs "$BATCH_JOB_ID")

JOB_LIST=$(echo "$JOB_DESCRIPTIONS" | jq -r '.jobs')
if [[ "$JOB_LIST" == "[]" ]]; then
echo "Unexpected empty response from aws batch describe-jobs"
exit 1
fi

JOB_STATUS=$(echo "$JOB_DESCRIPTIONS" | jq -r '.jobs[0].status')
echo "Job status is $JOB_STATUS"

JOB_LOG_STREAM_NAME=$(\
echo "$JOB_DESCRIPTIONS" | \
jq -r '.jobs[0].container.logStreamName' \
)
# Any slashes in the log stream name need to be urlencoded
JOB_LOG_URL="${BATCH_JOB_LOG_URL_PREFIX}/${JOB_LOG_STREAM_NAME//\//%2F}"

case "$JOB_STATUS" in
"RUNNING")
if [[ "$POLL_TYPE" == "startup" ]]; then
echo "Job has started! See logs: $JOB_LOG_URL"
exit 0
fi
;;

"SUCCEEDED")
echo "Job succeeded!"
exit 0
;;

"FAILED")
echo "Job failed :( See logs: $JOB_LOG_URL"
echo "More logs and container metrics can also be found on the "
echo "job detail page in the AWS Batch console"
exit 1
;;

*)
if [[ "$LOOP_COUNTER" == "$BATCH_JOB_POLL_STARTUP_MAX_RETRIES" ]]; then
echo "Failing workflow due to job startup timeout. This means "
echo "that the job did not enter a RUNNING state within a "
echo "reasonable amount of time. This usually indicates a "
echo "problem in the underlying ECS or EC2 backend that can "
echo "be debugged by checking cluster/instance logs in the "
echo "AWS console."
exit 1
fi
;;
esac

echo "Sleeping ${BATCH_JOB_POLL_INTERVAL_SECONDS}s until next status check"
sleep "$BATCH_JOB_POLL_INTERVAL_SECONDS"

LOOP_COUNTER=$((LOOP_COUNTER + 1))
echo "Starting status check #$LOOP_COUNTER"

done
33 changes: 33 additions & 0 deletions .github/workflows/cleanup-terraform.yaml
@@ -0,0 +1,33 @@
# Workflow that runs whenever a PR is closed or merged and deletes any
# AWS resources created by the `deploy` workflow.
name: cleanup-terraform

on:
pull_request:
types: [closed]

jobs:
cleanup-terraform:
runs-on: ubuntu-latest
# These permissions are needed to interact with GitHub's OIDC Token endpoint
# so that we can authenticate with AWS
permissions:
id-token: write
contents: read
steps:
- name: Checkout
uses: actions/checkout@v4

- name: Setup Terraform
uses: ./.github/actions/setup-terraform
with:
role-to-assume: ${{ secrets.AWS_IAM_ROLE_TO_ASSUME_ARN }}
aws-account-id: ${{ secrets.AWS_ACCOUNT_ID }}
# This value can be anything, since Terraform doesn't need correct
# values for variables in order to destroy resources
batch-container-image-name: foobar

- name: Delete resources using Terraform
run: terraform destroy -auto-approve
working-directory: terraform
shell: bash
104 changes: 98 additions & 6 deletions .github/workflows/deploy.yaml
@@ -1,17 +1,29 @@
# Workflow that builds a Docker image containing the model code,
# pushes it to the GitHub Container Registry, and then optionally uses
# that container image to run the model using an AWS Batch job.
#
# Images are built on every commit to a PR or main branch in order to ensure
# that the build continues to work properly, but Batch jobs are gated behind
# a `deploy` environment that requires manual approval from a
# @ccao-data/core-team member.

name: deploy

on:
pull_request:
workflow_dispatch:
push:
branches: [main, '*-assessment-year']
branches: [master]

env:
REGISTRY: ghcr.io
IMAGE_NAME: ${{ github.repository }}
DOCKER_REGISTRY: ghcr.io
DOCKER_IMAGE_NAME: ${{ github.repository }}

jobs:
publish-docker-image:
runs-on: ubuntu-latest
outputs:
image-name: ${{ steps.save-image-name.outputs.image-name }}
steps:
- name: Checkout
uses: actions/checkout@v4
@@ -22,25 +34,26 @@ jobs:
- name: Login to GitHub Container Registry
uses: docker/login-action@v3
with:
registry: ${{ env.REGISTRY }}
registry: ${{ env.DOCKER_REGISTRY }}
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}

- name: Extract metadata (tags, labels) for Docker
id: meta
uses: docker/metadata-action@v5
with:
images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
images: ${{ env.DOCKER_REGISTRY }}/${{ env.DOCKER_IMAGE_NAME }}
# Tag the following types of images:
# * On a branch, tag with the branch name (e.g. `master`)
# * On a PR, tag with the PR number (e.g. `pr-12`)
# * On all events, tag with the short git SHA (e.g. `e956384`)
# * On a tagged commit, tag with the git tag (e.g. `2023`)
tags: |
type=ref,event=branch
type=ref,event=pr
type=ref,event=tag

- name: Build and push Docker image
id: build-and-push
uses: docker/build-push-action@v5
with:
context: .
@@ -49,4 +62,83 @@ jobs:
labels: ${{ steps.meta.outputs.labels }}
cache-from: type=gha
cache-to: type=gha,mode=max
# Fix incorrect container type sidebar display in GitHub Container
# Registry
provenance: false

- name: Save image name to output
id: save-image-name
run: |
IMAGE_NAME=$(echo "$METADATA" | jq -r '."image.name"')
echo "image-name=$IMAGE_NAME" >> "$GITHUB_OUTPUT"
shell: bash
env:
METADATA: ${{ steps.build-and-push.outputs.metadata }}

run-model:
# Don't automatically run the model on push, since we prefer to use workflow
# dispatch for prod runs instead
if: github.event_name != 'push'
needs: [publish-docker-image]
runs-on: ubuntu-latest
# Require manual approval to run this job
environment: deploy
# These permissions are needed to interact with GitHub's OIDC Token endpoint
# so that we can authenticate with AWS
permissions:
id-token: write
contents: read
steps:
- name: Checkout
uses: actions/checkout@v4

- name: Setup Terraform
uses: ./.github/actions/setup-terraform
with:
role-to-assume: ${{ secrets.AWS_IAM_ROLE_TO_ASSUME_ARN }}
aws-account-id: ${{ secrets.AWS_ACCOUNT_ID }}
batch-container-image-name: ${{ needs.publish-docker-image.outputs.image-name }}
role-duration-seconds: 14400 # Worst-case time for a full model run

- name: Validate Terraform config
run: terraform validate
working-directory: terraform
shell: bash

- name: Apply Terraform changes
run: terraform apply -auto-approve
working-directory: terraform
shell: bash

- name: Submit new Batch job
id: submit-job
run: |
# Use terraform-bin to disable the wrapper script installed by
# the setup-terraform action, which adds extra context to
# `terraform output` calls
BATCH_JOB_NAME="$(terraform-bin output -raw batch_job_name)"
BATCH_JOB_DEFINITION="$(terraform-bin output -raw batch_job_definition_arn)"
BATCH_JOB_QUEUE="$(terraform-bin output -raw batch_job_queue_arn)"

BATCH_JOB_DETAILS=$(\
aws batch submit-job \
--job-name "$BATCH_JOB_NAME" \
--job-definition "$BATCH_JOB_DEFINITION" \
--job-queue "$BATCH_JOB_QUEUE" \
)
BATCH_JOB_ID=$(echo $BATCH_JOB_DETAILS | jq -r ".jobId")
echo "batch-job-id=$BATCH_JOB_ID" >> "$GITHUB_OUTPUT"
shell: bash
working-directory: terraform

- name: Wait for Batch job to start and print link to AWS logs
run: ./.github/scripts/poll_batch_job_status.sh "$BATCH_JOB_ID" startup
shell: bash
env:
BATCH_JOB_ID: ${{ steps.submit-job.outputs.batch-job-id }}

- name: Wait for Batch job to complete
run: ./.github/scripts/poll_batch_job_status.sh "$BATCH_JOB_ID"
shell: bash
env:
BATCH_JOB_ID: ${{ steps.submit-job.outputs.batch-job-id }}