MLOps modules (#62)
* removed parameterisation of account_id

* ml-ops component, catalogue and sample modules

* Update gitignore and yml

* Formatting changes to pass checks

* remove acct refs

* remove acct refs (2)

* Amends to passing IAM role and adding name prefix

* format fix to component module

* Parameterising of state machine input

* Parameterising of state machine input

* Remove comments

* Encode JSON for hyperparameter tuning input

* Refactor MLOps, leverage AWS Lambda component module (#65)

* replace zip files with py

* move lambda files to catalog

* remove extra comments

* reference arns from lambda module

* refactor vars to insulate lambda defs from s3_triggers

* get arns from lambda module

* refactor lambda function definitions

* python cleanup and auto-formatting (using black)

* fix source path

* move lambda functions to ml ops

* fix errors from refactoring

* fix missing requirements.txt

* Attempted bugfix: accessing non-existent pip[0]

* typo

* improved examples

* Update components/aws/lambda-python/outputs.tf

per suggested change

Co-Authored-By: Jack Sandom <[email protected]>

* updated variable name

* updated variable name

* output iam roles for ecs-task and lambda

* Lambda IAM SageMaker policy attachment

Co-authored-by: Jack Sandom <[email protected]>
Co-authored-by: jacksandom <[email protected]>

* updated auto-docs

* add ml-ops module header

* ECS Shap added and variable descriptions

* updated docs

* fix line endings

* Normalize line endings

* add ml-ops USAGE.md

* Use BYO model and attrition data

* Component format fix

* Add batch transform step

* addl docs on mlops module

* auto-update readme docs

* terraform fmt

* fix merge error - missing 'functions' dev

* add missing descriptions

* terraform fmt

* BYO model and Glue

* Separation of S3 buckets

* Adds Lambda trigger

* ml-ops USAGE.md file

* ml-ops readme

* glue readme

* Add Glue crawler run and bucket updates

* README updates and training image override

* Update catalogue outputs

* Glue WHL Readme

* Outputs update and IAM destroy fix

* Changes to sample module and sleep local resource

* Clean up sample file

* Fix whitespace

* tfplan to gitignore

* Re-ordering Step Functions

* Add DynamoDB metadata store

* Naming change and line endings

* S3 as metadata store

* Update usage markdown

* resolved: non-deterministic resource count error

* Change catalog name to 'ml-ops'

* fixed case of no s3 triggers in lambda-python

* Lambda docstrings and ECR kill switch

* Remove indent to ECR-image main

* READMEs and formatting

* Make score data optional for endpoint inference

* Add MacOS / Linux ECR login command

* add comment re: ECR creds

Co-authored-by: Aaron Steers <[email protected]>
jacksandom and aaronsteers authored Apr 28, 2020
1 parent 2de2018 commit 781a7fc
Showing 72 changed files with 5,500 additions and 298 deletions.
57 changes: 30 additions & 27 deletions .gitignore
@@ -1,27 +1,30 @@
# Config file (comment this line to modify the template):
samples/infra-config.yml
build
!build/README.md

# Local .terraform directories
.terraform

**/secrets/**
!**/secrets/*.md
!**/secrets/*sample*
!**/secrets/*template*

**/.secrets/**
!**/.secrets/*.md
!**/.secrets/*sample*
!**/.secrets/*template*

# .tfstate files
*.tfstate
*.tfstate.*

# tfplan
tfplan

# .tfvars files
*.tfvars

# Other (Python)
.mypy_cache
2 changes: 1 addition & 1 deletion catalog/aws/data-lake/README.md
@@ -18,7 +18,7 @@ trigger automatically when new content is added.
| lambda\_python\_source | Local path to a folder containing the lambda source code (e.g. 'resources/fn\_log') | `string` | n/a | yes |
| name\_prefix | Standard `name_prefix` module input. | `string` | n/a | yes |
| resource\_tags | Standard `resource_tags` module input. | `map(string)` | n/a | yes |
| s3\_triggers | List of S3 triggers objects, for example: [{ function\_name = "fn\_log" triggering\_path = "\*" function\_handler = "main.lambda\_handler" environment\_vars = {} environment\_secrets = {} }] | <pre>map(object({<br> # function_name = string<br> triggering_path = string<br> function_handler = string<br> environment_vars = map(string)<br> environment_secrets = map(string)<br> }))</pre> | `{}` | no |
| s3\_triggers | List of S3 trigger objects, for example:<pre>[{<br>  function_name = "fn_log"<br>  triggering_path = "*"<br>  lambda_handler = "main.lambda_handler"<br>  environment_vars = {}<br>  environment_secrets = {}<br>}]</pre> | <pre>map(<br>  # function_name as map key<br>  object({<br>    triggering_path = string<br>    lambda_handler = string<br>    environment_vars = map(string)<br>    environment_secrets = map(string)<br>  })<br>)</pre> | `{}` | no |

## Outputs

39 changes: 26 additions & 13 deletions catalog/aws/data-lake/main.tf
@@ -63,18 +63,31 @@ resource "aws_s3_bucket" "s3_logging_bucket" {
}

module "triggered_lambda" {
source = "../../../components/aws/lambda-python"
name_prefix = var.name_prefix
environment = var.environment
s3_trigger_bucket = local.data_bucket_name
s3_triggers = var.s3_triggers
lambda_source_folder = var.lambda_python_source
s3_path_to_lambda_zip = local.s3_path_to_lambda_zip
resource_tags = var.resource_tags
source = "../../../components/aws/lambda-python"
name_prefix = var.name_prefix
resource_tags = var.resource_tags
environment = var.environment

# depends_on = [
# aws_s3_bucket.s3_data_bucket,
# aws_s3_bucket.s3_logging_bucket,
# aws_s3_bucket.s3_metadata_bucket
# ]
runtime = "python3.8"
lambda_source_folder = var.lambda_python_source
upload_to_s3 = true
upload_to_s3_path = local.s3_path_to_lambda_zip

functions = {
for name, def in var.s3_triggers :
name => {
description = "'${name}' trigger for data lake events"
handler = def.lambda_handler
environment = def.environment_vars
secrets = def.environment_secrets
}
}
s3_triggers = [
for name, trigger in var.s3_triggers :
{
function_name = name
s3_bucket = local.data_bucket_name
s3_path = trigger.triggering_path
}
]
}
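
For reference, given a single `fn_log` trigger in `var.s3_triggers`, the two `for` expressions above would yield roughly the following literal values (a sketch; the bucket name is illustrative):

```hcl
# Derived 'functions' map passed to the lambda-python component:
functions = {
  fn_log = {
    description = "'fn_log' trigger for data lake events"
    handler     = "main.lambda_handler"
    environment = {}
    secrets     = {}
  }
}

# Derived 's3_triggers' list:
s3_triggers = [
  {
    function_name = "fn_log"
    s3_bucket     = "example-data-bucket" # local.data_bucket_name (illustrative)
    s3_path       = "*"
  }
]
```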
20 changes: 12 additions & 8 deletions catalog/aws/data-lake/variables.tf
@@ -41,20 +41,24 @@ variable "lambda_python_source" {
variable "s3_triggers" {
description = <<EOF
List of S3 trigger objects, for example:
```
[{
function_name = "fn_log"
triggering_path = "*"
function_handler = "main.lambda_handler"
lambda_handler = "main.lambda_handler"
environment_vars = {}
environment_secrets = {}
}]
```
EOF
type = map(object({
# function_name = string
triggering_path = string
function_handler = string
environment_vars = map(string)
environment_secrets = map(string)
}))
type = map(
# function_name as map key
object({
triggering_path = string
lambda_handler = string
environment_vars = map(string)
environment_secrets = map(string)
})
)
default = {}
}
62 changes: 62 additions & 0 deletions catalog/aws/ml-ops/README.md
@@ -0,0 +1,62 @@

# AWS ML-Ops

`/catalog/aws/ml-ops`

## Overview

This module automates MLOps tasks associated with training machine learning models.

It leverages AWS Step Functions and Lambda functions to orchestrate the pipeline: the
state machine runs hyperparameter tuning, training, and deployment steps as needed.
Supported deployment options are SageMaker inference endpoints and/or batch inference.
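
A minimal invocation might look like the following sketch (the module `source` path, name prefix, and `environment` wiring are assumptions; input names and default values come from the table below):

```hcl
module "ml_ops" {
  source        = "../../catalog/aws/ml-ops" # illustrative relative path
  name_prefix   = "mlops-"                   # hypothetical prefix
  environment   = var.environment            # standard environment object (vpc_id, aws_region, subnets)
  resource_tags = { project = "ml-ops-demo" }

  job_name         = "attrition-model" # 18 characters or less
  train_local_path = "source/data/train.csv"
  score_local_path = "source/data/score.csv"
  script_path      = "source/scripts/transform.py"

  endpoint_or_batch_transform = "Batch Transform" # default; see variable description
}
```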

## Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|:-----:|
| built\_in\_model\_image | Image URI of a built-in SageMaker model training image.<br>Specifying this means that a 'bring-your-own' model is not required and the ECR image is not created. | `string` | n/a | yes |
| environment | Standard `environment` module input. | <pre>object({<br> vpc_id = string<br> aws_region = string<br> public_subnets = list(string)<br> private_subnets = list(string)<br> })</pre> | n/a | yes |
| feature\_store\_override | Optionally, you can override the default feature store bucket with a bucket that already exists. | `string` | n/a | yes |
| job\_name | Name prefix given to SageMaker model and training/tuning jobs (18 characters or less). | `string` | n/a | yes |
| name\_prefix | Standard `name_prefix` module input. | `string` | n/a | yes |
| resource\_tags | Standard `resource_tags` module input. | `map(string)` | n/a | yes |
| batch\_transform\_instance\_count | Number of batch transformation instances. | `number` | `1` | no |
| batch\_transform\_instance\_type | Instance type for batch inference. | `string` | `"ml.m4.xlarge"` | no |
| byo\_model\_image\_name | Image and repo name for bring your own model. | `string` | `"byo-xgboost"` | no |
| byo\_model\_image\_source\_path | Local source path for bring your own model docker image. | `string` | `"source/containers/ml-ops-byo-xgboost"` | no |
| byo\_model\_image\_tag | Tag for bring your own model image. | `string` | `"latest"` | no |
| endpoint\_instance\_count | Number of initial endpoint instances. | `number` | `1` | no |
| endpoint\_instance\_type | Instance type for inference endpoint. | `string` | `"ml.m4.xlarge"` | no |
| endpoint\_name | SageMaker inference endpoint to be created/updated. Endpoint will be created if<br>it does not already exist. | `string` | `"training-endpoint"` | no |
| endpoint\_or\_batch\_transform | Choose whether to create/update an inference API endpoint or do batch inference on test data. | `string` | `"Batch Transform"` | no |
| glue\_job\_name | Name of the Glue data transformation job name. | `string` | `"data-transformation"` | no |
| glue\_job\_type | Type of Glue job (Spark or Python Shell). | `string` | `"pythonshell"` | no |
| inference\_comparison\_operator | Comparison operator for deploying the trained SageMaker model.<br>Used in combination with `inference_metric_threshold`.<br>Examples: 'NumericGreaterThan', 'NumericLessThan', etc. | `string` | `"NumericGreaterThan"` | no |
| inference\_metric\_threshold | Threshold for deploying the trained SageMaker model.<br>Used in combination with `inference_comparison_operator`. | `number` | `0.7` | no |
| max\_number\_training\_jobs | Maximum number of total training jobs for hyperparameter tuning. | `number` | `3` | no |
| max\_parallel\_training\_jobs | Maximum number of training jobs running in parallel for hyperparameter tuning. | `number` | `1` | no |
| parameter\_ranges | Tuning ranges for hyperparameters.<br>Expects a map of one or both "ContinuousParameterRanges" and "IntegerParameterRanges".<br>Each item in the map should point to a list of objects with the following keys:<br>- Name - name of the variable to be tuned<br>- MinValue - min value of the range<br>- MaxValue - max value of the range<br>- ScalingType - 'Auto', 'Linear', 'Logarithmic', or 'ReverseLogarithmic' | <pre>map(list(object({<br>  Name        = string<br>  MinValue    = string<br>  MaxValue    = string<br>  ScalingType = string<br>})))</pre> | <pre>{<br>  "ContinuousParameterRanges": [<br>    {<br>      "MaxValue": "10",<br>      "MinValue": "0",<br>      "Name": "gamma",<br>      "ScalingType": "Auto"<br>    },<br>    {<br>      "MaxValue": "20",<br>      "MinValue": "1",<br>      "Name": "min_child_weight",<br>      "ScalingType": "Auto"<br>    },<br>    {<br>      "MaxValue": "0.5",<br>      "MinValue": "0.1",<br>      "Name": "subsample",<br>      "ScalingType": "Auto"<br>    },<br>    {<br>      "MaxValue": "1",<br>      "MinValue": "0",<br>      "Name": "max_delta_step",<br>      "ScalingType": "Auto"<br>    },<br>    {<br>      "MaxValue": "10",<br>      "MinValue": "1",<br>      "Name": "scale_pos_weight",<br>      "ScalingType": "Auto"<br>    }<br>  ],<br>  "IntegerParameterRanges": [<br>    {<br>      "MaxValue": "10",<br>      "MinValue": "1",<br>      "Name": "max_depth",<br>      "ScalingType": "Auto"<br>    }<br>  ]<br>}</pre> | no |
| score\_local\_path | Local path for scoring data. | `string` | `"source/data/score.csv"` | no |
| script\_path | Local path for Glue Python script. | `string` | `"source/scripts/transform.py"` | no |
| static\_hyperparameters | Map of hyperparameter names to static values, which should not be altered during hyperparameter tuning.<br>E.g. `{ "kfold_splits" = "5" }` | `map` | <pre>{<br> "kfold_splits": "5"<br>}</pre> | no |
| train\_local\_path | Local path for training data. | `string` | `"source/data/train.csv"` | no |
| training\_job\_instance\_count | Number of instances for training jobs. | `number` | `1` | no |
| training\_job\_instance\_type | Instance type for training jobs. | `string` | `"ml.m4.xlarge"` | no |
| training\_job\_storage\_in\_gb | Instance volume size in GB for training jobs. | `number` | `30` | no |
| tuning\_metric | Hyperparameter tuning metric, e.g. 'error', 'auc', 'f1', 'accuracy'. | `string` | `"accuracy"` | no |
| tuning\_objective | Hyperparameter tuning objective ('Minimize' or 'Maximize'). | `string` | `"Maximize"` | no |
| whl\_path | Local path for Glue Python .whl file. | `string` | `"source/scripts/python/pandasmodule-0.1-py3-none-any.whl"` | no |

## Outputs

| Name | Description |
|------|-------------|
| summary | Summary of resources created by this module. |

---------------------

_**NOTE:** This documentation was auto-generated using
`terraform-docs` and `s-infra` from `slalom.dataops`.
Please do not attempt to manually update this file._
55 changes: 55 additions & 0 deletions catalog/aws/ml-ops/USAGE.md
@@ -0,0 +1,55 @@
## Usage

### General Usage Instructions

#### Prereqs:

1. Create Glue jobs (see sample code in `transform.py`).

#### Terraform Config:

1. If additional Python dependencies are needed, list these in the [TK] config variable. These will be packaged into Python wheels (`.whl` files) and uploaded to S3 automatically.
2. Configure the Terraform variable `script_path` with the location of your Glue transform code (see the sketch below).
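
A sketch of the relevant settings, using the module's documented default paths:

```hcl
script_path = "source/scripts/transform.py"
whl_path    = "source/scripts/python/pandasmodule-0.1-py3-none-any.whl"
```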

#### Terraform Deploy:

1. Run `terraform apply`, which creates all resources and uploads files to the correct buckets (enter 'yes' when prompted).

#### Execute State Machine:

1. Execute the state machine by landing your training data, then your scoring (prediction) data, in the feature store S3 bucket.

### Bring Your Own Model

_BYOM (Bring Your Own Model) allows you to build a custom Docker image which is used during state machine execution in place of the generic training image._

For BYOM, perform all of the above and also the steps below.

#### Additional Configuration

Create a local folder in the code repository which contains at least the following files (a sketch of the matching Terraform inputs follows this list):

* `Dockerfile`
* `.Dockerignore`
* `build_and_push.sh`
* subfolder containing the following files:
* Custom python:
* `train` (with no file extension)
* `predictor.py`
* Generic / boilerplate (copy from standard sample):
* `serve` (with no file extension)
* `wsgi.py` (wrapper for gunicorn to find your app)
* `nginx.conf`
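
On the Terraform side, the BYOM image build is configured through the `byo_model_image_*` inputs — a sketch using the module's documented defaults (leave `built_in_model_image` unset so the ECR image is actually built):

```hcl
byo_model_image_name        = "byo-xgboost"
byo_model_image_source_path = "source/containers/ml-ops-byo-xgboost"
byo_model_image_tag         = "latest"
# built_in_model_image intentionally omitted (null) so the ecr_image_byo_model module stays enabled
```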

## File Stores Used by MLOps Module

#### File Stores (S3 Buckets):

1. Input Buckets:
   1. Feature Store - Input training and scoring data.
2. Managed Buckets:
   1. Source Repository - Location where Glue Python scripts are stored.
   2. Extract Store - Training data (model inputs) staged for consumption by training jobs. Default output location for the Glue transformation job(s).
   3. Model Store - Landing zone for pickled models as they are created and tuned by SageMaker training jobs.
   4. Metadata Store - For logging SageMaker metadata about tuning and training jobs.
   5. Output Store - Output from batch transformations (CSV). Ignored when running endpoint inference.
13 changes: 13 additions & 0 deletions catalog/aws/ml-ops/ecr-image.tf
@@ -0,0 +1,13 @@
# OPTIONAL: Only if using 'bring your own model'

module "ecr_image_byo_model" {
source = "../../../components/aws/ecr-image"
name_prefix = var.name_prefix
environment = var.environment
resource_tags = var.resource_tags

is_disabled = var.built_in_model_image != null # skip the ECR build when a built-in image is supplied
repository_name = var.byo_model_image_name
source_image_path = var.byo_model_image_source_path
tag = var.byo_model_image_tag
}
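
Because `is_disabled` is derived from `built_in_model_image`, supplying a managed training image acts as the kill switch for the ECR build — a sketch (the image URI is illustrative):

```hcl
built_in_model_image = "123456789012.dkr.ecr.us-east-1.amazonaws.com/xgboost:1" # illustrative URI
```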
11 changes: 11 additions & 0 deletions catalog/aws/ml-ops/glue-crawler.tf
@@ -0,0 +1,11 @@
module "glue_crawler" {
source = "../../../components/aws/glue-crawler"
name_prefix = var.name_prefix
environment = var.environment
resource_tags = var.resource_tags

glue_database_name = "${var.name_prefix}database"
glue_crawler_name = "${var.name_prefix}glue-crawler"
s3_target_bucket_name = aws_s3_bucket.output_store.id
target_path = "batch-transform-output/"
}
12 changes: 12 additions & 0 deletions catalog/aws/ml-ops/glue-job.tf
@@ -0,0 +1,12 @@
module "glue_job" {
source = "../../../components/aws/glue-job"
name_prefix = var.name_prefix
environment = var.environment
resource_tags = var.resource_tags

job_type = var.glue_job_type
s3_script_bucket_name = aws_s3_bucket.source_repository.id
s3_source_bucket_name = var.feature_store_override != null ? data.aws_s3_bucket.feature_store_override[0].id : aws_s3_bucket.feature_store[0].id
s3_destination_bucket_name = aws_s3_bucket.extracts_store.id
script_path = "glue/transform.py"
}
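
To read features from a pre-existing bucket instead of the module-managed feature store, a caller would set the override input — a sketch with an illustrative bucket name:

```hcl
feature_store_override = "my-existing-feature-bucket"
```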
24 changes: 24 additions & 0 deletions catalog/aws/ml-ops/lambda-python/check_endpoint_exists.py
@@ -0,0 +1,24 @@
"""Check if SageMaker endpoint already exists and update the endpoint if it does. Otherwise create a new one."""
import boto3
import logging
import json

logger = logging.getLogger()
logger.setLevel(logging.INFO)
sm_client = boto3.client("sagemaker")


def lambda_handler(event, context):
endpointConfig = event["EndpointConfigArn"].split("/")[-1]

create_or_update = "Create"
response = sm_client.list_endpoints()
for i in response["Endpoints"]:
if i["EndpointName"] == event["EndpointName"]:
create_or_update = "Update"
return {
"statusCode": 200,
"endpointName": event["EndpointName"],
"endpointConfig": endpointConfig,
"CreateOrUpdate": create_or_update,
}
15 changes: 15 additions & 0 deletions catalog/aws/ml-ops/lambda-python/execute_state_machine.py
@@ -0,0 +1,15 @@
"""Execute the step function training and deployment pipeline triggered by new training data landing in S3."""
import boto3
import os

state_machine_arn = os.environ['state_machine_arn']
client = boto3.client('stepfunctions')

def lambda_handler(event, context):
"""Execute ML State Machine when new training data is uploaded to S3."""

client.start_execution(stateMachineArn=state_machine_arn)

return {
"statusCode": 200
}
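
On the Terraform side, the `state_machine_arn` environment variable read above would be supplied through the lambda-python component's `functions` map — a sketch (the resource and function names are assumptions):

```hcl
functions = {
  execute_state_machine = {
    description = "Start the ML state machine on new training data"
    handler     = "execute_state_machine.lambda_handler"
    environment = { state_machine_arn = aws_sfn_state_machine.pipeline.arn } # resource name assumed
    secrets     = {}
  }
}
```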
13 changes: 13 additions & 0 deletions catalog/aws/ml-ops/lambda-python/extract_model_path.py
@@ -0,0 +1,13 @@
"""Return the name and path of the best model from the hyperparameter tuning job."""
def lambda_handler(event, context):
print(event)
return {
"HyperParameterTuningJobName" : event["HyperParameterTuningJobName"],
"bestTrainingJobName": event["BestTrainingJob"]["TrainingJobName"],
"modelDataUrl": event["TrainingJobDefinition"]["OutputDataConfig"][
"S3OutputPath"
]
+ "/"
+ event["BestTrainingJob"]["TrainingJobName"]
+ "/output/model.tar.gz",
}