* removed parameterisation of account_id * ml-ops component, catalogue and sample modules * Update gitignore and yml * Formatting changes to pass checks * remove acct refs * remove acct refs (2) * Amends to passing IAM role and adding name prefix * format fix to component module * Paramaterising of state machine input * Paramaterising of state machine input * Remove comments * Encode JSON for hyperparameter tuning input * Refactor MLOps, leverage AWS Lambda component module (#65) * replace zip files with py * move lambda files to catalog * remove extra comments * reference arns from lambda module * refactor vars to insulate lambda defs from s3_triggers * get arns from lambda module * refactor lambda function definitions * python cleanup and auto-formatting (using black) * fix source path * move lambda functions to ml ops * fix errors from refactoring * fix missing requirements.txt * Attempted bugfix: accessing non-existent pip[0] * typo * improved examples * Update components/aws/lambda-python/outputs.tf per suggested change Co-Authored-By: Jack Sandom <[email protected]> * updated variable name * updated variable name * output iam roles for ecs-task and lambda * Lambda IAM SageMaker policy attachment Co-authored-by: Jack Sandom <[email protected]> Co-authored-by: jacksandom <[email protected]> * updated auto-docs * add ml-ops module header * ECS Shap added and variable descriptions * updated docs * fix line endings * Normalize line endings * add ml-ops USAGE.md * Use BYO model and attrition data * Component format fix * Add batch transform step * addl docs on mlops module * auto-update readme docs * terraform fmt * fix merge error - missing 'functions' dev * add missing descriptions * terraform fmt * BYO model and Glue * Seperation of S3 buckets * Adds Lambda trigger * ml-ops USAGE.md file * ml-ops readme * glue readme * Add Glue crawler run and bucket updates * README updates and training image override * Update catalogue outputs * Glue WHL Readme * Outputs 
update and IAM destroy fix * Changes to sample module and sleep local resource * Clean up sample file * Fix whitespace * tfplan to gitignore * Re-ordering Step Functions * Add DynamoDB metadata store * Naming change and line endings * S3 as metadata store * Update usage markdown * resolved: non-deterministic resource count error * Change catalog name to 'ml-ops' * fixed case of no s3 triggers in lambda-python * Lambda docstrings and ECR kill switch, * Remove indent to ECR-image main * READMEs and formatting * Make score data optional for endpoint inference * Add MacOS / Linux ECR login command * add comment re: ECR creds Co-authored-by: Aaron Steers <[email protected]>
1 parent 2de2018, commit 781a7fc
Showing 72 changed files with 5,500 additions and 298 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,27 +1,30 @@ | ||
# Config file (comment this line to modify the template): | ||
samples/infra-config.yml | ||
build | ||
!build/README.md | ||
|
||
# Local .terraform directories | ||
.terraform | ||
|
||
**/secrets/** | ||
!**/secrets/*.md | ||
!**/secrets/*sample* | ||
!**/secrets/*template* | ||
|
||
**/.secrets/** | ||
!**/.secrets/*.md | ||
!**/.secrets/*sample* | ||
!**/.secrets/*template* | ||
|
||
# .tfstate files | ||
*.tfstate | ||
*.tfstate.* | ||
|
||
# .tfvars files | ||
*.tfvars | ||
|
||
# Other (Python) | ||
.mypy_cache | ||
# Config file (comment this line to modify the template): | ||
samples/infra-config.yml | ||
build | ||
!build/README.md | ||
|
||
# Local .terraform directories | ||
.terraform | ||
|
||
**/secrets/** | ||
!**/secrets/*.md | ||
!**/secrets/*sample* | ||
!**/secrets/*template* | ||
|
||
**/.secrets/** | ||
!**/.secrets/*.md | ||
!**/.secrets/*sample* | ||
!**/.secrets/*template* | ||
|
||
# .tfstate files | ||
*.tfstate | ||
*.tfstate.* | ||
|
||
# tfplan | ||
tfplan | ||
|
||
# .tfvars files | ||
*.tfvars | ||
|
||
# Other (Python) | ||
.mypy_cache |
@@ -0,0 +1,62 @@

# AWS ML-Ops

`/catalog/aws/ml-ops`

## Overview

This module automates MLOps tasks associated with training machine learning models.

The module orchestrates the workflow with AWS Step Functions, calling Lambda functions where needed. The state machine
runs hyperparameter tuning, training, and deployment. Supported deployment options
are SageMaker endpoints and/or batch inference.
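
Whether a trained model is deployed depends on the `inference_comparison_operator` and `inference_metric_threshold` inputs. The sketch below illustrates that gate, assuming the operator names follow the numeric comparison operators of a Step Functions Choice state. The `should_deploy` helper and the `COMPARISONS` table are hypothetical and are not part of this module.

```python
import operator

# Assumed mapping from Step Functions Choice-state operator names to Python
# comparisons. Illustrative only; this is not code from the module.
COMPARISONS = {
    "NumericGreaterThan": operator.gt,
    "NumericGreaterThanEquals": operator.ge,
    "NumericLessThan": operator.lt,
    "NumericLessThanEquals": operator.le,
    "NumericEquals": operator.eq,
}


def should_deploy(metric_value, comparison_operator="NumericGreaterThan", threshold=0.7):
    """Return True if the trained model's metric passes the deployment gate."""
    if comparison_operator not in COMPARISONS:
        raise ValueError(f"Unsupported comparison operator: {comparison_operator}")
    return COMPARISONS[comparison_operator](metric_value, threshold)


# With the module defaults, an accuracy of 0.82 clears the 0.7 threshold.
print(should_deploy(0.82))
# For an error-style metric, a "less than" operator would be the natural choice.
print(should_deploy(0.15, "NumericLessThan", 0.2))
```

The defaults used here mirror the documented defaults (`"NumericGreaterThan"` and `0.7`).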

## Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|:-----:|
| built\_in\_model\_image | Built-in model image to use.<br>Specifying this means that a 'bring-your-own' model is not required and the ECR image is not created. | `string` | n/a | yes |
| environment | Standard `environment` module input. | <pre>object({<br> vpc_id = string<br> aws_region = string<br> public_subnets = list(string)<br> private_subnets = list(string)<br> })</pre> | n/a | yes |
| feature\_store\_override | Optionally, you can override the default feature store bucket with a bucket that already exists. | `string` | n/a | yes |
| job\_name | Name prefix given to SageMaker model and training/tuning jobs (18 characters or fewer). | `string` | n/a | yes |
| name\_prefix | Standard `name_prefix` module input. | `string` | n/a | yes |
| resource\_tags | Standard `resource_tags` module input. | `map(string)` | n/a | yes |
| batch\_transform\_instance\_count | Number of batch transformation instances. | `number` | `1` | no |
| batch\_transform\_instance\_type | Instance type for batch inference. | `string` | `"ml.m4.xlarge"` | no |
| byo\_model\_image\_name | Image and repo name for bring your own model. | `string` | `"byo-xgboost"` | no |
| byo\_model\_image\_source\_path | Local source path for bring your own model docker image. | `string` | `"source/containers/ml-ops-byo-xgboost"` | no |
| byo\_model\_image\_tag | Tag for bring your own model image. | `string` | `"latest"` | no |
| endpoint\_instance\_count | Number of initial endpoint instances. | `number` | `1` | no |
| endpoint\_instance\_type | Instance type for inference endpoint. | `string` | `"ml.m4.xlarge"` | no |
| endpoint\_name | SageMaker inference endpoint to be created/updated. Endpoint will be created if<br>it does not already exist. | `string` | `"training-endpoint"` | no |
| endpoint\_or\_batch\_transform | Choose whether to create/update an inference API endpoint or do batch inference on test data. | `string` | `"Batch Transform"` | no |
| glue\_job\_name | Name of the Glue data transformation job. | `string` | `"data-transformation"` | no |
| glue\_job\_type | Type of Glue job (Spark or Python Shell). | `string` | `"pythonshell"` | no |
| inference\_comparison\_operator | Comparison operator for deploying the trained SageMaker model.<br>Used in combination with `inference_metric_threshold`.<br>Examples: 'NumericGreaterThan', 'NumericLessThan', etc. | `string` | `"NumericGreaterThan"` | no |
| inference\_metric\_threshold | Threshold for deploying the trained SageMaker model.<br>Used in combination with `inference_comparison_operator`. | `number` | `0.7` | no |
| max\_number\_training\_jobs | Maximum number of total training jobs for hyperparameter tuning. | `number` | `3` | no |
| max\_parallel\_training\_jobs | Maximum number of training jobs running in parallel for hyperparameter tuning. | `number` | `1` | no |
| parameter\_ranges | Tuning ranges for hyperparameters.<br>Expects a map of one or both "ContinuousParameterRanges" and "IntegerParameterRanges".<br>Each item in the map should point to a list of objects with the following keys:<br>- Name: name of the variable to be tuned<br>- MinValue: min value of the range<br>- MaxValue: max value of the range<br>- ScalingType: 'Auto', 'Linear', 'Logarithmic', or 'ReverseLogarithmic' | <pre>map(list(object({<br> Name = string<br> MinValue = string<br> MaxValue = string<br> ScalingType = string<br> })))</pre> | <pre>{<br> "ContinuousParameterRanges": [<br> {<br> "MaxValue": "10",<br> "MinValue": "0",<br> "Name": "gamma",<br> "ScalingType": "Auto"<br> },<br> {<br> "MaxValue": "20",<br> "MinValue": "1",<br> "Name": "min_child_weight",<br> "ScalingType": "Auto"<br> },<br> {<br> "MaxValue": "0.5",<br> "MinValue": "0.1",<br> "Name": "subsample",<br> "ScalingType": "Auto"<br> },<br> {<br> "MaxValue": "1",<br> "MinValue": "0",<br> "Name": "max_delta_step",<br> "ScalingType": "Auto"<br> },<br> {<br> "MaxValue": "10",<br> "MinValue": "1",<br> "Name": "scale_pos_weight",<br> "ScalingType": "Auto"<br> }<br> ],<br> "IntegerParameterRanges": [<br> {<br> "MaxValue": "10",<br> "MinValue": "1",<br> "Name": "max_depth",<br> "ScalingType": "Auto"<br> }<br> ]<br>}</pre> | no |
| score\_local\_path | Local path for scoring data. | `string` | `"source/data/score.csv"` | no |
| script\_path | Local path for Glue Python script. | `string` | `"source/scripts/transform.py"` | no |
| static\_hyperparameters | Map of hyperparameter names to static values, which should not be altered during hyperparameter tuning.<br>E.g. `{ "kfold_splits" = "5" }` | `map` | <pre>{<br> "kfold_splits": "5"<br>}</pre> | no |
| train\_local\_path | Local path for training data. | `string` | `"source/data/train.csv"` | no |
| training\_job\_instance\_count | Number of instances for training jobs. | `number` | `1` | no |
| training\_job\_instance\_type | Instance type for training jobs. | `string` | `"ml.m4.xlarge"` | no |
| training\_job\_storage\_in\_gb | Instance volume size in GB for training jobs. | `number` | `30` | no |
| tuning\_metric | Hyperparameter tuning metric, e.g. 'error', 'auc', 'f1', 'accuracy'. | `string` | `"accuracy"` | no |
| tuning\_objective | Hyperparameter tuning objective ('Minimize' or 'Maximize'). | `string` | `"Maximize"` | no |
| whl\_path | Local path for Glue Python .whl file. | `string` | `"source/scripts/python/pandasmodule-0.1-py3-none-any.whl"` | no |
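
To make the expected shape of `parameter_ranges` concrete, the sketch below checks a map against the keys and scaling types listed in the table and then JSON-encodes it. The `validate_parameter_ranges` helper is hypothetical and is not part of this module; it only restates the documented constraints, plus an assumed sanity check that `MinValue` does not exceed `MaxValue`.

```python
import json

VALID_RANGE_TYPES = {"ContinuousParameterRanges", "IntegerParameterRanges"}
VALID_SCALING_TYPES = {"Auto", "Linear", "Logarithmic", "ReverseLogarithmic"}
REQUIRED_KEYS = {"Name", "MinValue", "MaxValue", "ScalingType"}


def validate_parameter_ranges(parameter_ranges):
    """Raise ValueError if the map does not match the documented shape."""
    for range_type, ranges in parameter_ranges.items():
        if range_type not in VALID_RANGE_TYPES:
            raise ValueError(f"Unexpected range type: {range_type}")
        for item in ranges:
            missing = REQUIRED_KEYS - set(item)
            if missing:
                raise ValueError(f"Missing keys: {sorted(missing)}")
            if item["ScalingType"] not in VALID_SCALING_TYPES:
                raise ValueError(f"{item['Name']}: invalid ScalingType")
            # Values are strings in the module input, so compare them numerically.
            if float(item["MinValue"]) > float(item["MaxValue"]):
                raise ValueError(f"{item['Name']}: MinValue exceeds MaxValue")
    return parameter_ranges


ranges = {
    "ContinuousParameterRanges": [
        {"Name": "gamma", "MinValue": "0", "MaxValue": "10", "ScalingType": "Auto"},
    ],
    "IntegerParameterRanges": [
        {"Name": "max_depth", "MinValue": "1", "MaxValue": "10", "ScalingType": "Auto"},
    ],
}
encoded = json.dumps(validate_parameter_ranges(ranges))
print(encoded)
```

The two entries shown are taken from the default value in the table; the remaining defaults follow the same pattern.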

## Outputs

| Name | Description |
|------|-------------|
| summary | Summary of resources created by this module. |

---------------------

_**NOTE:** This documentation was auto-generated using
`terraform-docs` and `s-infra` from `slalom.dataops`.
Please do not attempt to manually update this file._
@@ -0,0 +1,55 @@
## Usage

### General Usage Instructions

#### Prereqs:

1. Create Glue jobs (see sample code in `transform.py`).

#### Terraform Config:

1. If additional Python dependencies are needed, list these in the [TK] config variable. These will be packaged into Python wheels (`.whl` files) and uploaded to S3 automatically.
2. Set the Terraform variable `script_path` to the location of the Glue transform code.

#### Terraform Deploy:

1. Run `terraform apply`, which creates all resources and uploads files to the correct bucket (enter 'yes' when prompted).

#### Execute State Machine:

1. Execute the state machine by landing your training data first, and then your scoring (prediction) data, in the feature store S3 bucket.
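
As a rough sketch of the landing step, the function below uploads the training file before the scoring file. It assumes a boto3 S3 client, whose `upload_file(Filename, Bucket, Key)` method does the transfer. The `land_data` helper and the object keys are hypothetical; use whichever keys your S3 triggers are configured to watch.

```python
def land_data(s3_client, bucket, train_path, score_path,
              train_key="train/train.csv", score_key="score/score.csv"):
    """Upload training data first, then scoring data; return keys in upload order.

    The default object keys are placeholders for illustration only.
    """
    uploads = [(train_path, train_key), (score_path, score_key)]
    for local_path, key in uploads:
        s3_client.upload_file(local_path, bucket, key)
    return [key for _, key in uploads]


# Usage, assuming boto3 is installed and AWS credentials are configured:
#   import boto3
#   land_data(boto3.client("s3"), "<feature-store-bucket>",
#             "source/data/train.csv", "source/data/score.csv")
```

Passing the client in as an argument keeps the helper easy to exercise with a stand-in object before pointing it at a real bucket.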

### Bring Your Own Model

_BYOM (Bring Your Own Model) allows you to build a custom Docker image which will be used during state machine execution, in place of the generic training image._

For BYOM, perform all of the steps above and also the steps below.

#### Additional Configuration

Create a local folder in the code repository which contains at least the following files:

* `Dockerfile`
* `.dockerignore`
* `build_and_push.sh`
* subfolder containing the following files:
  * Custom Python:
    * `train` (with no file extension)
    * `predictor.py`
  * Generic / boilerplate (copy from standard sample):
    * `serve` (with no file extension)
    * `wsgi.py` (wrapper for gunicorn to find your app)
    * `nginx.conf`
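
Before building the image, it can help to confirm that the folder has the expected layout. The check below is a minimal sketch: `missing_byom_files` is a hypothetical helper, and the subfolder name (`model` here) is an assumption, since the instructions above do not name it. The ignore file is checked under the lowercase name `.dockerignore`, which is the name Docker reads.

```python
import tempfile
from pathlib import Path

TOP_LEVEL_FILES = ["Dockerfile", ".dockerignore", "build_and_push.sh"]
SUBFOLDER_FILES = ["train", "predictor.py", "serve", "wsgi.py", "nginx.conf"]


def missing_byom_files(image_dir, subfolder):
    """Return the expected files that are absent from a BYOM image folder."""
    root = Path(image_dir)
    expected = [root / name for name in TOP_LEVEL_FILES]
    expected += [root / subfolder / name for name in SUBFOLDER_FILES]
    return [path.relative_to(root).as_posix() for path in expected if not path.is_file()]


# Demo on a throwaway folder that contains only a Dockerfile.
with tempfile.TemporaryDirectory() as demo_dir:
    Path(demo_dir, "Dockerfile").write_text("FROM python:3.8\n")
    missing = missing_byom_files(demo_dir, "model")
print(missing)
```

An empty list means every file named in the checklist is present.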

## File Stores Used by MLOps Module

#### File Stores (S3 Buckets):

1. Input Buckets:
   1. Feature Store - Input training and scoring data.
2. Managed Buckets:
   1. Source Repository - Location where Glue Python scripts are stored.
   2. Extract Store - Training data (model inputs) stored to be consumed by the training model. Default output location for the Glue transformation job(s).
   3. Model Store - Landing zone for pickled models as they are created and tuned by SageMaker training jobs.
   4. Metadata Store - For logging SageMaker metadata information about the tuning and training jobs.
   5. Output Store - Output from batch transformations (csv). Ignored when running endpoint inference.
@@ -0,0 +1,13 @@
# OPTIONAL: Only if using 'bring your own model'

module "ecr_image_byo_model" {
  source        = "../../../components/aws/ecr-image"
  name_prefix   = var.name_prefix
  environment   = var.environment
  resource_tags = var.resource_tags

  # Skip the ECR image when a built-in model image is specified.
  is_disabled       = var.built_in_model_image != null
  repository_name   = var.byo_model_image_name
  source_image_path = var.byo_model_image_source_path
  tag               = var.byo_model_image_tag
}
@@ -0,0 +1,11 @@
module "glue_crawler" {
  source        = "../../../components/aws/glue-crawler"
  name_prefix   = var.name_prefix
  environment   = var.environment
  resource_tags = var.resource_tags

  glue_database_name    = "${var.name_prefix}database"
  glue_crawler_name     = "${var.name_prefix}glue-crawler"
  s3_target_bucket_name = aws_s3_bucket.output_store.id
  target_path           = "batch-transform-output/"
}
@@ -0,0 +1,12 @@
module "glue_job" {
  source        = "../../../components/aws/glue-job"
  name_prefix   = var.name_prefix
  environment   = var.environment
  resource_tags = var.resource_tags

  job_type                   = var.glue_job_type
  s3_script_bucket_name      = aws_s3_bucket.source_repository.id
  s3_source_bucket_name      = var.feature_store_override != null ? data.aws_s3_bucket.feature_store_override[0].id : aws_s3_bucket.feature_store[0].id
  s3_destination_bucket_name = aws_s3_bucket.extracts_store.id
  script_path                = "glue/transform.py"
}
@@ -0,0 +1,24 @@
"""Check whether the SageMaker endpoint already exists and report whether to create or update it."""
import logging

import boto3

logger = logging.getLogger()
logger.setLevel(logging.INFO)
sm_client = boto3.client("sagemaker")


def lambda_handler(event, context):
    # The endpoint config name is the last path segment of its ARN.
    endpoint_config = event["EndpointConfigArn"].split("/")[-1]

    create_or_update = "Create"
    # list_endpoints returns one page per call, so paginate to see every endpoint.
    paginator = sm_client.get_paginator("list_endpoints")
    for page in paginator.paginate(NameContains=event["EndpointName"]):
        for endpoint in page["Endpoints"]:
            if endpoint["EndpointName"] == event["EndpointName"]:
                create_or_update = "Update"
    logger.info("Endpoint %s: %s", event["EndpointName"], create_or_update)
    return {
        "statusCode": 200,
        "endpointName": event["EndpointName"],
        "endpointConfig": endpoint_config,
        "CreateOrUpdate": create_or_update,
    }
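
The existence check can be exercised locally without AWS access. The sketch below factors the same check into a function that follows `NextToken` explicitly, and pairs it with a stand-in client. Both `create_or_update` and `FakeSageMakerClient` are hypothetical names introduced for illustration; only the `list_endpoints` request and response fields (`NameContains`, `NextToken`, `Endpoints`, `EndpointName`) come from the SageMaker API.

```python
def create_or_update(sm_client, endpoint_name):
    """Return "Update" if an endpoint with this exact name exists, else "Create"."""
    kwargs = {"NameContains": endpoint_name}
    while True:
        response = sm_client.list_endpoints(**kwargs)
        if any(e["EndpointName"] == endpoint_name for e in response["Endpoints"]):
            return "Update"
        # Results are paginated; follow NextToken until it is absent.
        if not response.get("NextToken"):
            return "Create"
        kwargs["NextToken"] = response["NextToken"]


class FakeSageMakerClient:
    """Hypothetical stand-in that serves one matching endpoint per page."""

    def __init__(self, names):
        self.names = names

    def list_endpoints(self, NameContains="", NextToken=None):
        index = int(NextToken or 0)
        matching = [n for n in self.names if NameContains in n]
        page = {"Endpoints": [{"EndpointName": n} for n in matching[index:index + 1]]}
        if index + 1 < len(matching):
            page["NextToken"] = str(index + 1)
        return page


client = FakeSageMakerClient(["training-endpoint-old", "training-endpoint"])
print(create_or_update(client, "training-endpoint"))
print(create_or_update(client, "other-endpoint"))
```

The one-item pages are deliberately small so that the second page has to be fetched before the exact match is found.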
@@ -0,0 +1,15 @@
"""Execute the step function training and deployment pipeline triggered by new training data landing in S3."""
import os

import boto3

state_machine_arn = os.environ["state_machine_arn"]
client = boto3.client("stepfunctions")


def lambda_handler(event, context):
    """Execute ML State Machine when new training data is uploaded to S3."""
    client.start_execution(stateMachineArn=state_machine_arn)

    return {"statusCode": 200}
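
The handler above starts the execution without passing any input. If the state machine ever needed to know which object triggered it, one option would be to forward the bucket and key from the S3 event through the `input` argument of `start_execution`. This is a sketch of that variation only, not the module's behaviour; `build_execution_request`, the sample event, and the ARN are hypothetical.

```python
import json
from urllib.parse import unquote_plus


def build_execution_request(event, state_machine_arn):
    """Build keyword arguments for stepfunctions.start_execution from an S3 event."""
    record = event["Records"][0]["s3"]
    payload = {
        "bucket": record["bucket"]["name"],
        # Object keys arrive URL-encoded in S3 event notifications.
        "key": unquote_plus(record["object"]["key"]),
    }
    return {"stateMachineArn": state_machine_arn, "input": json.dumps(payload)}


# Hypothetical event and ARN, trimmed to the fields used above.
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "example-feature-store"},
                "object": {"key": "input_data/train+v2.csv"},
            }
        }
    ]
}
request = build_execution_request(
    sample_event, "arn:aws:states:us-east-1:123456789012:stateMachine:example"
)
print(request["input"])
```

The returned dictionary could then be unpacked into the call, as in `client.start_execution(**request)`.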
@@ -0,0 +1,13 @@
"""Return the name and path of the best model from the hyperparameter tuning job."""


def lambda_handler(event, context):
    print(event)
    return {
        "HyperParameterTuningJobName": event["HyperParameterTuningJobName"],
        "bestTrainingJobName": event["BestTrainingJob"]["TrainingJobName"],
        "modelDataUrl": event["TrainingJobDefinition"]["OutputDataConfig"][
            "S3OutputPath"
        ]
        + "/"
        + event["BestTrainingJob"]["TrainingJobName"]
        + "/output/model.tar.gz",
    }
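
To show how the model URL is assembled, the sketch below applies the same concatenation to a trimmed sample event. The `model_data_url` helper, the job names, and the bucket are hypothetical; the `rstrip` call is an added guard against a doubled slash and is not in the handler above.

```python
def model_data_url(event):
    """Compose the S3 URL of the best training job's model artifact."""
    output_path = event["TrainingJobDefinition"]["OutputDataConfig"]["S3OutputPath"]
    best_job = event["BestTrainingJob"]["TrainingJobName"]
    # rstrip guards against a doubled slash if the output path ends with "/".
    return f"{output_path.rstrip('/')}/{best_job}/output/model.tar.gz"


# Hypothetical tuning-job description, reduced to the fields that are read.
sample_event = {
    "HyperParameterTuningJobName": "example-tuning-job",
    "BestTrainingJob": {"TrainingJobName": "example-tuning-job-003-abc"},
    "TrainingJobDefinition": {
        "OutputDataConfig": {"S3OutputPath": "s3://example-model-store/models"}
    },
}
print(model_data_url(sample_event))
# s3://example-model-store/models/example-tuning-job-003-abc/output/model.tar.gz
```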