MLOps modules (#62)
* removed parameterisation of account_id

* ml-ops component, catalogue and sample modules

* Update gitignore and yml

* Formatting changes to pass checks

* remove acct refs

* remove acct refs (2)

* Amends to passing IAM role and adding name prefix

* format fix to component module

* Parameterising of state machine input

* Parameterising of state machine input

* Remove comments

* Encode JSON for hyperparameter tuning input

* Refactor MLOps, leverage AWS Lambda component module (#65)

* replace zip files with py

* move lambda files to catalog

* remove extra comments

* reference arns from lambda module

* refactor vars to insulate lambda defs from s3_triggers

* get arns from lambda module

* refactor lambda function definitions

* python cleanup and auto-formatting (using black)

* fix source path

* move lambda functions to ml ops

* fix errors from refactoring

* fix missing requirements.txt

* Attempted bugfix: accessing non-existent pip[0]

* typo

* improved examples

* Update components/aws/lambda-python/outputs.tf

per suggested change

Co-Authored-By: Jack Sandom <[email protected]>

* updated variable name

* updated variable name

* output iam roles for ecs-task and lambda

* Lambda IAM SageMaker policy attachment

Co-authored-by: Jack Sandom <[email protected]>
Co-authored-by: jacksandom <[email protected]>

* updated auto-docs

* add ml-ops module header

* ECS Shap added and variable descriptions

* updated docs

* fix line endings

* Normalize line endings

* add ml-ops USAGE.md

* Use BYO model and attrition data

* Component format fix

* Add batch transform step

* addl docs on mlops module

* auto-update readme docs

* terraform fmt

* fix merge error - missing 'functions' dev

* add missing descriptions

* terraform fmt

* BYO model and Glue

* Separation of S3 buckets

* Adds Lambda trigger

* ml-ops USAGE.md file

* ml-ops readme

* glue readme

* Add Glue crawler run and bucket updates

* README updates and training image override

* Update catalogue outputs

* Glue WHL Readme

* Outputs update and IAM destroy fix

* Changes to sample module and sleep local resource

* Clean up sample file

* Fix whitespace

* tfplan to gitignore

* Re-ordering Step Functions

* Add DynamoDB metadata store

* Naming change and line endings

* S3 as metadata store

* Update usage markdown

* resolved: non-deterministic resource count error

* Change catalog name to 'ml-ops'

* fixed case of no s3 triggers in lambda-python

* Lambda docstrings and ECR kill switch

* Remove indent to ECR-image main

* READMEs and formatting

* Make score data optional for endpoint inference

* Add MacOS / Linux ECR login command

* add comment re: ECR creds

Co-authored-by: Aaron Steers <[email protected]>
jacksandom and aaronsteers authored Apr 28, 2020
1 parent 2de2018 commit 781a7fc
Showing 72 changed files with 5,500 additions and 298 deletions.
57 changes: 30 additions & 27 deletions .gitignore
@@ -1,27 +1,30 @@
# Config file (comment this line to modify the template):
samples/infra-config.yml
build
!build/README.md

# Local .terraform directories
.terraform

**/secrets/**
!**/secrets/*.md
!**/secrets/*sample*
!**/secrets/*template*

**/.secrets/**
!**/.secrets/*.md
!**/.secrets/*sample*
!**/.secrets/*template*

# .tfstate files
*.tfstate
*.tfstate.*

# tfplan
tfplan

# .tfvars files
*.tfvars

# Other (Python)
.mypy_cache
2 changes: 1 addition & 1 deletion catalog/aws/data-lake/README.md
@@ -18,7 +18,7 @@ trigger automatically when new content is added.
| lambda\_python\_source | Local path to a folder containing the lambda source code (e.g. 'resources/fn\_log') | `string` | n/a | yes |
| name\_prefix | Standard `name_prefix` module input. | `string` | n/a | yes |
| resource\_tags | Standard `resource_tags` module input. | `map(string)` | n/a | yes |
| s3\_triggers | List of S3 triggers objects, for example: [{ function\_name = "fn\_log" triggering\_path = "\*" function\_handler = "main.lambda\_handler" environment\_vars = {} environment\_secrets = {} }] | <pre>map(object({<br> # function_name = string<br> triggering_path = string<br> function_handler = string<br> environment_vars = map(string)<br> environment_secrets = map(string)<br> }))</pre> | `{}` | no |
| s3\_triggers | List of S3 trigger objects, for example:<pre>[{<br>  function_name = "fn_log"<br>  triggering_path = "*"<br>  lambda_handler = "main.lambda_handler"<br>  environment_vars = {}<br>  environment_secrets = {}<br>}]</pre> | <pre>map(<br>  # function_name as map key<br>  object({<br>    triggering_path = string<br>    lambda_handler = string<br>    environment_vars = map(string)<br>    environment_secrets = map(string)<br>  })<br>)</pre> | `{}` | no |

## Outputs

39 changes: 26 additions & 13 deletions catalog/aws/data-lake/main.tf
@@ -63,18 +63,31 @@ resource "aws_s3_bucket" "s3_logging_bucket" {
}

module "triggered_lambda" {
source = "../../../components/aws/lambda-python"
name_prefix = var.name_prefix
environment = var.environment
s3_trigger_bucket = local.data_bucket_name
s3_triggers = var.s3_triggers
lambda_source_folder = var.lambda_python_source
s3_path_to_lambda_zip = local.s3_path_to_lambda_zip
resource_tags = var.resource_tags
source = "../../../components/aws/lambda-python"
name_prefix = var.name_prefix
resource_tags = var.resource_tags
environment = var.environment

# depends_on = [
# aws_s3_bucket.s3_data_bucket,
# aws_s3_bucket.s3_logging_bucket,
# aws_s3_bucket.s3_metadata_bucket
# ]
runtime = "python3.8"
lambda_source_folder = var.lambda_python_source
upload_to_s3 = true
upload_to_s3_path = local.s3_path_to_lambda_zip

functions = {
for name, def in var.s3_triggers :
name => {
description = "'${name}' trigger for data lake events"
handler = def.lambda_handler
environment = def.environment_vars
secrets = def.environment_secrets
}
}
s3_triggers = [
for name, trigger in var.s3_triggers :
{
function_name = name
s3_bucket = local.data_bucket_name
s3_path = trigger.triggering_path
}
]
}
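
For reference, given a single `fn_log` trigger in `var.s3_triggers`, the two `for` expressions above would yield roughly the following literal values (a sketch; the bucket name is illustrative):

```hcl
# Derived 'functions' map passed to the lambda-python component:
functions = {
  fn_log = {
    description = "'fn_log' trigger for data lake events"
    handler     = "main.lambda_handler"
    environment = {}
    secrets     = {}
  }
}

# Derived 's3_triggers' list:
s3_triggers = [
  {
    function_name = "fn_log"
    s3_bucket     = "example-data-bucket" # local.data_bucket_name (illustrative)
    s3_path       = "*"
  }
]
```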
20 changes: 12 additions & 8 deletions catalog/aws/data-lake/variables.tf
@@ -41,20 +41,24 @@ variable "lambda_python_source" {
variable "s3_triggers" {
description = <<EOF
List of S3 trigger objects, for example:
```
[{
function_name = "fn_log"
triggering_path = "*"
function_handler = "main.lambda_handler"
lambda_handler = "main.lambda_handler"
environment_vars = {}
environment_secrets = {}
}]
```
EOF
type = map(object({
# function_name = string
triggering_path = string
function_handler = string
environment_vars = map(string)
environment_secrets = map(string)
}))
type = map(
# function_name as map key
object({
triggering_path = string
lambda_handler = string
environment_vars = map(string)
environment_secrets = map(string)
})
)
default = {}
}
62 changes: 62 additions & 0 deletions catalog/aws/ml-ops/README.md
@@ -0,0 +1,62 @@

# AWS ML-Ops

`/catalog/aws/ml-ops`

## Overview

This module automates MLOps tasks associated with training machine learning models.

It leverages AWS Step Functions and Lambda functions to orchestrate the pipeline: the
state machine runs hyperparameter tuning, training, and deployment steps as needed.
Supported deployment options are SageMaker inference endpoints and/or batch inference.
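
A minimal invocation might look like the following sketch (the module `source` path, name prefix, and `environment` wiring are assumptions; input names and default values come from the table below):

```hcl
module "ml_ops" {
  source        = "../../catalog/aws/ml-ops" # illustrative relative path
  name_prefix   = "mlops-"                   # hypothetical prefix
  environment   = var.environment            # standard environment object (vpc_id, aws_region, subnets)
  resource_tags = { project = "ml-ops-demo" }

  job_name         = "attrition-model" # 18 characters or less
  train_local_path = "source/data/train.csv"
  score_local_path = "source/data/score.csv"
  script_path      = "source/scripts/transform.py"

  endpoint_or_batch_transform = "Batch Transform" # default; see variable description
}
```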

## Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|:-----:|
| built\_in\_model\_image | Image URI of a built-in SageMaker model training image.<br>Specifying this means that a 'bring-your-own' model is not required and the ECR image is not created. | `string` | n/a | yes |
| environment | Standard `environment` module input. | <pre>object({<br> vpc_id = string<br> aws_region = string<br> public_subnets = list(string)<br> private_subnets = list(string)<br> })</pre> | n/a | yes |
| feature\_store\_override | Optionally, you can override the default feature store bucket with a bucket that already exists. | `string` | n/a | yes |
| job\_name | Name prefix given to SageMaker model and training/tuning jobs (18 characters or less). | `string` | n/a | yes |
| name\_prefix | Standard `name_prefix` module input. | `string` | n/a | yes |
| resource\_tags | Standard `resource_tags` module input. | `map(string)` | n/a | yes |
| batch\_transform\_instance\_count | Number of batch transformation instances. | `number` | `1` | no |
| batch\_transform\_instance\_type | Instance type for batch inference. | `string` | `"ml.m4.xlarge"` | no |
| byo\_model\_image\_name | Image and repo name for bring your own model. | `string` | `"byo-xgboost"` | no |
| byo\_model\_image\_source\_path | Local source path for bring your own model docker image. | `string` | `"source/containers/ml-ops-byo-xgboost"` | no |
| byo\_model\_image\_tag | Tag for bring your own model image. | `string` | `"latest"` | no |
| endpoint\_instance\_count | Number of initial endpoint instances. | `number` | `1` | no |
| endpoint\_instance\_type | Instance type for inference endpoint. | `string` | `"ml.m4.xlarge"` | no |
| endpoint\_name | SageMaker inference endpoint to be created/updated. Endpoint will be created if<br>it does not already exist. | `string` | `"training-endpoint"` | no |
| endpoint\_or\_batch\_transform | Choose whether to create/update an inference API endpoint or do batch inference on test data. | `string` | `"Batch Transform"` | no |
| glue\_job\_name | Name of the Glue data transformation job name. | `string` | `"data-transformation"` | no |
| glue\_job\_type | Type of Glue job (Spark or Python Shell). | `string` | `"pythonshell"` | no |
| inference\_comparison\_operator | Comparison operator for deploying the trained SageMaker model.<br>Used in combination with `inference_metric_threshold`.<br>Examples: 'NumericGreaterThan', 'NumericLessThan', etc. | `string` | `"NumericGreaterThan"` | no |
| inference\_metric\_threshold | Threshold for deploying the trained SageMaker model.<br>Used in combination with `inference_comparison_operator`. | `number` | `0.7` | no |
| max\_number\_training\_jobs | Maximum number of total training jobs for hyperparameter tuning. | `number` | `3` | no |
| max\_parallel\_training\_jobs | Maximum number of training jobs running in parallel for hyperparameter tuning. | `number` | `1` | no |
| parameter\_ranges | Tuning ranges for hyperparameters.<br>Expects a map of one or both "ContinuousParameterRanges" and "IntegerParameterRanges".<br>Each item in the map should point to a list of objects with the following keys:<br>- Name - name of the variable to be tuned<br>- MinValue - min value of the range<br>- MaxValue - max value of the range<br>- ScalingType - 'Auto', 'Linear', 'Logarithmic', or 'ReverseLogarithmic' | <pre>map(list(object({<br>  Name        = string<br>  MinValue    = string<br>  MaxValue    = string<br>  ScalingType = string<br>})))</pre> | <pre>{<br>  "ContinuousParameterRanges": [<br>    {<br>      "MaxValue": "10",<br>      "MinValue": "0",<br>      "Name": "gamma",<br>      "ScalingType": "Auto"<br>    },<br>    {<br>      "MaxValue": "20",<br>      "MinValue": "1",<br>      "Name": "min_child_weight",<br>      "ScalingType": "Auto"<br>    },<br>    {<br>      "MaxValue": "0.5",<br>      "MinValue": "0.1",<br>      "Name": "subsample",<br>      "ScalingType": "Auto"<br>    },<br>    {<br>      "MaxValue": "1",<br>      "MinValue": "0",<br>      "Name": "max_delta_step",<br>      "ScalingType": "Auto"<br>    },<br>    {<br>      "MaxValue": "10",<br>      "MinValue": "1",<br>      "Name": "scale_pos_weight",<br>      "ScalingType": "Auto"<br>    }<br>  ],<br>  "IntegerParameterRanges": [<br>    {<br>      "MaxValue": "10",<br>      "MinValue": "1",<br>      "Name": "max_depth",<br>      "ScalingType": "Auto"<br>    }<br>  ]<br>}</pre> | no |
| score\_local\_path | Local path for scoring data. | `string` | `"source/data/score.csv"` | no |
| script\_path | Local path for Glue Python script. | `string` | `"source/scripts/transform.py"` | no |
| static\_hyperparameters | Map of hyperparameter names to static values, which should not be altered during hyperparameter tuning.<br>E.g. `{ "kfold_splits" = "5" }` | `map` | <pre>{<br> "kfold_splits": "5"<br>}</pre> | no |
| train\_local\_path | Local path for training data. | `string` | `"source/data/train.csv"` | no |
| training\_job\_instance\_count | Number of instances for training jobs. | `number` | `1` | no |
| training\_job\_instance\_type | Instance type for training jobs. | `string` | `"ml.m4.xlarge"` | no |
| training\_job\_storage\_in\_gb | Instance volume size in GB for training jobs. | `number` | `30` | no |
| tuning\_metric | Hyperparameter tuning metric, e.g. 'error', 'auc', 'f1', 'accuracy'. | `string` | `"accuracy"` | no |
| tuning\_objective | Hyperparameter tuning objective ('Minimize' or 'Maximize'). | `string` | `"Maximize"` | no |
| whl\_path | Local path for Glue Python .whl file. | `string` | `"source/scripts/python/pandasmodule-0.1-py3-none-any.whl"` | no |

## Outputs

| Name | Description |
|------|-------------|
| summary | Summary of resources created by this module. |

---------------------

_**NOTE:** This documentation was auto-generated using
`terraform-docs` and `s-infra` from `slalom.dataops`.
Please do not attempt to manually update this file._
55 changes: 55 additions & 0 deletions catalog/aws/ml-ops/USAGE.md
@@ -0,0 +1,55 @@
## Usage

### General Usage Instructions

#### Prereqs:

1. Create Glue jobs (see sample code in `transform.py`).

#### Terraform Config:

1. If additional Python dependencies are needed, list these in the [TK] config variable. These will be packaged into Python wheels (`.whl` files) and uploaded to S3 automatically.
2. Configure the Terraform variable `script_path` with the location of your Glue transform code (see the sketch below).
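
A sketch of the relevant settings, using the module's documented default paths:

```hcl
script_path = "source/scripts/transform.py"
whl_path    = "source/scripts/python/pandasmodule-0.1-py3-none-any.whl"
```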

#### Terraform Deploy:

1. Run `terraform apply`, which creates all resources and uploads files to the correct buckets (enter 'yes' when prompted).

#### Execute State Machine:

1. Execute the state machine by landing your training data, then your scoring (prediction) data, in the feature store S3 bucket.

### Bring Your Own Model

_BYOM (Bring Your Own Model) allows you to build a custom Docker image which is used during state machine execution in place of the generic training image._

For BYOM, perform all of the above and also the steps below.

#### Additional Configuration

Create a local folder in the code repository which contains at least the following files (a sketch of the matching Terraform inputs follows this list):

* `Dockerfile`
* `.Dockerignore`
* `build_and_push.sh`
* subfolder containing the following files:
* Custom python:
* `train` (with no file extension)
* `predictor.py`
* Generic / boilerplate (copy from standard sample):
* `serve` (with no file extension)
* `wsgi.py` (wrapper for gunicorn to find your app)
* `nginx.conf`
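
On the Terraform side, the BYOM image build is configured through the `byo_model_image_*` inputs — a sketch using the module's documented defaults (leave `built_in_model_image` unset so the ECR image is actually built):

```hcl
byo_model_image_name        = "byo-xgboost"
byo_model_image_source_path = "source/containers/ml-ops-byo-xgboost"
byo_model_image_tag         = "latest"
# built_in_model_image intentionally omitted (null) so the ecr_image_byo_model module stays enabled
```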

## File Stores Used by MLOps Module

#### File Stores (S3 Buckets):

1. Input Buckets:
   1. Feature Store - Input training and scoring data.
2. Managed Buckets:
   1. Source Repository - Location where Glue Python scripts are stored.
   2. Extract Store - Training data (model inputs) staged for consumption by training jobs. Default output location for the Glue transformation job(s).
   3. Model Store - Landing zone for pickled models as they are created and tuned by SageMaker training jobs.
   4. Metadata Store - For logging SageMaker metadata about tuning and training jobs.
   5. Output Store - Output from batch transformations (CSV). Ignored when running endpoint inference.
13 changes: 13 additions & 0 deletions catalog/aws/ml-ops/ecr-image.tf
@@ -0,0 +1,13 @@
# OPTIONAL: Only if using 'bring your own model'

module "ecr_image_byo_model" {
source = "../../../components/aws/ecr-image"
name_prefix = var.name_prefix
environment = var.environment
resource_tags = var.resource_tags

is_disabled = var.built_in_model_image != null # skip the ECR build when a built-in image is supplied
repository_name = var.byo_model_image_name
source_image_path = var.byo_model_image_source_path
tag = var.byo_model_image_tag
}
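
Because `is_disabled` is derived from `built_in_model_image`, supplying a managed training image acts as the kill switch for the ECR build — a sketch (the image URI is illustrative):

```hcl
built_in_model_image = "123456789012.dkr.ecr.us-east-1.amazonaws.com/xgboost:1" # illustrative URI
```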
11 changes: 11 additions & 0 deletions catalog/aws/ml-ops/glue-crawler.tf
@@ -0,0 +1,11 @@
module "glue_crawler" {
source = "../../../components/aws/glue-crawler"
name_prefix = var.name_prefix
environment = var.environment
resource_tags = var.resource_tags

glue_database_name = "${var.name_prefix}database"
glue_crawler_name = "${var.name_prefix}glue-crawler"
s3_target_bucket_name = aws_s3_bucket.output_store.id
target_path = "batch-transform-output/"
}
12 changes: 12 additions & 0 deletions catalog/aws/ml-ops/glue-job.tf
@@ -0,0 +1,12 @@
module "glue_job" {
source = "../../../components/aws/glue-job"
name_prefix = var.name_prefix
environment = var.environment
resource_tags = var.resource_tags

job_type = var.glue_job_type
s3_script_bucket_name = aws_s3_bucket.source_repository.id
s3_source_bucket_name = var.feature_store_override != null ? data.aws_s3_bucket.feature_store_override[0].id : aws_s3_bucket.feature_store[0].id
s3_destination_bucket_name = aws_s3_bucket.extracts_store.id
script_path = "glue/transform.py"
}
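
To read features from a pre-existing bucket instead of the module-managed feature store, a caller would set the override input — a sketch with an illustrative bucket name:

```hcl
feature_store_override = "my-existing-feature-bucket"
```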
24 changes: 24 additions & 0 deletions catalog/aws/ml-ops/lambda-python/check_endpoint_exists.py
@@ -0,0 +1,24 @@
"""Check if SageMaker endpoint already exists and update the endpoint if it does. Otherwise create a new one."""
import boto3
import logging
import json

logger = logging.getLogger()
logger.setLevel(logging.INFO)
sm_client = boto3.client("sagemaker")


def lambda_handler(event, context):
endpointConfig = event["EndpointConfigArn"].split("/")[-1]

create_or_update = "Create"
response = sm_client.list_endpoints()
for i in response["Endpoints"]:
if i["EndpointName"] == event["EndpointName"]:
create_or_update = "Update"
return {
"statusCode": 200,
"endpointName": event["EndpointName"],
"endpointConfig": endpointConfig,
"CreateOrUpdate": create_or_update,
}
15 changes: 15 additions & 0 deletions catalog/aws/ml-ops/lambda-python/execute_state_machine.py
@@ -0,0 +1,15 @@
"""Execute the step function training and deployment pipeline triggered by new training data landing in S3."""
import boto3
import os

state_machine_arn = os.environ['state_machine_arn']
client = boto3.client('stepfunctions')

def lambda_handler(event, context):
"""Execute ML State Machine when new training data is uploaded to S3."""

client.start_execution(stateMachineArn=state_machine_arn)

return {
"statusCode": 200
}
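
On the Terraform side, the `state_machine_arn` environment variable read above would be supplied through the lambda-python component's `functions` map — a sketch (the resource and function names are assumptions):

```hcl
functions = {
  execute_state_machine = {
    description = "Start the ML state machine on new training data"
    handler     = "execute_state_machine.lambda_handler"
    environment = { state_machine_arn = aws_sfn_state_machine.pipeline.arn } # resource name assumed
    secrets     = {}
  }
}
```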
13 changes: 13 additions & 0 deletions catalog/aws/ml-ops/lambda-python/extract_model_path.py
@@ -0,0 +1,13 @@
"""Return the name and path of the best model from the hyperparameter tuning job."""
def lambda_handler(event, context):
print(event)
return {
"HyperParameterTuningJobName" : event["HyperParameterTuningJobName"],
"bestTrainingJobName": event["BestTrainingJob"]["TrainingJobName"],
"modelDataUrl": event["TrainingJobDefinition"]["OutputDataConfig"][
"S3OutputPath"
]
+ "/"
+ event["BestTrainingJob"]["TrainingJobName"]
+ "/output/model.tar.gz",
}