refactor deps, update docs
aaronsteers committed May 14, 2020
1 parent f9eae20 commit 374bdb5
Showing 18 changed files with 280 additions and 28 deletions.
16 changes: 15 additions & 1 deletion catalog/README.md
@@ -10,6 +10,7 @@ The Infrastructure Catalog contains ready-to-deploy terraform modules for a vari
- [AWS Data-Lake](#aws-data-lake)
- [AWS DBT](#aws-dbt)
- [AWS Environment](#aws-environment)
- [AWS ML-Ops](#aws-ml-ops)
- [AWS MySQL](#aws-mysql)
- [AWS Postgres](#aws-postgres)
- [AWS Redshift](#aws-redshift)
@@ -67,6 +68,19 @@ from this module is designed to be passed easily to downstream modules, streamli

-------------------

### [AWS ML-Ops](../catalog/aws/ml-ops/README.md)

This module automates the MLOps tasks associated with training machine learning models.

The module orchestrates the workflow with a Step Functions state machine, calling Lambda
functions where needed. The state machine executes hyperparameter tuning, training, and
deployment; the supported deployment options are SageMaker endpoints and/or batch
inference. A usage sketch follows the links below.

* Source: `git::https://github.com/slalom-ggp/dataops-infra//catalog/aws/ml-ops?ref=master`
* See the [AWS ML-Ops Readme](../catalog/aws/ml-ops/README.md) for input/output specs and additional info.
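
A minimal usage sketch (hedged: the `name_prefix`, `environment`, and `resource_tags` wiring follows the conventions of the other catalog modules, and `module.env` is a hypothetical instance of the AWS Environment module — consult the module README for the authoritative input spec):

```hcl
module "ml_ops" {
  source = "git::https://github.com/slalom-ggp/dataops-infra//catalog/aws/ml-ops?ref=master"

  name_prefix   = "mlops-demo-"               # standard module input
  environment   = module.env.environment      # assumed: output of the AWS Environment module
  resource_tags = { project = "mlops-demo" }

  # Local paths documented in the module's inputs table:
  script_path      = "source/scripts/transform.py"
  train_local_path = "source/data/train.csv"
  score_local_path = "source/data/score.csv"
}
```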

-------------------

### [AWS MySQL](../catalog/aws/mysql/README.md)

Deploys a MySQL server running on RDS.
@@ -132,7 +146,7 @@ _(Coming soon)_

-------------------

_**NOTE:** This documentation was [auto-generated](../docs/build.py) using
_**NOTE:** This documentation was [auto-generated](build.py) using
`terraform-docs` and `s-infra` from `slalom.dataops`.
Please do not attempt to manually update this file._

73 changes: 72 additions & 1 deletion catalog/aws/ml-ops/README.md
@@ -38,7 +38,7 @@ supported are Sagemaker endpoints and/or batch inference.
| max\_number\_training\_jobs | Maximum number of total training jobs for hyperparameter tuning. | `number` | `3` | no |
| max\_parallel\_training\_jobs | Maximum number of training jobs running in parallel for hyperparameter tuning. | `number` | `1` | no |
| parameter\_ranges | Tuning ranges for hyperparameters.<br>Expects a map of one or both "ContinuousParameterRanges" and "IntegerParameterRanges".<br>Each item in the map should point to a list of object with the following keys: - Name - name of the variable to be tuned - MinValue - min value of the range - MaxValue - max value of the range - ScalingType - 'Auto', 'Linear', 'Logarithmic', or 'ReverseLogarithmic' | <pre>map(list(object({<br> Name = string<br> MinValue = string<br> MaxValue = string<br> ScalingType = string<br> })))</pre> | <pre>{<br> "ContinuousParameterRanges": [<br> {<br> "MaxValue": "10",<br> "MinValue": "0",<br> "Name": "gamma",<br> "ScalingType": "Auto"<br> },<br> {<br> "MaxValue": "20",<br> "MinValue": "1",<br> "Name": "min_child_weight",<br> "ScalingType": "Auto"<br> },<br> {<br> "MaxValue": "0.5",<br> "MinValue": "0.1",<br> "Name": "subsample",<br> "ScalingType": "Auto"<br> },<br> {<br> "MaxValue": "1",<br> "MinValue": "0",<br> "Name": "max_delta_step",<br> "ScalingType": "Auto"<br> },<br> {<br> "MaxValue": "10",<br> "MinValue": "1",<br> "Name": "scale_pos_weight",<br> "ScalingType": "Auto"<br> }<br> ],<br> "IntegerParameterRanges": [<br> {<br> "MaxValue": "10",<br> "MinValue": "1",<br> "Name": "max_depth",<br> "ScalingType": "Auto"<br> }<br> ]<br>}</pre> | no |
| score\_local\_path | Local path for scoring data. | `string` | `"source/data/score.csv"` | no |
| score\_local\_path | Local path for scoring data. Set to null for endpoint inference. | `string` | `"source/data/score.csv"` | no |
| script\_path | Local path for Glue Python script. | `string` | `"source/scripts/transform.py"` | no |
| static\_hyperparameters | Map of hyperparameter names to static values, which should not be altered during hyperparameter tuning.<br>E.g. `{ "kfold_splits" = "5" }` | `map` | <pre>{<br> "kfold_splits": "5"<br>}</pre> | no |
| train\_local\_path | Local path for training data. | `string` | `"source/data/train.csv"` | no |
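
The `parameter_ranges` input above expects a fairly deep structure. An override matching the declared type might look like the following sketch (the hyperparameter names are XGBoost examples drawn from the default value):

```hcl
parameter_ranges = {
  ContinuousParameterRanges = [
    {
      Name        = "gamma"   # hyperparameter to tune
      MinValue    = "0"
      MaxValue    = "10"
      ScalingType = "Auto"
    }
  ]
  IntegerParameterRanges = [
    {
      Name        = "max_depth"
      MinValue    = "1"
      MaxValue    = "10"
      ScalingType = "Auto"
    }
  ]
}
```
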
@@ -54,6 +54,77 @@ supported are Sagemaker endpoints and/or batch inference.
| Name | Description |
|------|-------------|
| summary | Summary of resources created by this module. |
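
Since `summary` is the module's only declared output, a root configuration would typically just pass it through, e.g. (illustrative, assuming a module instance named `ml_ops`):

```hcl
output "ml_ops_summary" {
  description = "Pass-through of the ML-Ops module's summary output."
  value       = module.ml_ops.summary
}
```
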
## Usage

### General Usage Instructions

#### Prereqs:

1. Create Glue job scripts (see sample code in `transform.py`).

#### Terraform Config:

1. If additional Python dependencies are needed, list them in the [TK] config variable. These will be packaged into Python wheels (`.whl` files) and uploaded to S3 automatically.
2. Configure the terraform variable `script_path` with the location of the Glue transform code (see the example below).
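
For example, in `terraform.tfvars` (illustrative values; paths are relative to the project root):

```hcl
script_path      = "source/scripts/transform.py"
train_local_path = "source/data/train.csv"
score_local_path = "source/data/score.csv"
```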

#### Terraform Deploy:

1. Run `terraform apply`, which will create all resources and upload files to the correct bucket (enter 'yes' when prompted).

#### Execute State Machine:

1. Trigger the state machine by landing your training data, followed by your scoring (prediction) data, in the feature store S3 bucket (see the sketch below).
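
A sketch of landing the training file with Terraform itself (the bucket and key names here are assumptions; any S3 upload mechanism works):

```hcl
# Hypothetical example only -- the feature store bucket is created by the module.
resource "aws_s3_bucket_object" "train_data" {
  bucket = "my-feature-store-bucket"         # assumed bucket name
  key    = "train/train.csv"
  source = "source/data/train.csv"
  etag   = filemd5("source/data/train.csv")  # re-upload when the file changes
}
```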

### Bring Your Own Model

_BYOM (Bring Your Own Model) allows you to build a custom docker image which is used during state machine execution in place of the generic training image._

For BYOM, perform all of the above and also the steps below.

#### Additional Configuration

Create a local folder in the code repository which contains at least the following files:

* `Dockerfile`
* `.dockerignore`
* `build_and_push.sh`
* a subfolder containing the following files:
  * Custom Python:
    * `train` (with no file extension)
    * `predictor.py`
  * Generic / boilerplate (copy from a standard sample):
    * `serve` (with no file extension)
    * `wsgi.py` (wrapper for gunicorn to find your app)
    * `nginx.conf`

## File Stores Used by MLOps Module

#### File Stores (S3 Buckets):

1. Input Buckets:
   1. Feature Store - Input training and scoring data.
2. Managed Buckets:
   1. Source Repository - Location where Glue Python scripts are stored.
   2. Extract Store - Training data (model inputs) staged for consumption by the training jobs. Default output location for the Glue transformation job(s).
   3. Model Store - Landing zone for pickled models as they are created and tuned by SageMaker training jobs.
   4. Metadata Store - Logs of SageMaker metadata about the tuning and training jobs.
   5. Output Store - Output from batch transformations (CSV). Ignored when running endpoint inference.


---------------------

## Source Files

_Source code for this module is available using the links below._

* [ecr-image.tf](ecr-image.tf)
* [glue-crawler.tf](glue-crawler.tf)
* [glue-job.tf](glue-job.tf)
* [lambda.tf](lambda.tf)
* [main.tf](main.tf)
* [outputs.tf](outputs.tf)
* [s3.tf](s3.tf)
* [variables.tf](variables.tf)

---------------------

3 changes: 3 additions & 0 deletions catalog/aws/mysql/README.md
@@ -20,12 +20,15 @@ Deploys a MySQL server running on RDS.
| kms\_key\_id | Optional. The ARN for the KMS encryption key used in cluster encryption. | `string` | n/a | yes |
| name\_prefix | Standard `name_prefix` module input. | `string` | n/a | yes |
| resource\_tags | Standard `resource_tags` module input. | `map(string)` | n/a | yes |
| database\_name | The name of the initial database to be created. | `string` | `"default_db"` | no |
| identifier | The database name which will be used within connection strings and URLs. | `string` | `"rds-db"` | no |
| instance\_class | Enter the desired node type. The default and cheapest option is 'db.t2.micro' @ ~$0.017/hr, or ~$12/mo (https://aws.amazon.com/rds/mysql/pricing/) | `string` | `"db.t2.micro"` | no |
| jdbc\_cidr | List of CIDR blocks which should be allowed to connect to the instance on the JDBC port. | `list(string)` | `[]` | no |
| jdbc\_port | Optional. Overrides the default JDBC port for incoming SQL connections. | `number` | `3306` | no |
| mysql\_version | Optional. The specific MySQL version to use. | `string` | `"5.7.26"` | no |
| skip\_final\_snapshot | If true, will allow terraform to destroy the RDS cluster without performing a final backup. | `bool` | `false` | no |
| storage\_size\_in\_gb | The allocated storage size, in GB. | `string` | `"20"` | no |
| whitelist\_terraform\_ip | True to allow the terraform user to connect to the DB instance. | `bool` | `true` | no |
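
A minimal usage sketch (hedged: the `environment` wiring is assumed to follow the repo's conventions, and `kms_key_id` is shown as `null` since its description marks it optional):

```hcl
module "mysql_db" {
  source = "git::https://github.com/slalom-ggp/dataops-infra//catalog/aws/mysql?ref=master"

  name_prefix   = "demo-"
  environment   = module.env.environment  # assumed: output of the AWS Environment module
  resource_tags = { project = "demo" }
  kms_key_id    = null                    # or an existing KMS key ARN

  database_name       = "demo_db"
  instance_class      = "db.t2.micro"
  jdbc_cidr           = ["10.0.0.0/16"]   # placeholder CIDR
  skip_final_snapshot = true              # dev/test convenience; leave false in production
}
```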

## Outputs

3 changes: 3 additions & 0 deletions catalog/aws/postgres/README.md
@@ -23,12 +23,15 @@ Deploys a Postgres server running on RDS.
| resource\_tags | Standard `resource_tags` module input. | `map(string)` | n/a | yes |
| s3\_logging\_bucket | Optional. An S3 bucket to use for log collection. | `string` | n/a | yes |
| s3\_logging\_path | Required if `s3_logging_bucket` is set. The path within the S3 bucket to use for log storage. | `string` | n/a | yes |
| database\_name | The name of the initial database to be created. | `string` | `"default_db"` | no |
| identifier | The database name which will be used within connection strings and URLs. | `string` | `"rds-postgres-db"` | no |
| instance\_class | Enter the desired node type. The default and cheapest option is 'db.t2.micro' @ ~$0.017/hr, or ~$12/mo (https://aws.amazon.com/rds/postgresql/pricing/) | `string` | `"db.t2.micro"` | no |
| jdbc\_cidr | List of CIDR blocks which should be allowed to connect to the instance on the JDBC port. | `list(string)` | `[]` | no |
| jdbc\_port | Optional. Overrides the default JDBC port for incoming SQL connections. | `number` | `5432` | no |
| postgres\_version | Optional. Overrides the version of the Postgres database engine. | `string` | `"11.5"` | no |
| skip\_final\_snapshot | If true, will allow terraform to destroy the RDS cluster without performing a final backup. | `bool` | `false` | no |
| storage\_size\_in\_gb | The allocated storage size, in GB. | `string` | `"10"` | no |
| whitelist\_terraform\_ip | True to allow the terraform user to connect to the DB instance. | `bool` | `true` | no |
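
Usage mirrors the MySQL module above; a short sketch highlighting the Postgres-specific inputs (required logging inputs such as `s3_logging_bucket` are omitted here — see the table above — and the `environment` wiring is assumed as before):

```hcl
module "postgres_db" {
  source = "git::https://github.com/slalom-ggp/dataops-infra//catalog/aws/postgres?ref=master"

  name_prefix   = "demo-"
  environment   = module.env.environment  # assumed wiring, as above
  resource_tags = { project = "demo" }

  postgres_version   = "11.5"
  jdbc_port          = 5432
  jdbc_cidr          = ["10.0.0.0/16"]    # placeholder CIDR
  storage_size_in_gb = "50"
}
```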

## Outputs

4 changes: 4 additions & 0 deletions catalog/aws/redshift/README.md
@@ -15,15 +15,19 @@ Redshift is an AWS database platform which applies MPP (Massively-Parallel-Proce
| admin\_password | The initial admin password. Must be 8 characters long. | `string` | n/a | yes |
| elastic\_ip | Optional. An Elastic IP endpoint which will be used for routing incoming traffic. | `string` | n/a | yes |
| environment | Standard `environment` module input. | <pre>object({<br> vpc_id = string<br> aws_region = string<br> public_subnets = list(string)<br> private_subnets = list(string)<br> })</pre> | n/a | yes |
| identifier | Optional. The unique identifier for the redshift cluster. | `string` | n/a | yes |
| kms\_key\_id | Optional. The ARN for the KMS encryption key used in cluster encryption. | `string` | n/a | yes |
| name\_prefix | Standard `name_prefix` module input. | `string` | n/a | yes |
| resource\_tags | Standard `resource_tags` module input. | `map(string)` | n/a | yes |
| s3\_logging\_bucket | Optional. An S3 bucket to use for log collection. | `string` | n/a | yes |
| s3\_logging\_path | Required if `s3_logging_bucket` is set. The path within the S3 bucket to use for log storage. | `string` | n/a | yes |
| admin\_username | Optional (default='rsadmin'). The initial admin username. | `string` | `"rsadmin"` | no |
| jdbc\_cidr | List of CIDR blocks which should be allowed to connect to the instance on the JDBC port. | `list(string)` | `[]` | no |
| jdbc\_port | Optional. Overrides the default JDBC port for incoming SQL connections. | `number` | `5439` | no |
| node\_type | Enter the desired node type. The default and cheapest option is 'dc2.large' @ ~$0.25/hr, ~$180/mo (https://aws.amazon.com/redshift/pricing/) | `string` | `"dc2.large"` | no |
| num\_nodes | Optional (default=1). The number of Redshift nodes to use. | `number` | `1` | no |
| skip\_final\_snapshot | If true, will allow terraform to destroy the RDS cluster without performing a final backup. | `bool` | `false` | no |
| whitelist\_terraform\_ip | True to allow the terraform user to connect to the DB instance. | `bool` | `true` | no |
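
A usage sketch showing the `environment` object built by hand (all values are placeholders; several inputs the generated table flags as required are described as optional and may accept `null`):

```hcl
module "redshift" {
  source = "git::https://github.com/slalom-ggp/dataops-infra//catalog/aws/redshift?ref=master"

  name_prefix   = "demo-"
  resource_tags = { project = "demo" }

  environment = {
    vpc_id          = "vpc-0abc1234"                  # placeholder
    aws_region      = "us-east-1"
    public_subnets  = ["subnet-aaaa", "subnet-bbbb"]  # placeholders
    private_subnets = ["subnet-cccc", "subnet-dddd"]
  }

  admin_password = var.redshift_admin_password  # assumed variable; keep secrets out of source control
  node_type      = "dc2.large"
  num_nodes      = 2
}
```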

## Outputs

2 changes: 1 addition & 1 deletion catalog/aws/singer-taps/README.md
@@ -28,7 +28,7 @@ The Singer Taps platform is the open source stack which powers the [Stitcher](ht
| container\_ram\_gb | Optional. Specify the amount of RAM to be available to the container. | `number` | `1` | no |
| data\_file\_naming\_scheme | The naming pattern to use when landing new files in the data lake. Allowed variables are: `{tap}`, `{table}`, `{version}`, and `{file}` | `string` | `"{tap}/{table}/v{version}/{file}"` | no |
| scheduled\_sync\_times | A list of one or more daily sync times in `HHMM` format. E.g.: `0400` for 4am, `1600` for 4pm | `list(string)` | `[]` | no |
| scheduled\_timezone | The timezone used in scheduling.<br>Currently the following codes are supported: PST, EST, UTC | `string` | `"PT"` | no |
| scheduled\_timezone | The timezone used in scheduling.<br>Currently the following codes are supported: PST, PDT, EST, UTC | `string` | `"PT"` | no |
| state\_file\_naming\_scheme | The naming pattern to use when writing or updating state files. State files keep track of<br>data recency and are necessary for incremental loading. Allowed variables are: `{tap}`, `{table}`, `{version}`, and `{file}` | `string` | `"{tap}/{table}/state/{tap}-{table}-v{version}-state.json"` | no |
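
To illustrate the scheduling and naming inputs above (illustrative values only):

```hcl
scheduled_sync_times    = ["0400", "1600"]  # 4am and 4pm daily
scheduled_timezone      = "PST"
data_file_naming_scheme = "{tap}/{table}/v{version}/{file}"
```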

## Outputs
50 changes: 48 additions & 2 deletions components/README.md
@@ -8,8 +8,11 @@ These components define the technical building blocks which enable advanced, rea
1. [AWS Components](#aws-components)
- [AWS EC2](#aws-ec2)
- [AWS ECR](#aws-ecr)
- [AWS ECR-Image](#aws-ecr-image)
- [AWS ECS-Cluster](#aws-ecs-cluster)
- [AWS ECS-Task](#aws-ecs-task)
- [AWS Glue-Crawler](#aws-glue-crawler)
- [AWS Glue-Job](#aws-glue-job)
- [AWS Lambda-Python](#aws-lambda-python)
- [AWS RDS](#aws-rds)
- [AWS Redshift](#aws-redshift)
@@ -47,6 +50,29 @@ should not be accessible to external users.

-------------------

### [AWS ECR-Image](../components/aws/ecr-image/README.md)

ECR (Elastic Container Registry) is the privately hosted AWS equivalent of DockerHub.
ECR allows you to securely publish docker images which should not be accessible to external users.

Known Issue (TODO): ECR push requires that the CLI credentials at runtime (`terraform apply`) match
the project's AWS credentials, as specified in `.secrets/aws-credentials`.

This _might_ help:

```bash
cd dataops-infra
# Point the AWS CLI and the terraform AWS provider at the project's credentials file:
export AWS_SHARED_CREDENTIALS_FILE="$(pwd)/.secrets/aws-credentials"
export AWS_PROFILE=default
cd infra
terraform apply
```

* Source: `git::https://github.com/slalom-ggp/dataops-infra//components/aws/ecr-image?ref=master`
* See the [AWS ECR-Image Readme](../components/aws/ecr-image/README.md) for input/output specs and additional info.

-------------------

### [AWS ECS-Cluster](../components/aws/ecs-cluster/README.md)

ECS, or EC2 Container Service, is able to run docker containers natively in AWS cloud. While the module can support classic EC2-based and Fargate,
@@ -73,6 +99,26 @@ Use in combination with the `ECS-Cluster` component.

-------------------

### [AWS Glue-Crawler](../components/aws/glue-crawler/README.md)

Glue is AWS's fully managed extract, transform, and load (ETL) service.
A Glue crawler is used to access a data store and create table definitions.
This can be used in conjunction with Amazon Athena to query flat files in S3 buckets using SQL.

* Source: `git::https://github.com/slalom-ggp/dataops-infra//components/aws/glue-crawler?ref=master`
* See the [AWS Glue-Crawler Readme](../components/aws/glue-crawler/README.md) for input/output specs and additional info.

-------------------

### [AWS Glue-Job](../components/aws/glue-job/README.md)

Glue is AWS's fully managed extract, transform, and load (ETL) service. A Glue job can be used to run ETL Python scripts.

* Source: `git::https://github.com/slalom-ggp/dataops-infra//components/aws/glue-job?ref=master`
* See the [AWS Glue-Job Readme](../components/aws/glue-job/README.md) for input/output specs and additional info.

-------------------

### [AWS Lambda-Python](../components/aws/lambda-python/README.md)

AWS Lambda is a platform which enables serverless execution of arbitrary functions. This module specifically focuses on the
@@ -155,7 +201,7 @@ Included automatically when creating this module:
* 1 VPC which contains the following:
* 2 private subnets (for resources which **do not** need a public IP address)
* 2 public subnets (for resources which do need a public IP address)
* 1 NAT gateway (allows private sugnet resources to reach the outside world)
* 1 NAT gateway (allows private subnet resources to reach the outside world)
* 1 Internet gateway (allows resources in public and private subnets to reach the internet)
* route tables and routes to connect all of the above

@@ -176,7 +222,7 @@ _(Coming soon)_

-------------------

_**NOTE:** This documentation was [auto-generated](../docs/build.py) using
_**NOTE:** This documentation was [auto-generated](build.py) using
`terraform-docs` and `s-infra` from `slalom.dataops`.
Please do not attempt to manually update this file._

23 changes: 23 additions & 0 deletions components/aws/ecr-image/README.md
@@ -9,6 +9,19 @@
ECR (Elastic Container Registry) is the privately hosted AWS equivalent of DockerHub.
ECR allows you to securely publish docker images which should not be accessible to external users.

Known Issue (TODO): ECR push requires that the CLI credentials at runtime (`terraform apply`) match
the project's AWS credentials, as specified in `.secrets/aws-credentials`.

This _might_ help:

```bash
cd dataops-infra
# Point the AWS CLI and the terraform AWS provider at the project's credentials file:
export AWS_SHARED_CREDENTIALS_FILE="$(pwd)/.secrets/aws-credentials"
export AWS_PROFILE=default
cd infra
terraform apply
```

## Inputs

| Name | Description | Type | Default | Required |
@@ -32,6 +45,16 @@ ECR allows you to securely publish docker images which should not be accessible

---------------------

## Source Files

_Source code for this module is available using the links below._

* [main.tf](main.tf)
* [outputs.tf](outputs.tf)
* [variables.tf](variables.tf)

---------------------

_**NOTE:** This documentation was auto-generated using
`terraform-docs` and `s-infra` from `slalom.dataops`.
Please do not attempt to manually update this file._
11 changes: 11 additions & 0 deletions components/aws/glue-crawler/README.md
@@ -30,6 +30,17 @@ This can be used in conjunction with Amazon Athena to query flat files in S3 buck

---------------------

## Source Files

_Source code for this module is available using the links below._

* [iam.tf](iam.tf)
* [main.tf](main.tf)
* [outputs.tf](outputs.tf)
* [variables.tf](variables.tf)

---------------------

_**NOTE:** This documentation was auto-generated using
`terraform-docs` and `s-infra` from `slalom.dataops`.
Please do not attempt to manually update this file._
11 changes: 11 additions & 0 deletions components/aws/glue-job/README.md
@@ -29,6 +29,17 @@ Glue is AWS's fully managed extract, transform, and load (ETL) service. A Glue j

---------------------

## Source Files

_Source code for this module is available using the links below._

* [iam.tf](iam.tf)
* [main.tf](main.tf)
* [outputs.tf](outputs.tf)
* [variables.tf](variables.tf)

---------------------

_**NOTE:** This documentation was auto-generated using
`terraform-docs` and `s-infra` from `slalom.dataops`.
Please do not attempt to manually update this file._
