refactor deps, update docs
aaronsteers committed May 14, 2020
1 parent f9eae20 commit 374bdb5
Showing 18 changed files with 280 additions and 28 deletions.
16 changes: 15 additions & 1 deletion catalog/README.md
@@ -10,6 +10,7 @@ The Infrastructure Catalog contains ready-to-deploy terraform modules for a vari
- [AWS Data-Lake](#aws-data-lake)
- [AWS DBT](#aws-dbt)
- [AWS Environment](#aws-environment)
- [AWS ML-Ops](#aws-ml-ops)
- [AWS MySQL](#aws-mysql)
- [AWS Postgres](#aws-postgres)
- [AWS Redshift](#aws-redshift)
@@ -67,6 +68,19 @@ from this module is designed to be passed easily to downstream modules, streamli

-------------------

### [AWS ML-Ops](../catalog/aws/ml-ops/README.md)

This module automates the MLOps tasks associated with training machine learning models.

The module orchestrates the workflow with a Step Functions state machine, calling Lambda
functions where needed. The state machine executes hyperparameter tuning, training, and
deployment; the supported deployment options are SageMaker endpoints and/or batch
inference. A usage sketch follows the links below.

* Source: `git::https://github.com/slalom-ggp/dataops-infra//catalog/aws/ml-ops?ref=master`
* See the [AWS ML-Ops Readme](../catalog/aws/ml-ops/README.md) for input/output specs and additional info.
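
A minimal usage sketch (hedged: the `name_prefix`, `environment`, and `resource_tags` wiring follows the conventions of the other catalog modules, and `module.env` is a hypothetical instance of the AWS Environment module — consult the module README for the authoritative input spec):

```hcl
module "ml_ops" {
  source = "git::https://github.com/slalom-ggp/dataops-infra//catalog/aws/ml-ops?ref=master"

  name_prefix   = "mlops-demo-"               # standard module input
  environment   = module.env.environment      # assumed: output of the AWS Environment module
  resource_tags = { project = "mlops-demo" }

  # Local paths documented in the module's inputs table:
  script_path      = "source/scripts/transform.py"
  train_local_path = "source/data/train.csv"
  score_local_path = "source/data/score.csv"
}
```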

-------------------

### [AWS MySQL](../catalog/aws/mysql/README.md)

Deploys a MySQL server running on RDS.
@@ -132,7 +146,7 @@ _(Coming soon)_

-------------------

_**NOTE:** This documentation was [auto-generated](../docs/build.py) using
_**NOTE:** This documentation was [auto-generated](build.py) using
`terraform-docs` and `s-infra` from `slalom.dataops`.
Please do not attempt to manually update this file._

73 changes: 72 additions & 1 deletion catalog/aws/ml-ops/README.md
@@ -38,7 +38,7 @@ supported are Sagemaker endpoints and/or batch inference.
| max\_number\_training\_jobs | Maximum number of total training jobs for hyperparameter tuning. | `number` | `3` | no |
| max\_parallel\_training\_jobs | Maximum number of training jobs running in parallel for hyperparameter tuning. | `number` | `1` | no |
| parameter\_ranges | Tuning ranges for hyperparameters.<br>Expects a map of one or both "ContinuousParameterRanges" and "IntegerParameterRanges".<br>Each item in the map should point to a list of object with the following keys: - Name - name of the variable to be tuned - MinValue - min value of the range - MaxValue - max value of the range - ScalingType - 'Auto', 'Linear', 'Logarithmic', or 'ReverseLogarithmic' | <pre>map(list(object({<br> Name = string<br> MinValue = string<br> MaxValue = string<br> ScalingType = string<br> })))</pre> | <pre>{<br> "ContinuousParameterRanges": [<br> {<br> "MaxValue": "10",<br> "MinValue": "0",<br> "Name": "gamma",<br> "ScalingType": "Auto"<br> },<br> {<br> "MaxValue": "20",<br> "MinValue": "1",<br> "Name": "min_child_weight",<br> "ScalingType": "Auto"<br> },<br> {<br> "MaxValue": "0.5",<br> "MinValue": "0.1",<br> "Name": "subsample",<br> "ScalingType": "Auto"<br> },<br> {<br> "MaxValue": "1",<br> "MinValue": "0",<br> "Name": "max_delta_step",<br> "ScalingType": "Auto"<br> },<br> {<br> "MaxValue": "10",<br> "MinValue": "1",<br> "Name": "scale_pos_weight",<br> "ScalingType": "Auto"<br> }<br> ],<br> "IntegerParameterRanges": [<br> {<br> "MaxValue": "10",<br> "MinValue": "1",<br> "Name": "max_depth",<br> "ScalingType": "Auto"<br> }<br> ]<br>}</pre> | no |
| score\_local\_path | Local path for scoring data. | `string` | `"source/data/score.csv"` | no |
| score\_local\_path | Local path for scoring data. Set to null for endpoint inference. | `string` | `"source/data/score.csv"` | no |
| script\_path | Local path for Glue Python script. | `string` | `"source/scripts/transform.py"` | no |
| static\_hyperparameters | Map of hyperparameter names to static values, which should not be altered during hyperparameter tuning.<br>E.g. `{ "kfold_splits" = "5" }` | `map` | <pre>{<br> "kfold_splits": "5"<br>}</pre> | no |
| train\_local\_path | Local path for training data. | `string` | `"source/data/train.csv"` | no |
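
The `parameter_ranges` input above expects a fairly deep structure. An override matching the declared type might look like the following sketch (the hyperparameter names are XGBoost examples drawn from the default value):

```hcl
parameter_ranges = {
  ContinuousParameterRanges = [
    {
      Name        = "gamma"   # hyperparameter to tune
      MinValue    = "0"
      MaxValue    = "10"
      ScalingType = "Auto"
    }
  ]
  IntegerParameterRanges = [
    {
      Name        = "max_depth"
      MinValue    = "1"
      MaxValue    = "10"
      ScalingType = "Auto"
    }
  ]
}
```
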
@@ -54,6 +54,77 @@ supported are Sagemaker endpoints and/or batch inference.
| Name | Description |
|------|-------------|
| summary | Summary of resources created by this module. |
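
Since `summary` is the module's only declared output, a root configuration would typically just pass it through, e.g. (illustrative, assuming a module instance named `ml_ops`):

```hcl
output "ml_ops_summary" {
  description = "Pass-through of the ML-Ops module's summary output."
  value       = module.ml_ops.summary
}
```
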
## Usage

### General Usage Instructions

#### Prereqs:

1. Create Glue job scripts (see sample code in `transform.py`).

#### Terraform Config:

1. If additional Python dependencies are needed, list them in the [TK] config variable. These will be packaged into Python wheels (`.whl` files) and uploaded to S3 automatically.
2. Configure the terraform variable `script_path` with the location of the Glue transform code (see the example below).
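
For example, in `terraform.tfvars` (illustrative values; paths are relative to the project root):

```hcl
script_path      = "source/scripts/transform.py"
train_local_path = "source/data/train.csv"
score_local_path = "source/data/score.csv"
```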

#### Terraform Deploy:

1. Run `terraform apply`, which will create all resources and upload files to the correct bucket (enter 'yes' when prompted).

#### Execute State Machine:

1. Trigger the state machine by landing your training data, followed by your scoring (prediction) data, in the feature store S3 bucket (see the sketch below).
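
A sketch of landing the training file with Terraform itself (the bucket and key names here are assumptions; any S3 upload mechanism works):

```hcl
# Hypothetical example only -- the feature store bucket is created by the module.
resource "aws_s3_bucket_object" "train_data" {
  bucket = "my-feature-store-bucket"         # assumed bucket name
  key    = "train/train.csv"
  source = "source/data/train.csv"
  etag   = filemd5("source/data/train.csv")  # re-upload when the file changes
}
```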

### Bring Your Own Model

_BYOM (Bring Your Own Model) allows you to build a custom docker image which is used during state machine execution in place of the generic training image._

For BYOM, perform all of the above and also the steps below.

#### Additional Configuration

Create a local folder in the code repository which contains at least the following files:

* `Dockerfile`
* `.dockerignore`
* `build_and_push.sh`
* a subfolder containing the following files:
  * Custom Python:
    * `train` (with no file extension)
    * `predictor.py`
  * Generic / boilerplate (copy from a standard sample):
    * `serve` (with no file extension)
    * `wsgi.py` (wrapper for gunicorn to find your app)
    * `nginx.conf`

## File Stores Used by MLOps Module

#### File Stores (S3 Buckets):

1. Input Buckets:
   1. Feature Store - Input training and scoring data.
2. Managed Buckets:
   1. Source Repository - Location where Glue Python scripts are stored.
   2. Extract Store - Training data (model inputs) staged for consumption by the training jobs. Default output location for the Glue transformation job(s).
   3. Model Store - Landing zone for pickled models as they are created and tuned by SageMaker training jobs.
   4. Metadata Store - Logs of SageMaker metadata about the tuning and training jobs.
   5. Output Store - Output from batch transformations (CSV). Ignored when running endpoint inference.


---------------------

## Source Files

_Source code for this module is available using the links below._

* [ecr-image.tf](ecr-image.tf)
* [glue-crawler.tf](glue-crawler.tf)
* [glue-job.tf](glue-job.tf)
* [lambda.tf](lambda.tf)
* [main.tf](main.tf)
* [outputs.tf](outputs.tf)
* [s3.tf](s3.tf)
* [variables.tf](variables.tf)

---------------------

3 changes: 3 additions & 0 deletions catalog/aws/mysql/README.md
@@ -20,12 +20,15 @@ Deploys a MySQL server running on RDS.
| kms\_key\_id | Optional. The ARN for the KMS encryption key used in cluster encryption. | `string` | n/a | yes |
| name\_prefix | Standard `name_prefix` module input. | `string` | n/a | yes |
| resource\_tags | Standard `resource_tags` module input. | `map(string)` | n/a | yes |
| database\_name | The name of the initial database to be created. | `string` | `"default_db"` | no |
| identifier | The database name which will be used within connection strings and URLs. | `string` | `"rds-db"` | no |
| instance\_class | Enter the desired node type. The default and cheapest option is 'db.t2.micro' @ ~$0.017/hr, or ~$12/mo (https://aws.amazon.com/rds/mysql/pricing/) | `string` | `"db.t2.micro"` | no |
| jdbc\_cidr | List of CIDR blocks which should be allowed to connect to the instance on the JDBC port. | `list(string)` | `[]` | no |
| jdbc\_port | Optional. Overrides the default JDBC port for incoming SQL connections. | `number` | `3306` | no |
| mysql\_version | Optional. The specific MySQL version to use. | `string` | `"5.7.26"` | no |
| skip\_final\_snapshot | If true, will allow terraform to destroy the RDS cluster without performing a final backup. | `bool` | `false` | no |
| storage\_size\_in\_gb | The allocated storage size, in GB. | `string` | `"20"` | no |
| whitelist\_terraform\_ip | True to allow the terraform user to connect to the DB instance. | `bool` | `true` | no |
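
A minimal usage sketch (hedged: the `environment` wiring is assumed to follow the repo's conventions, and `kms_key_id` is shown as `null` since its description marks it optional):

```hcl
module "mysql_db" {
  source = "git::https://github.com/slalom-ggp/dataops-infra//catalog/aws/mysql?ref=master"

  name_prefix   = "demo-"
  environment   = module.env.environment  # assumed: output of the AWS Environment module
  resource_tags = { project = "demo" }
  kms_key_id    = null                    # or an existing KMS key ARN

  database_name       = "demo_db"
  instance_class      = "db.t2.micro"
  jdbc_cidr           = ["10.0.0.0/16"]   # placeholder CIDR
  skip_final_snapshot = true              # dev/test convenience; leave false in production
}
```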

## Outputs

3 changes: 3 additions & 0 deletions catalog/aws/postgres/README.md
@@ -23,12 +23,15 @@ Deploys a Postgres server running on RDS.
| resource\_tags | Standard `resource_tags` module input. | `map(string)` | n/a | yes |
| s3\_logging\_bucket | Optional. An S3 bucket to use for log collection. | `string` | n/a | yes |
| s3\_logging\_path | Required if `s3_logging_bucket` is set. The path within the S3 bucket to use for log storage. | `string` | n/a | yes |
| database\_name | The name of the initial database to be created. | `string` | `"default_db"` | no |
| identifier | The database name which will be used within connection strings and URLs. | `string` | `"rds-postgres-db"` | no |
| instance\_class | Enter the desired node type. The default and cheapest option is 'db.t2.micro' @ ~$0.017/hr, or ~$12/mo (https://aws.amazon.com/rds/postgresql/pricing/) | `string` | `"db.t2.micro"` | no |
| jdbc\_cidr | List of CIDR blocks which should be allowed to connect to the instance on the JDBC port. | `list(string)` | `[]` | no |
| jdbc\_port | Optional. Overrides the default JDBC port for incoming SQL connections. | `number` | `5432` | no |
| postgres\_version | Optional. Overrides the version of the Postgres database engine. | `string` | `"11.5"` | no |
| skip\_final\_snapshot | If true, will allow terraform to destroy the RDS cluster without performing a final backup. | `bool` | `false` | no |
| storage\_size\_in\_gb | The allocated storage size, in GB. | `string` | `"10"` | no |
| whitelist\_terraform\_ip | True to allow the terraform user to connect to the DB instance. | `bool` | `true` | no |
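
Usage mirrors the MySQL module above; a short sketch highlighting the Postgres-specific inputs (required logging inputs such as `s3_logging_bucket` are omitted here — see the table above — and the `environment` wiring is assumed as before):

```hcl
module "postgres_db" {
  source = "git::https://github.com/slalom-ggp/dataops-infra//catalog/aws/postgres?ref=master"

  name_prefix   = "demo-"
  environment   = module.env.environment  # assumed wiring, as above
  resource_tags = { project = "demo" }

  postgres_version   = "11.5"
  jdbc_port          = 5432
  jdbc_cidr          = ["10.0.0.0/16"]    # placeholder CIDR
  storage_size_in_gb = "50"
}
```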

## Outputs

4 changes: 4 additions & 0 deletions catalog/aws/redshift/README.md
@@ -15,15 +15,19 @@ Redshift is an AWS database platform which applies MPP (Massively-Parallel-Proce
| admin\_password | The initial admin password. Must be 8 characters long. | `string` | n/a | yes |
| elastic\_ip | Optional. An Elastic IP endpoint which will be used for routing incoming traffic. | `string` | n/a | yes |
| environment | Standard `environment` module input. | <pre>object({<br> vpc_id = string<br> aws_region = string<br> public_subnets = list(string)<br> private_subnets = list(string)<br> })</pre> | n/a | yes |
| identifier | Optional. The unique identifier for the redshift cluster. | `string` | n/a | yes |
| kms\_key\_id | Optional. The ARN for the KMS encryption key used in cluster encryption. | `string` | n/a | yes |
| name\_prefix | Standard `name_prefix` module input. | `string` | n/a | yes |
| resource\_tags | Standard `resource_tags` module input. | `map(string)` | n/a | yes |
| s3\_logging\_bucket | Optional. An S3 bucket to use for log collection. | `string` | n/a | yes |
| s3\_logging\_path | Required if `s3_logging_bucket` is set. The path within the S3 bucket to use for log storage. | `string` | n/a | yes |
| admin\_username | Optional (default='rsadmin'). The initial admin username. | `string` | `"rsadmin"` | no |
| jdbc\_cidr | List of CIDR blocks which should be allowed to connect to the instance on the JDBC port. | `list(string)` | `[]` | no |
| jdbc\_port | Optional. Overrides the default JDBC port for incoming SQL connections. | `number` | `5439` | no |
| node\_type | Enter the desired node type. The default and cheapest option is 'dc2.large' @ ~$0.25/hr, ~$180/mo (https://aws.amazon.com/redshift/pricing/) | `string` | `"dc2.large"` | no |
| num\_nodes | Optional (default=1). The number of Redshift nodes to use. | `number` | `1` | no |
| skip\_final\_snapshot | If true, will allow terraform to destroy the RDS cluster without performing a final backup. | `bool` | `false` | no |
| whitelist\_terraform\_ip | True to allow the terraform user to connect to the DB instance. | `bool` | `true` | no |
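
A usage sketch showing the `environment` object built by hand (all values are placeholders; several inputs the generated table flags as required are described as optional and may accept `null`):

```hcl
module "redshift" {
  source = "git::https://github.com/slalom-ggp/dataops-infra//catalog/aws/redshift?ref=master"

  name_prefix   = "demo-"
  resource_tags = { project = "demo" }

  environment = {
    vpc_id          = "vpc-0abc1234"                  # placeholder
    aws_region      = "us-east-1"
    public_subnets  = ["subnet-aaaa", "subnet-bbbb"]  # placeholders
    private_subnets = ["subnet-cccc", "subnet-dddd"]
  }

  admin_password = var.redshift_admin_password  # assumed variable; keep secrets out of source control
  node_type      = "dc2.large"
  num_nodes      = 2
}
```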

## Outputs

2 changes: 1 addition & 1 deletion catalog/aws/singer-taps/README.md
@@ -28,7 +28,7 @@ The Singer Taps platform is the open source stack which powers the [Stitcher](ht
| container\_ram\_gb | Optional. Specify the amount of RAM to be available to the container. | `number` | `1` | no |
| data\_file\_naming\_scheme | The naming pattern to use when landing new files in the data lake. Allowed variables are: `{tap}`, `{table}`, `{version}`, and `{file}` | `string` | `"{tap}/{table}/v{version}/{file}"` | no |
| scheduled\_sync\_times | A list of one or more daily sync times in `HHMM` format. E.g.: `0400` for 4am, `1600` for 4pm | `list(string)` | `[]` | no |
| scheduled\_timezone | The timezone used in scheduling.<br>Currently the following codes are supported: PST, EST, UTC | `string` | `"PT"` | no |
| scheduled\_timezone | The timezone used in scheduling.<br>Currently the following codes are supported: PST, PDT, EST, UTC | `string` | `"PT"` | no |
| state\_file\_naming\_scheme | The naming pattern to use when writing or updating state files. State files keep track of<br>data recency and are necessary for incremental loading. Allowed variables are: `{tap}`, `{table}`, `{version}`, and `{file}` | `string` | `"{tap}/{table}/state/{tap}-{table}-v{version}-state.json"` | no |
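
To illustrate the scheduling and naming inputs above (illustrative values only):

```hcl
scheduled_sync_times    = ["0400", "1600"]  # 4am and 4pm daily
scheduled_timezone      = "PST"
data_file_naming_scheme = "{tap}/{table}/v{version}/{file}"
```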

## Outputs
50 changes: 48 additions & 2 deletions components/README.md
@@ -8,8 +8,11 @@ These components define the technical building blocks which enable advanced, rea
1. [AWS Components](#aws-components)
- [AWS EC2](#aws-ec2)
- [AWS ECR](#aws-ecr)
- [AWS ECR-Image](#aws-ecr-image)
- [AWS ECS-Cluster](#aws-ecs-cluster)
- [AWS ECS-Task](#aws-ecs-task)
- [AWS Glue-Crawler](#aws-glue-crawler)
- [AWS Glue-Job](#aws-glue-job)
- [AWS Lambda-Python](#aws-lambda-python)
- [AWS RDS](#aws-rds)
- [AWS Redshift](#aws-redshift)
@@ -47,6 +50,29 @@ should not be accessible to external users.

-------------------

### [AWS ECR-Image](../components/aws/ecr-image/README.md)

ECR (Elastic Container Registry) is the privately hosted AWS equivalent of DockerHub.
ECR allows you to securely publish docker images which should not be accessible to external users.

Known Issue (TODO): ECR push requires that the CLI credentials at runtime (`terraform apply`) match
the project's AWS credentials, as specified in `.secrets/aws-credentials`.

This _might_ help:

```bash
cd dataops-infra
# Point the AWS CLI and the terraform AWS provider at the project's credentials file:
export AWS_SHARED_CREDENTIALS_FILE="$(pwd)/.secrets/aws-credentials"
export AWS_PROFILE=default
cd infra
terraform apply
```

* Source: `git::https://github.com/slalom-ggp/dataops-infra//components/aws/ecr-image?ref=master`
* See the [AWS ECR-Image Readme](../components/aws/ecr-image/README.md) for input/output specs and additional info.

-------------------

### [AWS ECS-Cluster](../components/aws/ecs-cluster/README.md)

ECS, or EC2 Container Service, is able to run docker containers natively in AWS cloud. While the module can support classic EC2-based and Fargate,
@@ -73,6 +99,26 @@ Use in combination with the `ECS-Cluster` component.

-------------------

### [AWS Glue-Crawler](../components/aws/glue-crawler/README.md)

Glue is AWS's fully managed extract, transform, and load (ETL) service.
A Glue crawler is used to access a data store and create table definitions.
This can be used in conjunction with Amazon Athena to query flat files in S3 buckets using SQL.

* Source: `git::https://github.com/slalom-ggp/dataops-infra//components/aws/glue-crawler?ref=master`
* See the [AWS Glue-Crawler Readme](../components/aws/glue-crawler/README.md) for input/output specs and additional info.

-------------------

### [AWS Glue-Job](../components/aws/glue-job/README.md)

Glue is AWS's fully managed extract, transform, and load (ETL) service. A Glue job can be used to run ETL Python scripts.

* Source: `git::https://github.com/slalom-ggp/dataops-infra//components/aws/glue-job?ref=master`
* See the [AWS Glue-Job Readme](../components/aws/glue-job/README.md) for input/output specs and additional info.

-------------------

### [AWS Lambda-Python](../components/aws/lambda-python/README.md)

AWS Lambda is a platform which enables serverless execution of arbitrary functions. This module specifically focuses on the
@@ -155,7 +201,7 @@ Included automatically when creating this module:
* 1 VPC which contains the following:
* 2 private subnets (for resources which **do not** need a public IP address)
* 2 public subnets (for resources which do need a public IP address)
* 1 NAT gateway (allows private sugnet resources to reach the outside world)
* 1 NAT gateway (allows private subnet resources to reach the outside world)
* 1 Internet gateway (allows resources in public and private subnets to reach the internet)
* route tables and routes to connect all of the above

@@ -176,7 +222,7 @@ _(Coming soon)_

-------------------

_**NOTE:** This documentation was [auto-generated](../docs/build.py) using
_**NOTE:** This documentation was [auto-generated](build.py) using
`terraform-docs` and `s-infra` from `slalom.dataops`.
Please do not attempt to manually update this file._

23 changes: 23 additions & 0 deletions components/aws/ecr-image/README.md
@@ -9,6 +9,19 @@
ECR (Elastic Container Registry) is the privately hosted AWS equivalent of DockerHub.
ECR allows you to securely publish docker images which should not be accessible to external users.

Known Issue (TODO): ECR push requires that the CLI credentials at runtime (`terraform apply`) match
the project's AWS credentials, as specified in `.secrets/aws-credentials`.

This _might_ help:

```bash
cd dataops-infra
# Point the AWS CLI and the terraform AWS provider at the project's credentials file:
export AWS_SHARED_CREDENTIALS_FILE="$(pwd)/.secrets/aws-credentials"
export AWS_PROFILE=default
cd infra
terraform apply
```

## Inputs

| Name | Description | Type | Default | Required |
@@ -32,6 +45,16 @@ ECR allows you to securely publish docker images which should not be accessible

---------------------

## Source Files

_Source code for this module is available using the links below._

* [main.tf](main.tf)
* [outputs.tf](outputs.tf)
* [variables.tf](variables.tf)

---------------------

_**NOTE:** This documentation was auto-generated using
`terraform-docs` and `s-infra` from `slalom.dataops`.
Please do not attempt to manually update this file._
11 changes: 11 additions & 0 deletions components/aws/glue-crawler/README.md
@@ -30,6 +30,17 @@ This can be used in conjunction with Amazon Athena to query flat files in S3 buck

---------------------

## Source Files

_Source code for this module is available using the links below._

* [iam.tf](iam.tf)
* [main.tf](main.tf)
* [outputs.tf](outputs.tf)
* [variables.tf](variables.tf)

---------------------

_**NOTE:** This documentation was auto-generated using
`terraform-docs` and `s-infra` from `slalom.dataops`.
Please do not attempt to manually update this file._
11 changes: 11 additions & 0 deletions components/aws/glue-job/README.md
@@ -29,6 +29,17 @@ Glue is AWS's fully managed extract, transform, and load (ETL) service. A Glue j

---------------------

## Source Files

_Source code for this module is available using the links below._

* [iam.tf](iam.tf)
* [main.tf](main.tf)
* [outputs.tf](outputs.tf)
* [variables.tf](variables.tf)

---------------------

_**NOTE:** This documentation was auto-generated using
`terraform-docs` and `s-infra` from `slalom.dataops`.
Please do not attempt to manually update this file._
