Infrastructure as Code - AWS

Terraform is used to create the underlying infrastructure to support Hydra.

Remote State

When working collaboratively on a Terraform project, it is recommended to use remote state and lock files. The state and locks should be stored remotely before any infrastructure is provisioned. This can be accomplished with an S3 bucket to store the state of a Terraform project and a DynamoDB table to store its locks. To set up this remote backend, complete the following steps.
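
Once this backend exists, downstream Terraform projects (such as the batch and mlflow projects below) point at it through an S3 backend block in their main.tf. A minimal sketch, with hypothetical bucket, key, region, and table values:

    terraform {
      backend "s3" {
        bucket         = "hydra-terraform-state"    # hypothetical bucket name
        key            = "batch/terraform.tfstate"  # path where this project's state is stored
        region         = "us-east-1"                # hypothetical region
        dynamodb_table = "hydra-terraform-locks"    # hypothetical lock table name
        encrypt        = true
      }
    }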

Prerequisites

  • Read and write access to S3 and DynamoDB in your AWS account

Getting Started

  1. Configure your AWS CLI.

  2. Change directory into ~/hydra/iac/aws/remote_state.

  3. Create a variable definitions file (.tfvars) with suitable values, or enter variable values manually on the command line.

  4. Initialize the Terraform project.

    terraform init
    
  5. Review and authorize changes to the infrastructure.

    terraform apply
    

Input Variables

| Name | Description | Type |
| --- | --- | --- |
| remote_state_bucket_name | The name of the bucket that stores the remote state of a Terraform project | string |
| remote_locks_dynamodb_name | The name of the DynamoDB table that stores the remote locks of a Terraform project | string |
| remote_locks_read_capacity | The read capacity of the DynamoDB table that stores the remote locks of a Terraform project | number |
| remote_locks_write_capacity | The write capacity of the DynamoDB table that stores the remote locks of a Terraform project | number |
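
For illustration, a variable definitions file for this project might look like the following (all values are hypothetical):

    remote_state_bucket_name    = "hydra-terraform-state"
    remote_locks_dynamodb_name  = "hydra-terraform-locks"
    remote_locks_read_capacity  = 5
    remote_locks_write_capacity = 5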

Batch

Batch is an AWS cloud service for running large-scale batch computing jobs. Batch relies on compute environments, which contain AWS ECS container instances to run containerized batch jobs, and it dynamically scales the allocated compute capacity based on demand. Jobs are submitted to job queues, where they reside until they can be scheduled to run in a compute environment.

To support jobs being run using Batch in Hydra, this infrastructure setup tracks job metadata from Hydra in an RDS MySQL instance whose credentials are stored in Secrets Manager. The database is initialized with a table setup SQL script, stored as an S3 bucket object, that defines the schemas to be used within the tables of the database. Once the database has been created, Terraform invokes a lambda function that sequentially executes the data definition commands in the SQL script against the newly created database.
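
As a rough sketch of how such an invocation can be expressed, the AWS provider offers an aws_lambda_invocation data source; the exact wiring in the lambda module may differ, and the names below are assumptions:

    # Invoke the initialization function once the database and function exist.
    data "aws_lambda_invocation" "initialize_db" {
      function_name = var.lambda_function_name   # assumed variable wiring
      input = jsonencode({
        db_host = var.database_hostname          # assumed event payload
      })
    }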

Prerequisites

  • Read and write access to all of the above-mentioned services in your AWS account

Getting Started

  1. Configure your AWS CLI.

  2. Change directory into ~/hydra/iac/aws/batch.

  3. In main.tf, set the appropriate bucket, region, and dynamodb_table values to the remote backend that was created earlier. Set the key value to the path where the state will be stored.

  4. Create a variable definitions file (.tfvars) with suitable values, or enter variable values manually on the command line.

  5. Initialize the Terraform project.

    terraform init
    
  6. NOTE: Steps 7 and 8 can be completed together using make build_lambda.

  7. Change directory into ~/hydra/iac/aws/batch/modules/lambda/function. Execute the following command.

    pip3 install pymysql sqlalchemy -t .
    
  8. Next, compress all folders and files in this directory into a ZIP file named batch_lambda.zip.

    zip -r batch_lambda.zip .
    
  9. Change directory into ~/hydra/iac/aws/batch. Review and authorize changes to the infrastructure.

    terraform apply
    

Modules

permissions

The permissions module is responsible for creating IAM roles with appropriate permissions to attach to the compute environment service, the compute environment instances, and the lambda function.
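
As a sketch of the pattern this module follows (not its exact code), the compute environment service role and its policy attachments might look like:

    resource "aws_iam_role" "compute_environment_service" {
      name = var.compute_envionment_service_role_name
      # Allow the AWS Batch service to assume this role.
      assume_role_policy = jsonencode({
        Version = "2012-10-17"
        Statement = [{
          Effect    = "Allow"
          Action    = "sts:AssumeRole"
          Principal = { Service = "batch.amazonaws.com" }
        }]
      })
    }

    resource "aws_iam_role_policy_attachment" "compute_environment_service" {
      count      = length(var.compute_envionment_service_iam_policy_arn)
      role       = aws_iam_role.compute_environment_service.name
      policy_arn = var.compute_envionment_service_iam_policy_arn[count.index]
    }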

Input Variables

| Name | Description | Type |
| --- | --- | --- |
| compute_envionment_service_role_name | IAM name of the compute environment service role | string |
| compute_envionment_service_iam_policy_arn | IAM policies attached to compute environment service role | list(string) |
| compute_envionment_instance_role_name | IAM name of the compute environment instance role | string |
| compute_envionment_instance_iam_policy_arn | IAM policies attached to compute environment instance role | list(string) |
| lambda_service_role_name | IAM name of the lambda function service role | string |
| lambda_service_iam_policy_arn | IAM policies attached to the lambda function service role | list(string) |

Output Values

| Name | Description |
| --- | --- |
| compute_environment_service_role_arn | ARN of the IAM compute environment service role |
| compute_environmnet_instance_profile_arn | ARN of the IAM compute environment instance profile |
| lambda_service_role_arn | ARN of the IAM lambda function service role |

secrets

The secrets module is responsible for randomly generating a username and password and then storing them in Secrets Manager.
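
The password half of the pattern might look like the following sketch (the username follows the same shape; resource names are assumptions):

    resource "random_password" "password" {
      length  = var.password_length
      special = false
    }

    resource "aws_secretsmanager_secret" "password" {
      name                    = var.password_secret_name
      recovery_window_in_days = var.password_recovery_window
    }

    resource "aws_secretsmanager_secret_version" "password" {
      secret_id     = aws_secretsmanager_secret.password.id
      secret_string = random_password.password.result
    }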

Input Variables

| Name | Description | Type |
| --- | --- | --- |
| username_length | Number of characters in randomly generated username | number |
| password_length | Number of characters in randomly generated password | number |
| username_recovery_window | Number of days before username secret can be deleted | number |
| password_recovery_window | Number of days before password secret can be deleted | number |
| username_secret_name | Name of the username secret | string |
| password_secret_name | Name of the password secret | string |

Output Values

| Name | Description |
| --- | --- |
| username | Randomly generated username |
| username_secret | Secret name of the username of the RDS instance |
| password | Randomly generated password |
| password_secret | Secret name of the password of the RDS instance |

networking

The networking module is responsible for creating a database subnet group to be associated with an RDS instance.
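
In Terraform this maps onto a single resource; a minimal sketch using the variables below:

    resource "aws_db_subnet_group" "rds" {
      name       = var.rds_subnet_group_name
      subnet_ids = var.rds_subnets
    }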

Input Variables

| Name | Description | Type |
| --- | --- | --- |
| rds_subnet_group_name | Name of the database subnet group to be created | string |
| rds_subnets | The IDs of the subnets to be attached to the RDS subnet group | list(string) |

Output Values

| Name | Description |
| --- | --- |
| db_subnet_group | Name of the created database subnet group |

storage

The storage module is responsible for creating an RDS MySQL instance and storing the table setup SQL script as an S3 object.
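
A sketch of the two central resources, wired to the variables below (the actual module may differ; db_name was called name on older AWS provider versions):

    resource "aws_s3_bucket_object" "table_setup_script" {
      bucket = var.table_setup_script_bucket_name
      key    = var.table_setup_script_bucket_key
      source = var.table_setup_script_local_path
    }

    resource "aws_db_instance" "batch_backend_store" {
      identifier             = var.batch_backend_store_identifier
      engine                 = "mysql"
      engine_version         = var.db_engine_version
      instance_class         = var.db_instance_class
      allocated_storage      = var.allocated_storage
      storage_type           = var.storage_type
      db_name                = var.db_default_name   # "name" on older AWS providers
      username               = var.db_username
      password               = var.db_password
      db_subnet_group_name   = var.db_subnet_group_name
      vpc_security_group_ids = var.vpc_security_groups
      publicly_accessible    = var.publicly_accessible
      skip_final_snapshot    = var.skip_final_snapshot
    }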

Input Variables

| Name | Description | Type |
| --- | --- | --- |
| table_setup_script_bucket_name | The name of the S3 bucket that will store the table setup script | string |
| table_setup_script_bucket_key | The key of the S3 bucket object that will store the table setup script | string |
| table_setup_script_local_path | The local path of the SQL script to be executed in RDS | string |
| batch_backend_store_identifier | The identifier of the RDS database to be created | string |
| allocated_storage | The allocated storage of the RDS database to be created (in GiB) | string |
| storage_type | The storage type of the RDS database to be created | string |
| db_engine_version | The engine version of the RDS MySQL database | string |
| db_instance_class | The instance class of the RDS database to be created | string |
| db_default_name | The name of the default database that is created in RDS | string |
| skip_final_snapshot | Whether a final snapshot is created immediately before the database is deleted | bool |
| db_username | The admin username of the database | string |
| db_password | The admin password of the database | string |
| db_subnet_group_name | The name of the database subnet group | string |
| vpc_security_groups | The security groups associated with the database instance | list(string) |
| publicly_accessible | Whether the RDS instance is made publicly accessible | bool |

Output Values

| Name | Description |
| --- | --- |
| db_host | The hostname of the created RDS MySQL instance |
| table_setup_script_bucket_name | S3 bucket that stores the table setup SQL script |
| table_setup_script_bucket_key | S3 key of the bucket object that stores the table setup SQL script |

batch

The batch module is responsible for dynamically creating job queues and compute environments in AWS Batch.
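
A sketch of the compute environment half, iterating over the list of maps described below (the module's actual code may use count rather than for_each):

    resource "aws_batch_compute_environment" "batch_compute_environment" {
      for_each                 = { for ce in var.compute_environments : ce.name => ce }
      compute_environment_name = each.key
      type                     = var.compute_environment_type
      service_role             = var.compute_environment_service_role_arn

      compute_resources {
        type               = var.compute_environment_resource_type
        instance_type      = [each.value.instance_type]
        min_vcpus          = var.compute_environment_min_vcpus
        max_vcpus          = var.compute_environment_max_vcpus
        instance_role      = var.compute_environment_instance_profile_arn
        security_group_ids = var.compute_environment_security_group_ids
        subnets            = var.compute_environment_subnets
      }
    }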

Input Variables

| Name | Description | Type |
| --- | --- | --- |
| aws_region | AWS region | string |
| compute_environments | List of maps of compute environments to be created; map key name is 'name', map value name is 'instance_type' | list |
| compute_environment_instance_profile_arn | ARN of the instance profile to be used in the compute environment | string |
| compute_environment_resource_type | Resource type to be used in the compute environment; valid options are 'EC2' or 'SPOT' | string |
| compute_environment_max_vcpus | Maximum vCPUs that the compute environment should maintain | number |
| compute_environment_min_vcpus | Minimum vCPUs that the compute environment should maintain | number |
| compute_environment_security_group_ids | EC2 security groups associated with instances within the compute environment | list(string) |
| compute_environment_service_role_arn | ARN of the IAM role allowing Batch to call other services | string |
| compute_environment_subnets | Subnets that compute resources are launched in | list(string) |
| compute_environment_type | The type of the compute environment; valid options are 'MANAGED' or 'UNMANAGED' | string |
| job_queues | List of maps of job queues to be created; map key name is 'name', map value name is 'compute_environment' | list |
| job_queue_priority | Priority of the job queue | number |
| job_queue_state | The state of the job queue; valid options are 'ENABLED' or 'DISABLED' | string |

lambda

The lambda module is responsible for building a lambda function with the handler batch_lambda.initialize_db and then invoking it.
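
A sketch of the function resource using the variables below; the runtime is an assumption:

    resource "aws_lambda_function" "batch_lambda" {
      function_name = var.lambda_function_name
      filename      = var.lambda_function_file_path   # the batch_lambda.zip artifact
      handler       = "batch_lambda.initialize_db"
      runtime       = "python3.8"                     # assumed runtime
      role          = var.lambda_service_role_arn
      timeout       = var.lambda_function_timeout

      # Run inside the VPC so the function can reach the RDS instance.
      vpc_config {
        subnet_ids         = var.lambda_subnets
        security_group_ids = var.lambda_security_group_ids
      }
    }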

Input Variables

| Name | Description | Type |
| --- | --- | --- |
| aws_region | AWS region | string |
| lambda_service_role_arn | ARN of the IAM lambda function service role | string |
| lambda_function_file_path | File path of the lambda ZIP file that will be executed | string |
| lambda_function_timeout | Timeout of the executed lambda function (in seconds) | number |
| lambda_function_name | Name of the lambda function that will be created | string |
| lambda_security_group_ids | List of security groups to attach the lambda function to | list(string) |
| lambda_subnets | List of subnets to attach the lambda function to | list(string) |
| database_hostname | Hostname of the Batch RDS instance | string |
| database_username_secret | Secret name of the Batch RDS username | string |
| database_password_secret | Secret name of the Batch RDS password | string |
| database_default_name | Default database name created in the RDS instance | string |
| table_setup_script_bucket_name | The name of the S3 bucket that will store the table setup script | string |
| table_setup_script_bucket_key | The key of the S3 bucket object that will store the table setup script | string |

Output Values

| Name | Description |
| --- | --- |
| lambda_invocation_response | Output of the invoked lambda function initializing the batch database |

Useful Commands

make destroy

To destroy all of the Batch-associated infrastructure in Hydra created with Terraform, there is a 3-step process involved:

  1. Destroy all the job queues managed by this Terraform project.

    terraform destroy -target=module.batch.aws_batch_job_queue.batch_job_queue
    
  2. Destroy all of the compute environments managed by this Terraform project.

    terraform destroy -target=module.batch.aws_batch_compute_environment.batch_compute_environment
    
  3. Destroy all remaining infrastructure managed by this Terraform project.

    terraform destroy
    

All of these steps can be run at once using make destroy.

make build_lambda

  1. Install the PyPI libraries pymysql and sqlalchemy locally in the path ~/hydra/iac/aws/batch/modules/lambda/function.

    pip3 install pymysql sqlalchemy -t ./modules/lambda/function
    
  2. Change directory into the path ~/hydra/iac/aws/batch/modules/lambda/function, create a ZIP file called batch_lambda.zip compressing all existing files in this path, and change directory back into the path ~/hydra/iac/aws/batch.

    cd ./modules/lambda/function; zip -r batch_lambda.zip .; cd ../../..
    

All of these steps can be run at once using make build_lambda.

MLflow

MLflow is an open source platform that manages the machine learning lifecycle. In this infrastructure setup, a Docker container runs the MLflow tracking server and is deployed using ECS Fargate, with the Docker images stored in ECR. MLflow logs are stored in an RDS MySQL instance whose credentials are stored in Secrets Manager, and MLflow models are stored in an S3 bucket. Autoscaling is set up so that containers are automatically deployed and destroyed based on demand. To set up this infrastructure, complete the following steps.

NOTE: In this infrastructure build, when tracking jobs under a new experiment name, create the experiment in the MLflow UI before tracking jobs under it.

Prerequisites

  • Read and write access to all of the above-mentioned services in your AWS account

Getting Started

  1. Configure your AWS CLI.

  2. Change directory into ~/hydra/iac/aws/mlflow.

  3. In main.tf, set the appropriate bucket, region, and dynamodb_table values to the remote backend that was created earlier. Set the key value to the path where the state will be stored.

  4. Create a variable definitions file (.tfvars) with suitable values, or enter variable values manually on the command line.

  5. Initialize the Terraform project.

    terraform init
    
  6. In docker_push.sh, set the region and repository variables to the desired values for your Docker image repository on ECR.

  7. NOTE: Steps 8, 9, and 10 can be completed together by running make start.

  8. Create the Docker image registry on ECR.

    terraform apply -target=module.container_repository.aws_ecr_repository.mlflow_container_repository
    
  9. Build the Docker image from the local Dockerfile and push it to ECR.

    bash docker_push.sh
    
  10. Review and authorize changes to the remaining infrastructure.

    terraform apply
    

Modules

container_repository

The container_repository module is responsible for creating a Docker image registry using ECR.
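
The resource address used by the targeted apply above (aws_ecr_repository.mlflow_container_repository) suggests a shape like the following sketch:

    resource "aws_ecr_repository" "mlflow_container_repository" {
      name = var.mlflow_container_repository

      image_scanning_configuration {
        scan_on_push = var.scan_on_push
      }
    }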

Input Variables

| Name | Description | Type |
| --- | --- | --- |
| mlflow_container_repository | Name of the Docker container registry to be created | string |
| scan_on_push | Whether Docker images are scanned for vulnerabilities on push | bool |

Output Values

| Name | Description |
| --- | --- |
| container_repository_url | URL of the created Docker container registry |

permissions

The permissions module is responsible for creating an IAM role that ECS uses to execute its tasks, and a security group to control inbound and outbound traffic.
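
A sketch of the security group half of this module; the MLflow server port (5000) is an assumption:

    resource "aws_security_group" "mlflow_sg" {
      name   = var.mlflow_sg
      vpc_id = var.vpc_id

      # Allow inbound traffic to the MLflow server from the given CIDR blocks.
      ingress {
        from_port   = 5000
        to_port     = 5000
        protocol    = "tcp"
        cidr_blocks = var.cidr_blocks
      }

      # Allow all outbound traffic.
      egress {
        from_port   = 0
        to_port     = 0
        protocol    = "-1"
        cidr_blocks = ["0.0.0.0/0"]
      }
    }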

Input Variables

| Name | Description | Type |
| --- | --- | --- |
| mlflow_ecs_tasks_role | Name of the IAM role to be created | string |
| ecs_task_iam_policy_arn | IAM policies to be attached to the created IAM role | list(string) |
| mlflow_sg | Name of the security group to be created | string |
| vpc_id | The ID of the VPC to be used for the security group | string |
| cidr_blocks | List of CIDR blocks to allow ingress access | list(string) |

Output Values

| Name | Description |
| --- | --- |
| mlflow_sg_id | ID of the created security group |
| mlflow_ecs_tasks_role_arn | ARN of the created IAM role that will execute ECS tasks |

networking

The networking module is responsible for creating a database subnet group to be associated with an RDS instance.

Input Variables

| Name | Description | Type |
| --- | --- | --- |
| rds_subnet_group_name | Name of the database subnet group to be created | string |
| rds_subnets | The IDs of the subnets to be attached to the RDS subnet group | list(string) |

Output Values

| Name | Description |
| --- | --- |
| db_subnet_group | Name of the created database subnet group |

secrets

The secrets module is responsible for randomly generating a username and password and then storing them in Secrets Manager.

Input Variables

| Name | Description | Type |
| --- | --- | --- |
| username_length | Number of characters in randomly generated username | number |
| password_length | Number of characters in randomly generated password | number |
| username_recovery_window | Number of days before username secret can be deleted | number |
| password_recovery_window | Number of days before password secret can be deleted | number |
| username_secret_name | Name of the username secret | string |
| password_secret_name | Name of the password secret | string |

Output Values

| Name | Description |
| --- | --- |
| username | Randomly generated username |
| username_arn | ARN of username secret |
| password | Randomly generated password |
| password_arn | ARN of password secret |

load_balancing

The load_balancing module is responsible for creating an application load balancer and its target group.
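
A sketch of the load balancer and its target group; the port and the "ip" target type (required for Fargate tasks) are assumptions:

    resource "aws_lb" "mlflow" {
      name               = var.lb_name
      load_balancer_type = "application"
      security_groups    = var.lb_security_groups
      subnets            = var.lb_subnets
    }

    resource "aws_lb_target_group" "mlflow" {
      name        = var.lb_target_group
      vpc_id      = var.vpc_id
      port        = 5000    # assumed MLflow port
      protocol    = "HTTP"
      target_type = "ip"    # Fargate tasks register by IP
    }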

Input Variables

| Name | Description | Type |
| --- | --- | --- |
| vpc_id | The ID of the VPC to be used for the load balancer | string |
| lb_name | Name of the application load balancer to be created | string |
| lb_security_groups | List of the security group IDs to be attached to the load balancer | list(string) |
| lb_subnets | List of the subnets to be attached to the load balancer (must be at least two) | list(string) |
| lb_target_group | Name of the load balancer target group to be created | string |

Output Values

| Name | Description |
| --- | --- |
| lb_target_group_arn | ARN of the created load balancer target group |

storage

The storage module is responsible for creating an RDS MySQL instance and S3 bucket to be the MLflow backend store and artifact store respectively.

Input Variables

| Name | Description | Type |
| --- | --- | --- |
| mlflow_artifact_store | The name of the S3 bucket to be created | string |
| mlflow_backend_store_identifier | The identifier of the RDS database to be created | string |
| allocated_storage | The allocated storage of the RDS database to be created (in GiB) | string |
| storage_type | The storage type of the RDS database to be created | string |
| db_engine_version | The engine version of the RDS MySQL database | string |
| db_instance_class | The instance class of the RDS database to be created | string |
| db_default_name | The name of the default database that is created in RDS | string |
| skip_final_snapshot | Whether a final snapshot is created immediately before the database is deleted | bool |
| db_username | The admin username of the database | string |
| db_password | The admin password of the database | string |
| db_subnet_group_name | The name of the database subnet group | string |
| vpc_security_groups | The security groups associated with the database instance | list(string) |

Output Values

| Name | Description |
| --- | --- |
| db_host | The hostname of the created RDS MySQL instance |
| db_name | The name of the default database of the RDS MySQL instance |
| s3_bucket | The name of the created S3 bucket |

task_deployment

The task_deployment module is responsible for creating an ECS cluster, creating a Fargate task definition that runs an MLflow tracking service, and running an ECS service that uses deployed instances of this task.
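
The service's resource address (module.task_deployment.aws_ecs_service.service) appears in the make update_container steps below; a sketch of its likely shape, with the cluster and task definition wiring assumed:

    resource "aws_ecs_service" "service" {
      name            = var.ecs_service_name
      cluster         = aws_ecs_cluster.mlflow_server_cluster.id   # assumed resource name
      task_definition = aws_ecs_task_definition.mlflow.arn         # assumed resource name
      desired_count   = 1
      launch_type     = "FARGATE"

      network_configuration {
        subnets         = var.ecs_service_subnets
        security_groups = var.ecs_service_security_groups
      }

      # Register the MLflow container with the load balancer target group.
      load_balancer {
        target_group_arn = var.aws_lb_target_group_arn
        container_name   = var.container_name
        container_port   = 5000   # assumed MLflow port
      }
    }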

Input Variables

| Name | Description | Type |
| --- | --- | --- |
| aws_region | AWS region | string |
| mlflow_server_cluster | Name of the MLflow server cluster to be created | string |
| ecs_service_name | Name of the ECS Fargate service to be created | string |
| cloudwatch_log_group | Name of the CloudWatch log group to be created and associated with the service | string |
| mlflow_ecs_task_family | Name of the ECS task family to be created | string |
| container_name | Name of the container to be run using a Fargate task | string |
| s3_bucket_name | Name of the S3 bucket that will be the artifact store | string |
| s3_bucket_folder | Name of the folder in the S3 bucket that will be used to store models | string |
| db_name | Name of the database that will be used as the backend store | string |
| db_host | Hostname of the database that will be used as the backend store | string |
| db_port | Port of the database connection | string |
| docker_image | URL of the Docker image that will be used to run the task in Fargate | string |
| task_memory | Total memory to be used by a single Fargate task (in MiB) | number |
| task_cpu | Number of CPU units to be used by a single Fargate task | number |
| admin_username_arn | ARN of the RDS admin username secret | string |
| admin_password_arn | ARN of the RDS admin password secret | string |
| task_role_arn | ARN of the IAM task role | string |
| execution_role_arn | ARN of the IAM execution role | string |
| aws_lb_target_group_arn | ARN of the load balancer target group | string |
| ecs_service_subnets | Subnets to be used in the network configuration of the created ECS service | list(string) |
| ecs_service_security_groups | Security groups to be attached to the created ECS service | list(string) |

Output Values

| Name | Description |
| --- | --- |
| ecs_service_name | Name of the created ECS service |
| ecs_cluster_name | Name of the created ECS cluster |

autoscaling

The autoscaling module is responsible for creating and attaching autoscaling policies based on CPU and memory utilization percentage.
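
A sketch of the CPU policy as target-tracking autoscaling (the memory policy is analogous, using ECSServiceAverageMemoryUtilization):

    resource "aws_appautoscaling_target" "ecs" {
      service_namespace  = "ecs"
      resource_id        = "service/${var.server_cluster_name}/${var.ecs_service_name}"
      scalable_dimension = "ecs:service:DesiredCount"
      min_capacity       = var.min_tasks
      max_capacity       = var.max_tasks
    }

    resource "aws_appautoscaling_policy" "cpu" {
      name               = var.cpu_autoscaling_policy_name
      policy_type        = "TargetTrackingScaling"
      service_namespace  = aws_appautoscaling_target.ecs.service_namespace
      resource_id        = aws_appautoscaling_target.ecs.resource_id
      scalable_dimension = aws_appautoscaling_target.ecs.scalable_dimension

      target_tracking_scaling_policy_configuration {
        target_value       = var.cpu_autoscale_target
        scale_in_cooldown  = var.cpu_autoscale_in_cooldown
        scale_out_cooldown = var.cpu_autoscale_out_cooldown

        predefined_metric_specification {
          predefined_metric_type = "ECSServiceAverageCPUUtilization"
        }
      }
    }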

Input Variables

| Name | Description | Type |
| --- | --- | --- |
| server_cluster_name | Name of the ECS cluster to create the autoscaling policy in | string |
| ecs_service_name | Name of the ECS service to create the autoscaling policy in | string |
| min_tasks | Minimum number of running tasks in the ECS service | number |
| max_tasks | Maximum number of running tasks in the ECS service | number |
| memory_autoscaling_policy_name | Name of the memory autoscaling policy to be created | string |
| cpu_autoscaling_policy_name | Name of the CPU autoscaling policy to be created | string |
| memory_autoscale_in_cooldown | Cooldown time for scale-in based on the memory metric (in seconds) | number |
| memory_autoscale_out_cooldown | Cooldown time for scale-out based on the memory metric (in seconds) | number |
| memory_autoscale_target | Target value of memory utilization percentage in each task | number |
| cpu_autoscale_in_cooldown | Cooldown time for scale-in based on the CPU metric (in seconds) | number |
| cpu_autoscale_out_cooldown | Cooldown time for scale-out based on the CPU metric (in seconds) | number |
| cpu_autoscale_target | Target value of CPU utilization percentage in each task | number |

Useful Commands

make start

To create an MLflow server running on ECS Fargate from scratch, there is a 3-step process involved:

  1. Create the Docker image registry on ECR.

    terraform apply -target=module.container_repository.aws_ecr_repository.mlflow_container_repository
    
  2. Build the Docker image from the local Dockerfile and push it to ECR.

    bash docker_push.sh
    
  3. Review and authorize changes to the remaining infrastructure.

    terraform apply
    

All of these steps can be run at once using make start.

make update_container

To update the base image that is used to run the MLflow server in ECS, there is a 3-step process involved:

  1. Destroy the existing service and its dependent services.

    terraform destroy -target=module.task_deployment.aws_ecs_service.service
    
  2. Build the Docker image from the local Dockerfile and push it to ECR.

    bash docker_push.sh
    
  3. Re-apply and authorize the changes to create the service in ECS.

    terraform apply
    

All of these steps can be run at once using make update_container.