Introduction

This repository provides IaC (Infrastructure as Code) to replicate the environment used to produce the results of the research paper, and it can serve as a starting point if you want to do the same. All the necessary resources are created in Amazon Web Services (AWS) cloud infrastructure via Terraform. The Terraform configuration is wrapped into a single module that leverages a number of sub-modules. The root module primarily deploys the OpenCBDC test controller along with numerous supporting resources. You can follow the steps of this README in order to deploy the test controller. If you are new to Terraform, when you reach Provision, it is recommended that you use the pre-created configuration linked there as the entrypoint for your deployment.

Architecture

This module deploys the test controller as an AWS ECS task. The ECS service can be configured to use either EC2 instances or Fargate. The main function of the test controller is to schedule agent processes across one to three regions for testing Project Hamilton's architectures. Agent processes are scheduled on AWS EC2 instances, which are provisioned via EC2 launch templates.

The test controller itself is provisioned in the us-east-1 region. A subset of resources is replicated in the us-east-2 and us-west-2 regions in order to schedule multi-region test runs. A VPC is provisioned in each of these three regions, along with VPC peering connections and VPC endpoints for internal communication between resources.

A pipeline is set up via AWS CodePipeline which clones the test controller's source code, then builds and pushes several artifacts: a container image for the test controller, a container image used to seed the environment with data for test runs, and the binary used to schedule agents during test runs. Both container images are pushed to AWS ECR registries, and the agent binary is pushed to an S3 bucket.

Seeding initial outputs is handled via an AWS Batch job, which the test controller schedules before a test run when necessary. An AWS Batch compute environment, job definition, and job queue are all provisioned by default to support this. Upon being scheduled, agent instances pull the agent binary from S3, then execute it to communicate with the test controller and receive instructions. This process is defined in the agents' EC2 launch template.

Two AWS Network Load Balancers are deployed by the module. One forwards traffic to the test controller's UI; the other supports communication between agents and the test controller. A bastion host is provided for troubleshooting the environment as well as pulling down raw test data if you wish to gather your own insights. To access the bastion host, you can either use ssh, which is configured by this module, or AWS Session Manager.

Diagram

Required Software

The module requires that you have Terraform installed. Specifics about versioning are listed here. The AWS CLI is also useful, but not strictly necessary. If you have other Terraform projects with different version requirements, you can manage them with tfenv. This project is pre-configured to pull the proper Terraform version via tfenv; simply run tfenv install. Docker must be installed and running on your local machine. You won't need to run any Docker commands; just be sure that it's running. If you're unfamiliar with Docker and curious, you can take a look at their getting started page.
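
If you maintain your own root configuration, a version-pinning block along these lines keeps it aligned with the constraints listed under Requirements. This is only a sketch: the constraint values are copied from that section, and the `hashicorp/aws` source address is the usual registry address for the AWS provider, not something confirmed from this repository.

```hcl
terraform {
  # Exact Terraform version listed under Requirements.
  required_version = "= 0.13.4"

  required_providers {
    aws = {
      # Assumed registry source for the AWS provider.
      source  = "hashicorp/aws"
      version = "~> 3.0"
    }
  }
}
```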

Pre-Provision

Generate and Add an SSH Key

This module requires that you provide an SSH public key, which will be used to generate an Amazon EC2 key pair. AWS can use either ED25519 or 2048-bit SSH-2 RSA keys. There are a number of third-party tools that can be used to generate an appropriate key pair. One way is via the ssh-keygen command provided by OpenSSH.

$ ssh-keygen -t rsa -b 2048 -f /path/to/key/file/id_rsa

Installation for OpenSSH will depend on the OS of your machine.

  • On macOS, OpenSSH should be installed by default.
  • On Windows, you may need to follow additional steps.
  • On Ubuntu/Debian/Linux Mint:
$ sudo apt-get install openssh-client
  • On RHEL/CentOS/Fedora:
$ sudo yum -y install openssh-clients

After doing so, provide the contents of the public key (id_rsa.pub) file to the module's public_key var. The ssh private key should remain private.
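
One way to hand the key to the module without pasting it into your configuration is to read the file at plan time. The sketch below assumes the key pair lives at `~/.ssh/id_rsa`; both that path and the local name `ssh_public_key` are hypothetical, so adjust them to match the path you gave `ssh-keygen -f`.

```hcl
locals {
  # Hypothetical key location: the path passed to `ssh-keygen -f`, plus ".pub".
  # pathexpand() resolves "~", file() reads the key, and trimspace() drops
  # the trailing newline that ssh-keygen writes.
  ssh_public_key = trimspace(file(pathexpand("~/.ssh/id_rsa.pub")))
}
```

`local.ssh_public_key` can then be assigned to the module's public_key input.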

Register a Domain

New Domain - Currently, the test controller requires that you own a domain with a registrar and have a hosted zone configured in Route53. The name of the hosted zone should be set as the base_domain var, and the necessary DNS records will be created by this Terraform module. If you don't currently own a domain, you can purchase one via the Route53 registrar; doing so creates a hosted zone in Route53 automatically. This is our recommended approach.

BYO Domain - If you already own a domain that you wish to use, you can do so; however, you'll still need to create a hosted zone in Route53. The module output route53_endpoints.name_servers will provide a list of name servers associated with the hosted zone. Use these to delegate DNS resolution for the domain to Route53. Usually this is done by creating an NS record wherever the base domain is hosted. For BYO domains, we recommend using a sub-domain (test.foo.com) as base_domain and delegating name server resolution for that sub-domain to Route53, rather than using the apex domain (foo.com). This module will create several certificates in AWS Certificate Manager which use DNS for validation. Be sure that the delegation for your base domain is in place before you run terraform apply, or else the certificates will fail to validate.
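
To surface the name servers from your own root configuration, you could re-export the module output. This is a minimal sketch: it assumes your root configuration calls this module in a block named `test_controller` (a hypothetical name) and that route53_endpoints is an object exposing a name_servers attribute, as the output path above suggests.

```hcl
# Assumes a module block named "test_controller" exists in the root configuration.
output "route53_name_servers" {
  description = "Name servers to use in the NS record at your current DNS host."
  value       = module.test_controller.route53_endpoints.name_servers
}
```

After an apply, `terraform output route53_name_servers` prints the list.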

Generate and Add a GitHub Access Token

Once deployed, this module will create a pipeline in AWS CodePipeline, which builds and pushes several container images related to the test controller. To do this, CodePipeline clones the test controller codebase, so it must be connected to a GitHub account. Authentication uses a personal access token, which should only need the public_repo permission. After creating the token, provide it to the module via the test_controller_github_access_token var.

Configure IAM Permissions

Terraform will require permission to access multiple services in AWS. Permissions in AWS are managed via the IAM service. Generally speaking, you want to grant a role the smallest set of permissions possible. This is known as the Principle of Least Privilege. Because Terraform interacts with such a wide array of services to deploy the test controller, for simplicity you can grant it the AdministratorAccess policy, attached to an IAM user that Terraform authenticates as. If you'd rather restrict Terraform's access with a fine-toothed comb, you certainly can.

Provision

This repo contains Terraform configuration mirroring that of the research paper here. It is intended to serve as the main entrypoint for your deployment. Deployment instructions are located here. If you want to configure the environment for your own tests, this module provides a number of inputs for doing so.
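
For orientation, a custom root configuration might call the module roughly as follows. Treat this as a sketch under stated assumptions rather than the configuration used for the paper: the module name, source path, environment name, email address, public key, and tag values are all placeholders, the "EC2" launch type value is inferred from the Inputs descriptions, and any provider wiring the module may need is omitted. The input names themselves come from the Inputs section.

```hcl
variable "test_controller_github_access_token" {
  description = "GitHub personal access token with the public_repo permission."
  type        = string
}

module "test_controller" {
  # Hypothetical path; point this at your checkout of this repository.
  source = "./path/to/this/module"

  # Inputs marked as required in the Inputs section (placeholder values).
  environment                         = "tctl-test"
  lets_encrypt_email                  = "admin@foo.com"
  test_controller_github_access_token = var.test_controller_github_access_token

  # Inputs described in Pre-Provision (placeholder values).
  base_domain = "test.foo.com"
  public_key  = "ssh-ed25519 AAAA...replace-with-your-public-key"

  # Optional overrides; "EC2" is an assumed valid value for the launch type.
  test_controller_launch_type = "EC2"
  cluster_instance_type       = "c5ad.12xlarge"
  agent_instance_types        = ["c5n.large", "c5n.2xlarge"]
  resource_tags               = { Project = "opencbdc-tctl" }
}
```

Leaving the token variable without a default lets you supply it through the `TF_VAR_test_controller_github_access_token` environment variable instead of committing it to version control.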

Post-Provision

Invoke the Certbot Lambda

The test controller requires an SSL certificate to allow for client connections via HTTPS. This module will provision a Lambda capable of generating an appropriate cert issued via Let's Encrypt. The Lambda is configured to run every twelve hours to check that the cert has not expired. If you wish to run tests in your environment immediately after provisioning, you will need to invoke the Lambda yourself. You can do this via the AWS CLI. Using the credentials you configured for your environment, run:

$ aws lambda invoke --region us-east-1 --function-name test-controller-certbot-lambda /dev/stdout


Note - The Lambda usually takes a few minutes to complete its execution.
Note - The Lambda will create a certificate in AWS Certificate Manager. This certificate is not tied to the Terraform automation, so you will need to delete it manually after running terraform destroy. You should delete it only after you've destroyed everything else. To do so, select the certificate with the test controller domain name test-controller.<base_domain> and hit "delete".

Monitor CodePipeline

The test controller pipeline should run automatically. All pipeline phases must succeed before you can run any tests. You can verify this by checking the most recent execution status of test-controller-pipeline in the AWS CodePipeline service.

Diagram

CodePipeline will poll for the latest changes to the test controller repo. This way you will receive updates automatically without any manual intervention. Occasionally, CodePipeline may fail during the deployment process. These are usually transient errors which will resolve by simply running the pipeline again. Using the credentials you configured for your environment, run:

$ aws codepipeline start-pipeline-execution --region us-east-1 --name test-controller-pipeline

Monitor Health Checks

Both the test controller's UI and API run inside a single ECS task. The task must be running and healthy before you can schedule test runs in your environment. Three sets of target groups are configured against the task: one as an entrypoint for agents, one for authentication, and one for the test controller's UI. The task will be scheduled under the test-controller service, which belongs to a cluster with the same name as the Terraform var environment. It's easiest to verify these in the AWS console. When the environment is healthy, the task and target groups should look like the following:

Running ECS Task

Healthy Target Group

Access the Test Controller

The module will generate some DNS records in AWS Route53 for you. A CNAME record is created in Route53 which points to the UI load balancer. Its format is test-controller.<base_domain>. The environment and base_domain values are taken from the corresponding Terraform vars. Assuming your environment is up and configured properly, you should be able to access the UI by entering the URL into any browser. In a fresh environment, you will need to add a client certificate into the environment in order to authenticate with the test controller. The process for this is documented in the test controller's README.
Note - This module configures port 8443 to route to the auth endpoint via the network load balancer. This means the port must be specified in the URL you enter into the browser: https://test-controller.<base_domain>:8443/auth. The appropriate record is also provided as the output route53_endpoints.ui_endpoint.
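
If you would like Terraform to print the full auth URL for you, an output in your root configuration can assemble it from the pieces described above. This is a small sketch; the variable and output names are hypothetical, and the URL format is simply the one quoted in the note.

```hcl
variable "base_domain" {
  description = "Same value you pass to the module's base_domain input."
  type        = string
}

output "test_controller_auth_url" {
  description = "URL to enter in the browser, including the auth port."
  value       = "https://test-controller.${var.base_domain}:8443/auth"
}
```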

Request Limit Increases (Optional)

Some plots shown in the paper require a great deal of compute power to reproduce. The default quotas for EC2 instances set on AWS accounts will likely be insufficient in some cases. The test controller schedules instances using the vCPUs available according to the service quota API, meaning it will run what it can instead of reporting errors. To reproduce entire plots, you will need to submit limit increase requests for several EC2 service quotas. Specifically:

| Quota Name | us-east-1 | us-east-2 | us-west-2 |
|---|---|---|---|
| All Standard (A, C, D, H, I, M, R, T, Z) Spot Instance Requests | 32,000 | 32,000 | 32,000 |
| Running On-Demand Standard (A, C, D, H, I, M, R, T, Z) instances | 32,000 | 32,000 | 32,000 |
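
If you prefer to track these requests in Terraform rather than the console, the AWS provider's `aws_servicequotas_service_quota` resource can submit them. The sketch below covers only the region of the default provider, and the quota codes are assumptions that you should confirm in the Service Quotas console or with `aws service-quotas list-service-quotas --service-code ec2`. Increase requests are reviewed by AWS, so they may not be granted immediately or in full.

```hcl
# Quota codes are assumed; verify them for your account before applying.
resource "aws_servicequotas_service_quota" "spot_standard" {
  service_code = "ec2"
  quota_code   = "L-34B43A08" # All Standard (A, C, D, H, I, M, R, T, Z) Spot Instance Requests
  value        = 32000
}

resource "aws_servicequotas_service_quota" "on_demand_standard" {
  service_code = "ec2"
  quota_code   = "L-1216C47A" # Running On-Demand Standard (A, C, D, H, I, M, R, T, Z) instances
  value        = 32000
}
```

For the other two regions, the same resources would be repeated with provider configurations that target us-east-2 and us-west-2.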

Requirements

| Name | Version |
|---|---|
| terraform | = 0.13.4 |
| aws | ~> 3.0 |

Providers

| Name | Version |
|---|---|
| aws | ~> 3.0 |
| aws.use2 | ~> 3.0 |
| aws.usw2 | ~> 3.0 |
| template | n/a |

Modules

| Name | Source | Version |
|---|---|---|
| bastion | ./modules/bastion | n/a |
| ec2_profile | terraform-aws-modules/ecs/aws//modules/ecs-instance-profile | 3.0.0 |
| ecs | terraform-aws-modules/ecs/aws | 3.0.0 |
| ecs_cluster_asg | terraform-aws-modules/autoscaling/aws | 3.9.0 |
| ecs_cluster_security_group | terraform-aws-modules/security-group/aws | 3.1.0 |
| route53_dns | ./modules/route53_dns | n/a |
| test_controller_agent_use1 | ./modules/test-controller-agent | n/a |
| test_controller_agent_use2 | ./modules/test-controller-agent | n/a |
| test_controller_agent_usw2 | ./modules/test-controller-agent | n/a |
| test_controller_deploy | ./modules/test-controller-deploy | n/a |
| test_controller_service | ./modules/test-controller | n/a |
| uhs_seed_generator | ./modules/uhs-seed-generator | n/a |
| vpc | terraform-aws-modules/vpc/aws | 2.70.0 |
| vpc_endpoints_use1 | ./modules/vpc-endpoints | n/a |
| vpc_endpoints_use2 | ./modules/vpc-endpoints | n/a |
| vpc_endpoints_usw2 | ./modules/vpc-endpoints | n/a |
| vpc_peering_connection_use1_use2 | ./modules/vpc-peering-connection | n/a |
| vpc_peering_connection_use1_usw2 | ./modules/vpc-peering-connection | n/a |
| vpc_peering_connection_use2_usw2 | ./modules/vpc-peering-connection | n/a |
| vpc_use2 | terraform-aws-modules/vpc/aws | 2.70.0 |
| vpc_usw2 | terraform-aws-modules/vpc/aws | 2.70.0 |

Resources

| Name | Type |
|---|---|
| aws_cloudwatch_log_group.agents_use1 | resource |
| aws_cloudwatch_log_group.agents_use2 | resource |
| aws_cloudwatch_log_group.agents_usw2 | resource |
| aws_iam_service_linked_role.ecs | resource |
| aws_s3_bucket.agent_outputs | resource |
| aws_s3_bucket.binaries | resource |
| aws_availability_zones.use1 | data source |
| aws_availability_zones.use2 | data source |
| aws_availability_zones.usw2 | data source |
| aws_caller_identity.current | data source |
| aws_region.current | data source |
| aws_ssm_parameter.ecs_optimized_ami | data source |
| template_file.user_data | data source |

Inputs

| Name | Description | Type | Default | Required |
|---|---|---|---|---|
| agent_instance_types | The instance types used in agent launch templates. | list(string) | ["c5n.large", "c5n.2xlarge", "c5n.9xlarge", "c5n.metal"] | no |
| base_domain | Base domain to use for ACM Cert and Route53 record management. | string | "" | no |
| cluster_instance_type | If test controller launch type is EC2, the instance size to use. | string | "c5ad.12xlarge" | no |
| create_certbot_lambda | Boolean to create the certbot lambda to update the letsencrypt cert for the test controller. | bool | true | no |
| create_uhs_seed_generator | Determines whether or not to create uhs seed generator resources | bool | true | no |
| environment | AWS tag to indicate environment name of each infrastructure object. | string | n/a | yes |
| lets_encrypt_email | Email to associate with let's encrypt certificate | string | n/a | yes |
| private_subnet_tags | Tags associated with private subnets | map(string) | {} | no |
| public_key | SSH public key to use in EC2 instances. | string | "" | no |
| public_subnet_tags | Tags associated with public subnets | map(string) | {} | no |
| resource_tags | Tags to set for all resources | map(string) | {} | no |
| subnet_prefix_extension | CIDR block bits extension to calculate CIDR blocks of each subnetwork. | number | 4 | no |
| test_controller_app_container_base_image | An optional custom container base image for the test controller and related services | string | "ubuntu:20.04" | no |
| test_controller_cpu | The ECS task CPU | string | "4096" | no |
| test_controller_github_access_token | Access token for cloning test controller repo | string | n/a | yes |
| test_controller_github_repo | The GitHub repo base name | string | "opencbdc-tctl" | no |
| test_controller_github_repo_branch | The repo branch to use for the Test Controller deployment pipeline. | string | "trunk" | no |
| test_controller_github_repo_owner | The GitHub repo owner | string | "mit-dci" | no |
| test_controller_golang_container_build_image | An optional custom container build image for test controller Golang dependencies | string | "golang:1.16" | no |
| test_controller_health_check_grace_period_seconds | The ECS service health check grace period in seconds | number | 300 | no |
| test_controller_launch_type | The ECS task launch type to run the test controller. | string | "FARGATE" | no |
| test_controller_memory | The ECS task memory | string | "30720" | no |
| test_controller_node_container_build_image | An optional custom container build image for test controller Nodejs dependencies | string | "node:14" | no |
| transaction_processor_main_branch | Main branch of transaction repo | string | "trunk" | no |
| transaction_processor_repo_url | Transaction repo cloned by the test controller for load generation logic | string | "https://github.com/mit-dci/opencbdc-tx.git" | no |
| uhs_seed_generator_job_memory | Memory required for a seed generator batch job | string | "8192" | no |
| uhs_seed_generator_job_vcpu | Vcpus required for a seed generator batch job | string | "4" | no |
| uhs_seed_generator_max_vcpus | Max vcpus allocatable to the seed generator environment | string | "50" | no |
| use1_main_network_block | Base CIDR block to be used in us-east-1. | string | "10.0.0.0/16" | no |
| use2_main_network_block | Base CIDR block to be used in us-east-2. | string | "10.10.0.0/16" | no |
| usw2_main_network_block | Base CIDR block to be used in us-west-2. | string | "10.20.0.0/16" | no |
| zone_offset | CIDR block bits extension offset to calculate Public subnets, avoiding collisions with Private subnets. | number | 8 | no |
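
The interaction between subnet_prefix_extension and zone_offset is easier to see with concrete numbers. The sketch below is an illustration only: it assumes private subnets are numbered from index 0 and public subnets from index zone_offset, which matches the input descriptions but is not confirmed against the module's source, and the choice of three subnets per tier is arbitrary.

```hcl
locals {
  main_network_block      = "10.0.0.0/16" # default use1_main_network_block
  subnet_prefix_extension = 4             # default: /16 + 4 bits gives /20 subnets
  zone_offset             = 8             # default

  # Assumed layout: private subnets start at index 0.
  private_subnets = [
    for i in range(3) :
    cidrsubnet(local.main_network_block, local.subnet_prefix_extension, i)
  ]

  # Assumed layout: public subnets start at index zone_offset, so the two
  # ranges cannot collide as long as there are fewer than 8 private subnets.
  public_subnets = [
    for i in range(3) :
    cidrsubnet(local.main_network_block, local.subnet_prefix_extension, local.zone_offset + i)
  ]
}

# private_subnets: 10.0.0.0/20, 10.0.16.0/20, 10.0.32.0/20
output "private_subnets" { value = local.private_subnets }

# public_subnets: 10.0.128.0/20, 10.0.144.0/20, 10.0.160.0/20
output "public_subnets" { value = local.public_subnets }
```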

Outputs

| Name | Description |
|---|---|
| azs_use1 | Availability zones used by VPC located in us-east-1 region |
| azs_use2 | Availability zones used by VPC located in us-east-2 region |
| azs_usw2 | Availability zones used by VPC located in us-west-2 region |
| ecs_cluster_id | ECS cluster id |
| ecs_cluster_name | ECS cluster name |
| private_subnets_use1 | Private subnet Ids associated with VPC in us-east-1 region |
| private_subnets_use2 | Private subnet Ids associated with VPC in us-east-2 region |
| private_subnets_usw2 | Private subnet Ids associated with VPC in us-west-2 region |
| public_subnets_use1 | Public subnet Ids associated with VPC in us-east-1 region |
| public_subnets_use2 | Public subnet Ids associated with VPC in us-east-2 region |
| public_subnets_usw2 | Public subnet Ids associated with VPC in us-west-2 region |
| route53_endpoints | Route53 endpoints generated by test controller services |
| s3_vpc_interface_endpoint_use1 | S3 service interface endpoint associated with VPC in us-east-1 region |
| s3_vpc_interface_endpoint_use2 | S3 service interface endpoint associated with VPC in us-east-2 region |
| s3_vpc_interface_endpoint_usw2 | S3 service interface endpoint associated with VPC in us-west-2 region |
| vpc_id_use1 | Id of VPC in us-east-1 region |
| vpc_id_use2 | Id of VPC in us-east-2 region |
| vpc_id_usw2 | Id of VPC in us-west-2 region |