This repository provides IaC (Infrastructure as Code) to replicate the environment used to produce the results of the research paper, and it can serve as a starting point if you want to reproduce them. All the necessary resources are created in Amazon Web Services (AWS) cloud infrastructure via Terraform. The Terraform configuration is wrapped into a single module that leverages a number of sub-modules. The root module primarily deploys the OpenCBDC test controller along with numerous supporting resources. You can follow the steps of this README in order to deploy the test controller. If you are new to Terraform, when you reach Provision, it is recommended that you use the pre-created configuration linked there as the entrypoint for your deployment.
This module will deploy the test controller via an AWS ECS task.
The ECS service can be configured to use either EC2 instances or Fargate.
The main function of the test controller is to schedule agent processes across one to three regions for testing Project Hamilton's architectures.
Agent processes are scheduled on AWS EC2 instances that are provisioned via EC2 launch templates.
The test controller is configured to provision in the `us-east-1` region.
A subset of resources is replicated in the `us-east-2` and `us-west-2` regions in order to schedule multi-regional test runs.
A VPC is provisioned in each of these three regions along with VPC peering connections and VPC endpoints for internal communication between resources.
A pipeline is set up via AWS CodePipeline, which clones the test controller's source code, then builds and pushes several artifacts.
These artifacts are a container image for the test controller, a container image used to seed the environment with data for test runs, and the binary used to schedule agents during test runs.
Both of the container images are pushed to AWS ECR registries, and the agent binary is pushed to an S3 bucket.
Seeding initial outputs is handled via an AWS Batch job that the test controller schedules before a test run when necessary.
An AWS Batch compute environment, job definition, and job queue are all provisioned by default to support this.
Upon being scheduled, agent instances pull the agent binary from S3, then execute it to communicate with the test controller and receive instructions.
This process for the agents is defined in their EC2 launch template.
Two AWS Network Load Balancers are deployed by the module.
One forwards traffic to the test controller's UI; the other supports communication between agents and the test controller.
A bastion host is provided for troubleshooting the environment as well as pulling down raw test data if you wish to gather your own insights.
To access the bastion host, you can either use ssh, which is configured by this module, or AWS Session Manager.
The module requires that you have Terraform installed.
Specifics about versioning are listed here.
Also useful, but not strictly necessary, is the AWS CLI.
If you have other Terraform projects with different version requirements, you can manage them with tfenv.
This project is pre-configured to pull the proper Terraform version via tfenv.
Simply run `tfenv install`.
Docker must be installed and running on your local machine.
You won't need to run any Docker commands, just be sure that it's running.
If you're unfamiliar with Docker and curious, you can take a look at their getting started page.
This module requires you provide an ssh public key which will be used to generate an Amazon EC2 key pair.
AWS can use either ED25519 or 2048-bit SSH-2 RSA keys.
There are a number of third-party tools that can be used to generate an appropriate key pair.
One way is via the `ssh-keygen` command provided by OpenSSH.
$ ssh-keygen -t rsa -b 2048 -f /path/to/key/file/id_rsa
Installation for OpenSSH will depend on the OS of your machine.
- On macOS, OpenSSH should be installed by default.
- On Windows, you may need to follow additional steps.
- On Ubuntu/Debian/Linux Mint:
$ sudo apt-get install openssh-client
- On RHEL/CentOS/Fedora:
$ sudo yum -y install openssh-clients
After doing so, provide the contents of the public key file (`id_rsa.pub`) to the module's `public_key` var.
The ssh private key should remain private.
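One way to pass the key without pasting it into a file is an environment variable: Terraform reads any variable named `TF_VAR_<name>` as the value of the input `<name>`. The sketch below assumes that the configuration you apply declares a root-level `public_key` variable; the helper name `export_public_key` is made up for illustration.

```shell
# Hypothetical helper: load an SSH public key into TF_VAR_public_key so that
# Terraform picks it up as the value of the public_key var.
# Usage: export_public_key /path/to/key/file/id_rsa.pub
export_public_key() {
  key_file="$1"
  if [ ! -r "$key_file" ]; then
    echo "cannot read public key: $key_file" >&2
    return 1
  fi
  # Only the public half of the key pair is ever read here.
  TF_VAR_public_key="$(cat "$key_file")"
  export TF_VAR_public_key
}
```

Run `export_public_key /path/to/key/file/id_rsa.pub` in the same shell session that you later use for `terraform apply`.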
New Domain - Currently, the test controller requires that you own a domain with a registrar and a hosted zone configured in Route53.
The name of the hosted zone should be set as the `base_domain` var, and the necessary DNS records will be created by this Terraform module.
If you don't currently own a domain, you can purchase one via the Route53 registrar; doing so creates a hosted zone in Route53 automatically.
This is our recommended approach.
BYO Domain - If you already own a domain that you wish to use, you can do so; however, you'll still need to create a hosted zone in Route53.
The module output `route53_endpoints.name_servers` will provide a list of name servers associated with the hosted zone.
Use these to delegate DNS resolution for the domain to Route53.
Usually this is done by creating an NS record wherever the base domain is hosted.
For BYO domains, we recommend using a sub-domain (`test.foo.com`) as `base_domain` and delegating name server resolution for that sub-domain to Route53, rather than using the apex domain (`foo.com`).
This module will create several certificates in AWS Certificate Manager which use DNS for validation.
Be sure that your base domain is updated before you run `terraform apply`, or else the certificates will fail to validate.
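Before applying, it can help to confirm that public DNS already returns the Route53 name servers for the base domain. The following is a minimal sketch that assumes `dig` is installed; the helper name `check_delegation` is hypothetical, and the name servers you pass in are the ones Route53 lists for your hosted zone.

```shell
# Hypothetical helper: succeed only if every expected name server appears in
# the NS records that public DNS returns for the base domain.
# Usage: check_delegation <base_domain> <name_server>...
check_delegation() {
  domain="$1"
  shift
  # dig prints one NS target per line, each with a trailing dot.
  actual="$(dig +short NS "$domain" | sed 's/\.$//' | tr 'A-Z' 'a-z')"
  for ns in "$@"; do
    expected="$(printf '%s' "$ns" | sed 's/\.$//' | tr 'A-Z' 'a-z')"
    if ! printf '%s\n' "$actual" | grep -qxF "$expected"; then
      echo "missing NS record: $expected" >&2
      return 1
    fi
  done
}
```

For example, `check_delegation test.foo.com <ns1> <ns2> <ns3> <ns4>` exits non-zero and names the first server that is not yet visible. DNS changes can take a while to propagate, so a failure shortly after updating the records is not necessarily an error.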
Once deployed, this module will create a pipeline in AWS CodePipeline, which builds and pushes several container images related to the test controller.
To do this, CodePipeline will clone the test controller codebase.
CodePipeline must be connected to a GitHub account to clone from a GitHub repo.
A personal access token should be passed to CodePipeline for authentication.
The token should only need the `public_repo` permission.
After creating the token, you can provide it to the module via the `test_controller_github_access_token` var.
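Because the token is a secret, you may prefer not to write it into a file or leave it in your shell history. A minimal sketch, assuming the configuration you apply declares a root-level `test_controller_github_access_token` variable; the helper name `read_github_token` is hypothetical.

```shell
# Hypothetical helper: read a token from standard input and export it as
# TF_VAR_test_controller_github_access_token for Terraform to pick up.
read_github_token() {
  # Turn off terminal echo only when typing interactively.
  if [ -t 0 ]; then stty -echo; fi
  printf 'GitHub access token: ' >&2
  status=0
  IFS= read -r token || status=1
  if [ -t 0 ]; then stty echo; fi
  printf '\n' >&2
  if [ "$status" -ne 0 ] || [ -z "$token" ]; then
    echo "no token provided" >&2
    return 1
  fi
  TF_VAR_test_controller_github_access_token="$token"
  export TF_VAR_test_controller_github_access_token
  unset token
}
```

Call `read_github_token` in the shell session that you use for `terraform apply` and paste the token at the prompt.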
Terraform will require permission to access multiple services in AWS. Permissions in AWS are managed via the IAM service. Generally speaking, you want to provide the smallest set of permissions possible to a role. This is known as the Principle of Least Privilege. Since Terraform will be interacting with such a wide array of services to deploy the test controller, for simplicity you can grant `AdministratorAccess`. This can be attached to an IAM user that Terraform authenticates as. If you'd like to restrict Terraform's access more tightly, however, you certainly can.
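Before running Terraform, it is worth confirming which identity your local credentials resolve to. A minimal sketch using the AWS CLI; the helper name `show_aws_identity` is made up for illustration.

```shell
# Hypothetical helper: print the account ID and ARN behind the current AWS
# credentials, or fail if the credentials are missing or invalid.
show_aws_identity() {
  aws sts get-caller-identity \
    --query '[Account, Arn]' \
    --output text
}
```

If `show_aws_identity` prints the account and IAM user you expect, Terraform's AWS provider should pick up the same credentials, assuming it is configured to use the default credential chain.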
This repo contains Terraform configuration mirroring that of the research paper here. This is intended to serve as your main entrypoint for your deployment. Deployment instructions are located here. If you want to configure the environment for your own tests this module provides a number of inputs for doing so.
The test controller requires an SSL certificate to allow for client connections via HTTPS. This module will provision a Lambda capable of generating an appropriate cert issued via Let's Encrypt. The lambda is configured to run every twelve hours to check that the cert has not yet expired. If you wish to run tests in your environment immediately after provisioning, you will need to invoke the lambda yourself. You can do this via the AWS CLI. Using the credentials you configured for your environment, run:
$ aws lambda invoke --region us-east-1 --function-name test-controller-certbot-lambda /dev/stdout
Note - The lambda usually takes a few minutes to complete its execution.
Note - The lambda will create a certificate in AWS Certificate Manager.
This is not tied to the Terraform automation, so you will need to delete it manually after running `terraform destroy`.
You should delete it only after you've destroyed everything else.
To do so, simply select the certificate with the test controller domain name `test-controller.<base_domain>` and hit "delete".
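The same cleanup can be sketched with the AWS CLI instead of the console. This assumes the certificate lives in `us-east-1`, where the test controller is provisioned; the helper names are hypothetical.

```shell
# Hypothetical helper: print the ARN(s) of ACM certificates whose domain name
# is test-controller.<base_domain>.
# Usage: find_test_controller_cert <base_domain>
find_test_controller_cert() {
  base_domain="$1"
  aws acm list-certificates \
    --region us-east-1 \
    --query "CertificateSummaryList[?DomainName=='test-controller.${base_domain}'].CertificateArn" \
    --output text
}

# Hypothetical helper: delete a single certificate by ARN.
# Usage: delete_cert <certificate_arn>
delete_cert() {
  aws acm delete-certificate --region us-east-1 --certificate-arn "$1"
}
```

Review the ARNs printed by `find_test_controller_cert test.foo.com` before passing one to `delete_cert`, since the module also creates other certificates in AWS Certificate Manager.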
The test controller pipeline should run automatically.
All pipeline phases must succeed before you can run any tests.
You can verify this by checking the most recent execution status of `test-controller-pipeline` in the AWS CodePipeline service.
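If you prefer the command line, the per-stage status can be read with the AWS CLI. This is a sketch that assumes the pipeline runs in `us-east-1`; both helper names are made up for illustration.

```shell
# Hypothetical helper: print "<stage name> <latest status>" for each stage.
pipeline_stage_status() {
  aws codepipeline get-pipeline-state \
    --region us-east-1 \
    --name test-controller-pipeline \
    --query 'stageStates[].[stageName, latestExecution.status]' \
    --output text
}

# Hypothetical helper: read the lines printed above from standard input and
# succeed only if there is at least one stage and every stage is Succeeded.
all_stages_succeeded() {
  awk 'NF > 0 { n++; if ($NF != "Succeeded") bad = 1 }
       END { exit !(n > 0 && !bad) }'
}
```

For example, `pipeline_stage_status | all_stages_succeeded && echo ready` prints `ready` only once every stage has succeeded.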
CodePipeline will poll for the latest changes to the test controller repo. This way you will receive updates automatically without any manual intervention. Occasionally, CodePipeline may fail during the deployment process. These are usually transient errors that resolve by simply running the pipeline again. Using the credentials you configured for your environment, run:
$ aws codepipeline start-pipeline-execution --region us-east-1 --name test-controller-pipeline
Both the test controller's UI and API exist inside of a single ECS task.
The task must be running and healthy before you can schedule test runs in your environment.
Three sets of target groups are configured against the task: one as an entrypoint for agents, one for authentication, and one for the test controller's UI.
The task will be scheduled under the `test-controller` service, which belongs to a cluster with the same name as whatever the Terraform var `environment` is set to.
It's easiest to verify these in the AWS console.
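If you would rather check from the command line, the service's task counts can be read with the AWS CLI. A minimal sketch, assuming the cluster is in `us-east-1`; the helper name `service_task_counts` is hypothetical.

```shell
# Hypothetical helper: print the status, desired task count, and running task
# count of the test-controller service.
# Usage: service_task_counts <environment>
service_task_counts() {
  cluster="$1"
  aws ecs describe-services \
    --region us-east-1 \
    --cluster "$cluster" \
    --services test-controller \
    --query 'services[0].[status, desiredCount, runningCount]' \
    --output text
}
```

Pass the value of your `environment` var as the cluster name. A running count equal to the desired count suggests the task is up, but it does not replace checking target group health.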
When the environment is healthy, the `test-controller` service shows its task as running and each target group reports healthy targets.
The module will generate some DNS records in AWS Route53 for you.
A CNAME record is created in Route53 which will point to the UI load balancer.
The format of this will be `test-controller.<base_domain>`.
The `environment` and `base_domain` values will be set to whatever you configured for the corresponding Terraform vars.
Assuming your environment is up and configured properly, you should be able to access the UI by entering the URL into any browser.
In a fresh environment, you will need to add a client certificate into the environment in order to authenticate with the test controller.
The process for this is documented in the test controller's README.
Note - This module configures port 8443 to route to the auth endpoint via the network load balancer.
This means the port must be specified in the URL you enter into the browser: `https://test-controller.<base_domain>:8443/auth`.
The appropriate record is also provided as the output `route53_endpoints.ui_endpoint`.
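The URL formats described above can be captured in a couple of small helpers, together with a DNS lookup to confirm that the CNAME record exists. This is a sketch; the helper names are hypothetical and the lookup assumes `dig` is installed.

```shell
# Hypothetical helpers: build the UI and auth URLs for a given base domain.
ui_url() {
  printf 'https://test-controller.%s\n' "$1"
}

auth_url() {
  printf 'https://test-controller.%s:8443/auth\n' "$1"
}

# Hypothetical helper: print the CNAME target of the UI record, which should
# be the DNS name of the UI load balancer.
ui_cname_target() {
  dig +short CNAME "test-controller.$1"
}
```

For example, `auth_url test.foo.com` prints `https://test-controller.test.foo.com:8443/auth`. If `ui_cname_target test.foo.com` prints nothing, the record has not been created or has not yet propagated.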
Some plots shown in the paper require a great deal of compute power to reproduce. The default quotas for EC2 instances set on AWS accounts will likely be insufficient in some cases. The test controller schedules instances using the vCPUs available according to the Service Quotas API, meaning it will run what it can instead of reporting errors. To reproduce entire plots, you will need to request limit increases on several EC2 service quotas. Specifically:
Quota Name | us-east-1 | us-east-2 | us-west-2 |
---|---|---|---|
All Standard (A, C, D, H, I, M, R, T, Z) Spot Instance Requests | 32,000 | 32,000 | 32,000 |
Running On-Demand Standard (A, C, D, H, I, M, R, T, Z) instances | 32,000 | 32,000 | 32,000 |
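The current values and the increase requests can also be handled through the AWS CLI. The sketch below looks each quota up by name in all three regions rather than hard-coding quota codes; the helper names and the `REGIONS` variable are made up for illustration.

```shell
# Regions used by this module.
REGIONS="us-east-1 us-east-2 us-west-2"

# Hypothetical helper: for each region, print the region followed by the quota
# code and current value of the EC2 quota with the given name.
# Usage: show_quota "<quota name>"
show_quota() {
  quota_name="$1"
  for region in $REGIONS; do
    printf '%s\t' "$region"
    aws service-quotas list-service-quotas \
      --service-code ec2 \
      --region "$region" \
      --query "Quotas[?QuotaName=='${quota_name}'].[QuotaCode, Value]" \
      --output text
  done
}

# Hypothetical helper: request a new value for one quota in one region.
# Usage: request_quota_increase <quota_code> <desired_value> <region>
request_quota_increase() {
  aws service-quotas request-service-quota-increase \
    --service-code ec2 \
    --quota-code "$1" \
    --desired-value "$2" \
    --region "$3"
}
```

For example, `show_quota "Running On-Demand Standard (A, C, D, H, I, M, R, T, Z) instances"` prints the quota code to pass to `request_quota_increase <quota_code> 32000 us-east-1`. Increase requests are reviewed by AWS and may take time to be approved.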
Name | Version |
---|---|
terraform | = 0.13.4 |
aws | ~> 3.0 |
Name | Version |
---|---|
aws | ~> 3.0 |
aws.use2 | ~> 3.0 |
aws.usw2 | ~> 3.0 |
template | n/a |
Name | Source | Version |
---|---|---|
bastion | ./modules/bastion | n/a |
ec2_profile | terraform-aws-modules/ecs/aws//modules/ecs-instance-profile | 3.0.0 |
ecs | terraform-aws-modules/ecs/aws | 3.0.0 |
ecs_cluster_asg | terraform-aws-modules/autoscaling/aws | 3.9.0 |
ecs_cluster_security_group | terraform-aws-modules/security-group/aws | 3.1.0 |
route53_dns | ./modules/route53_dns | n/a |
test_controller_agent_use1 | ./modules/test-controller-agent | n/a |
test_controller_agent_use2 | ./modules/test-controller-agent | n/a |
test_controller_agent_usw2 | ./modules/test-controller-agent | n/a |
test_controller_deploy | ./modules/test-controller-deploy | n/a |
test_controller_service | ./modules/test-controller | n/a |
uhs_seed_generator | ./modules/uhs-seed-generator | n/a |
vpc | terraform-aws-modules/vpc/aws | 2.70.0 |
vpc_endpoints_use1 | ./modules/vpc-endpoints | n/a |
vpc_endpoints_use2 | ./modules/vpc-endpoints | n/a |
vpc_endpoints_usw2 | ./modules/vpc-endpoints | n/a |
vpc_peering_connection_use1_use2 | ./modules/vpc-peering-connection | n/a |
vpc_peering_connection_use1_usw2 | ./modules/vpc-peering-connection | n/a |
vpc_peering_connection_use2_usw2 | ./modules/vpc-peering-connection | n/a |
vpc_use2 | terraform-aws-modules/vpc/aws | 2.70.0 |
vpc_usw2 | terraform-aws-modules/vpc/aws | 2.70.0 |
Name | Type |
---|---|
aws_cloudwatch_log_group.agents_use1 | resource |
aws_cloudwatch_log_group.agents_use2 | resource |
aws_cloudwatch_log_group.agents_usw2 | resource |
aws_iam_service_linked_role.ecs | resource |
aws_s3_bucket.agent_outputs | resource |
aws_s3_bucket.binaries | resource |
aws_availability_zones.use1 | data source |
aws_availability_zones.use2 | data source |
aws_availability_zones.usw2 | data source |
aws_caller_identity.current | data source |
aws_region.current | data source |
aws_ssm_parameter.ecs_optimized_ami | data source |
template_file.user_data | data source |
Name | Description | Type | Default | Required |
---|---|---|---|---|
agent_instance_types | The instance types used in agent launch templates. | list(string) | [ | no |
base_domain | Base domain to use for ACM Cert and Route53 record management. | string | "" | no |
cluster_instance_type | If test controller launch type is EC2, the instance size to use. | string | "c5ad.12xlarge" | no |
create_certbot_lambda | Boolean to create the certbot lambda to update the letsencrypt cert for the test controller. | bool | true | no |
create_uhs_seed_generator | Determines whether or not to create uhs seed generator resources | bool | true | no |
environment | AWS tag to indicate environment name of each infrastructure object. | string | n/a | yes |
lets_encrypt_email | Email to associate with let's encrypt certificate | string | n/a | yes |
private_subnet_tags | Tags associated with private subnets | map(string) | {} | no |
public_key | SSH public key to use in EC2 instances. | string | "" | no |
public_subnet_tags | Tags associated with public subnets | map(string) | {} | no |
resource_tags | Tags to set for all resources | map(string) | {} | no |
subnet_prefix_extension | CIDR block bits extension to calculate CIDR blocks of each subnetwork. | number | 4 | no |
test_controller_app_container_base_image | An optional custom container base image for the test controller and related services | string | "ubuntu:20.04" | no |
test_controller_cpu | The ECS task CPU | string | "4096" | no |
test_controller_github_access_token | Access token for cloning test controller repo | string | n/a | yes |
test_controller_github_repo | The Github repo base name | string | "opencbdc-tctl" | no |
test_controller_github_repo_branch | The repo branch to use for the Test Controller deployment pipeline. | string | "trunk" | no |
test_controller_github_repo_owner | The Github repo owner | string | "mit-dci" | no |
test_controller_golang_container_build_image | An optional custom container build image for test controller Golang dependencies | string | "golang:1.16" | no |
test_controller_health_check_grace_period_seconds | The ECS service health check grace period in seconds | number | 300 | no |
test_controller_launch_type | The ECS task launch type to run the test controller. | string | "FARGATE" | no |
test_controller_memory | The ECS task memory | string | "30720" | no |
test_controller_node_container_build_image | An optional custom container build image for test controller Nodejs dependencies | string | "node:14" | no |
transaction_processor_main_branch | Main branch of transaction repo | string | "trunk" | no |
transaction_processor_repo_url | Transaction repo cloned by the test controller for load generation logic | string | "https://github.com/mit-dci/opencbdc-tx.git" | no |
uhs_seed_generator_job_memory | Memory required for a seed generator batch job | string | "8192" | no |
uhs_seed_generator_job_vcpu | Vcpus required for a seed generator batch job | string | "4" | no |
uhs_seed_generator_max_vcpus | Max vcpus allocatable to the seed generator environment | string | "50" | no |
use1_main_network_block | Base CIDR block to be used in us-east-1. | string | "10.0.0.0/16" | no |
use2_main_network_block | Base CIDR block to be used in us-east-2. | string | "10.10.0.0/16" | no |
usw2_main_network_block | Base CIDR block to be used in us-west-2. | string | "10.20.0.0/16" | no |
zone_offset | CIDR block bits extension offset to calculate Public subnets, avoiding collisions with Private subnets. | number | 8 | no |
Name | Description |
---|---|
azs_use1 | Availability zones used by VPC located in us-east-1 region |
azs_use2 | Availability zones used by VPC located in us-east-2 region |
azs_usw2 | Availability zones used by VPC located in us-west-2 region |
ecs_cluster_id | ECS cluster id |
ecs_cluster_name | ECS cluster name |
private_subnets_use1 | Private subnet Ids associated with VPC in us-east-1 region |
private_subnets_use2 | Private subnet Ids associated with VPC in us-east-2 region |
private_subnets_usw2 | Private subnet Ids associated with VPC in us-west-2 region |
public_subnets_use1 | Public subnet Ids associated with VPC in us-east-1 region |
public_subnets_use2 | Public subnet Ids associated with VPC in us-east-2 region |
public_subnets_usw2 | Public subnet Ids associated with VPC in us-west-2 region |
route53_endpoints | Route53 endpoints generated by test controller services |
s3_vpc_interface_endpoint_use1 | S3 service interface endpoint associated with VPC in us-east-1 region |
s3_vpc_interface_endpoint_use2 | S3 service interface endpoint associated with VPC in us-east-2 region |
s3_vpc_interface_endpoint_usw2 | S3 service interface endpoint associated with VPC in us-west-2 region |
vpc_id_use1 | Id of VPC in us-east-1 region |
vpc_id_use2 | Id of VPC in us-east-2 region |
vpc_id_usw2 | Id of VPC in us-west-2 region |