BFD-3701: Update BFD Server load balancing to support Blue/Green Deployments #2546

Merged 26 commits on Feb 13, 2025
Commits
d72da6d
Disregard logging bucket
mjburling Feb 5, 2025
3eed4be
Updating ASG Module to add new NLB resources
malessi Feb 7, 2025
921046d
Use null_resource resources to set target group ARNs avoiding provide…
malessi Feb 11, 2025
56bd9d9
Manage ASG Warm Pool with null_resource to avoid non-zero downtime de…
malessi Feb 11, 2025
b1368bf
Update configuration and module reference values
malessi Feb 11, 2025
af8fc97
Remove unnecessary null checks
malessi Feb 11, 2025
2464d2b
Update security group rules to create per-listener/target group; fix …
malessi Feb 11, 2025
8f278d4
Remove bfd_server_lb module and module reference
malessi Feb 11, 2025
daee277
Add TODO for updating lb_alarms module
malessi Feb 11, 2025
195c73c
Update Regression Suite Lambda to target the green listener on the NLB
malessi Feb 11, 2025
65c0021
Exclude .groovy files from shellcheck pre-commit hook
malessi Feb 11, 2025
5ec1535
Remove unnecessary port variable in Launch Template
malessi Feb 11, 2025
1d38afb
Remove access logs and connection logs configuration on load balancer
malessi Feb 11, 2025
82b1fb0
Enable Load Balancer deletion protection if the environment is non-ep…
malessi Feb 11, 2025
2fb3d3e
Run terraform-docs
malessi Feb 11, 2025
7017910
Revert bfd_server_lb removal
malessi Feb 12, 2025
ec1a473
Update bfd-server-lb to be conditionally applied if env is non-epheme…
malessi Feb 12, 2025
37d7f4e
Fix set_target_groups detaching blue ASG from blue target group erron…
malessi Feb 12, 2025
45b4d6f
Fix outgoing blue remaining in blue TG when scaling-in by setting it …
malessi Feb 12, 2025
bd9345a
Fix scaled-in green ASG moving to blue during applys where no changes…
malessi Feb 13, 2025
eed5872
Ensure Launch Template version for blue ASG does not change during an…
malessi Feb 13, 2025
653e0d5
Simplify set_target_groups null_resource now that green/blue are the …
malessi Feb 13, 2025
6ae6eba
Re-enable Classic Load Balancer Alarms module with TODO indicating th…
malessi Feb 13, 2025
c2114bb
Add ingress rule for legacy CLB Security Group if CLBs are enabled
malessi Feb 13, 2025
4d7d163
Update README with an overview of the Blue/Green strategy
malessi Feb 13, 2025
3bdadbc
Simplify lb_config variable structure to remove invariant properties
malessi Feb 13, 2025
4 changes: 2 additions & 2 deletions .github/scripts/pre-commit.sh
@@ -60,9 +60,9 @@ runShellCheckForCommitFiles() {
 filename=$(basename -- "$file")
 extension="${filename##*.}"

-# Skip binary formats
+# Skip binary formats and groovy files
 case "$extension" in
-  "zip" | "p12" | "pfx" | "cer" | "pem" | "png" | "jpg")
+  "zip" | "p12" | "pfx" | "cer" | "pem" | "png" | "jpg" | "groovy")
     continue ;;
   *) ;;
 esac
5 changes: 4 additions & 1 deletion apps/utils/locust_tests/lambda/server-regression/app.py
@@ -153,6 +153,9 @@ def handler(event, context):
     cert = get_ssm_parameter(
         f"/bfd/{environment}/server/sensitive/server_regression_cert", with_decrypt=True
     )
+    green_port = get_ssm_parameter(
+        f"/bfd/{environment}/server/nonsensitive/lb_green_ingress_port"
+    )
 except ValueError as exc:
     send_pipeline_signal(
         signal_queue_url=signal_queue_url,
@@ -191,7 +194,7 @@ def handler(event, context):
 [
     "locust",
     f"--locustfile=/var/task/{invoke_event.suite_version}/{locust_file}",
-    f"--host={invoke_event.host}",
+    f"--host={invoke_event.host}:{green_port}",
     f"--users={invoke_event.users}",
     f"--spawn-rate={invoke_event.spawn_rate}",
     f"--spawned-runtime={invoke_event.spawned_runtime}",
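The change above points the regression suite at the `green` listener by appending the port read from SSM to the host. A minimal sketch of that step in isolation, assuming the host value carries no port or path of its own; `build_locust_host` is a hypothetical helper that does not appear in `app.py`:

```python
def build_locust_host(host: str, green_port: str) -> str:
    """Append the green listener port to a host string.

    Hypothetical helper for illustration only. It assumes `host` has no
    port or path component, e.g. "https://example.internal".
    """
    port = int(green_port)  # SSM parameter values arrive as strings
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return f"{host.rstrip('/')}:{port}"


if __name__ == "__main__":
    print(build_locust_host("https://example.internal", "7443"))
```

Validating the port before building the argument surfaces a malformed SSM value early instead of as a Locust connection error.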
3 changes: 2 additions & 1 deletion ops/jenkins/global-pipeline-libraries/vars/awsElb.groovy
@@ -2,7 +2,8 @@
 // awsElb.groovy contains methods that wrap awscli elb subcommands

 // Returns the Elastic Load Balancer's DNSName for the given environment
+// See ops/terraform/services/server/modules/bfd_server_asg/main.tf for NLB definition and naming scheme
 String getElbDnsName(String environment) {
-  elbDnsName = sh(returnStdout: true, script: "aws elb describe-load-balancers --load-balancer-names bfd-${environment}-fhir --query 'LoadBalancerDescriptions[0].DNSName' --output text").trim()
+  elbDnsName = sh(returnStdout: true, script: "aws elbv2 describe-load-balancers --names bfd-${environment}-fhir-nlb --query 'LoadBalancers[0].DNSName' --output text").trim()
   return elbDnsName
 }
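The switch from `aws elb` to `aws elbv2` also changes the shape of the response: Classic Load Balancers are listed under `LoadBalancerDescriptions`, while the `elbv2` response lists load balancers under `LoadBalancers`. A minimal sketch of the lookup the `--query` expression performs, run against a trimmed, hypothetical response (the DNS name below is made up):

```python
import json

# Trimmed, hypothetical `aws elbv2 describe-load-balancers` response.
SAMPLE_RESPONSE = json.dumps(
    {
        "LoadBalancers": [
            {
                "LoadBalancerName": "bfd-test-fhir-nlb",
                "DNSName": "bfd-test-fhir-nlb-0123456789abcdef.elb.us-east-1.amazonaws.com",
                "Type": "network",
            }
        ]
    }
)


def first_dns_name(raw_response: str) -> str:
    """Return the DNSName of the first load balancer in an elbv2 response."""
    load_balancers = json.loads(raw_response).get("LoadBalancers", [])
    if not load_balancers:
        raise LookupError("no load balancers in response")
    return load_balancers[0]["DNSName"]


if __name__ == "__main__":
    print(first_dns_name(SAMPLE_RESPONSE))
```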
4 changes: 2 additions & 2 deletions ops/terraform/services/base/values/ephemeral.yaml
@@ -47,8 +47,8 @@
 /bfd/${env}/server/nonsensitive/pac/claim_source_types: fiss,mcs
 /bfd/${env}/server/nonsensitive/c4dic/enabled: "false"
 /bfd/${env}/server/nonsensitive/lb_is_public: false
-/bfd/${env}/server/nonsensitive/lb_ingress_port: 443
-/bfd/${env}/server/nonsensitive/lb_egress_port: 7443
+/bfd/${env}/server/nonsensitive/lb_blue_ingress_port: 443
+/bfd/${env}/server/nonsensitive/lb_green_ingress_port: 7443
 /bfd/${env}/server/nonsensitive/launch_template_volume_iops: 3000
 /bfd/${env}/server/nonsensitive/launch_template_volume_size_gb: 60
 /bfd/${env}/server/nonsensitive/launch_template_volume_throughput: 250
4 changes: 2 additions & 2 deletions ops/terraform/services/base/values/prod-sbx.yaml
@@ -136,8 +136,8 @@
 /bfd/${env}/server/nonsensitive/heathcheck/testing_bene_id: "-88888888888888"
 /bfd/${env}/server/nonsensitive/paths/files/war: UNDEFINED
 /bfd/${env}/server/nonsensitive/lb_is_public: "true"
-/bfd/${env}/server/nonsensitive/lb_ingress_port: "443"
-/bfd/${env}/server/nonsensitive/lb_egress_port: "7443"
+/bfd/${env}/server/nonsensitive/lb_blue_ingress_port: "443"
+/bfd/${env}/server/nonsensitive/lb_green_ingress_port: "7443"
 /bfd/${env}/server/nonsensitive/lb_vpc_peerings_json: '[ "bfd-prod-sbx-to-ab2d-dev", "bfd-prod-sbx-to-ab2d-impl", "bfd-prod-sbx-to-ab2d-sbx", "bfd-prod-sbx-to-bcda-dev", "bfd-prod-sbx-to-bcda-test", "bfd-prod-sbx-to-bcda-sbx", "bfd-prod-sbx-to-bcda-opensbx", "bfd-prod-sbx-vpc-to-bluebutton-impl", "bfd-prod-sbx-vpc-to-bluebutton-test", "bfd-prod-sbx-vpc-to-dpc-prod-sbx-vpc", "bfd-prod-sbx-vpc-to-dpc-test-vpc", "bfd-prod-sbx-vpc-to-dpc-dev-vpc" ]'
 /bfd/${env}/server/nonsensitive/asg_min_instance_count: "3"
 /bfd/${env}/server/nonsensitive/asg_max_instance_count: "12"
4 changes: 2 additions & 2 deletions ops/terraform/services/base/values/prod.yaml
@@ -183,8 +183,8 @@
 /bfd/${env}/server/nonsensitive/heathcheck/testing_bene_id: "-88888888888888"
 /bfd/${env}/server/nonsensitive/paths/files/war: UNDEFINED
 /bfd/${env}/server/nonsensitive/lb_is_public: "false"
-/bfd/${env}/server/nonsensitive/lb_ingress_port: "443"
-/bfd/${env}/server/nonsensitive/lb_egress_port: "7443"
+/bfd/${env}/server/nonsensitive/lb_blue_ingress_port: "443"
+/bfd/${env}/server/nonsensitive/lb_green_ingress_port: "7443"
 /bfd/${env}/server/nonsensitive/lb_vpc_peerings_json: '[ "bfd-prod-vpc-to-dpc-prod-vpc", "bfd-prod-vpc-to-bluebutton-prod", "bfd-prod-vpc-to-bcda-prod-vpc", "bfd-prod-to-ab2d-prod" ]'
 /bfd/${env}/server/nonsensitive/asg_min_instance_count: "3"
 /bfd/${env}/server/nonsensitive/asg_max_instance_count: "12"
4 changes: 2 additions & 2 deletions ops/terraform/services/base/values/test.yaml
@@ -188,8 +188,8 @@
 /bfd/${env}/server/nonsensitive/heathcheck/testing_bene_id: "-88888888888888"
 /bfd/${env}/server/nonsensitive/paths/files/war: UNDEFINED
 /bfd/${env}/server/nonsensitive/lb_is_public: "false"
-/bfd/${env}/server/nonsensitive/lb_ingress_port: "443"
-/bfd/${env}/server/nonsensitive/lb_egress_port: "7443"
+/bfd/${env}/server/nonsensitive/lb_blue_ingress_port: "443"
+/bfd/${env}/server/nonsensitive/lb_green_ingress_port: "7443"
 /bfd/${env}/server/nonsensitive/lb_vpc_peerings_json: '[ "bfd-test-vpc-to-bluebutton-test" ]'
 /bfd/${env}/server/nonsensitive/asg_min_instance_count: "3"
 /bfd/${env}/server/nonsensitive/asg_max_instance_count: "12"
12 changes: 11 additions & 1 deletion ops/terraform/services/server/README.md
@@ -15,6 +15,16 @@ terraform apply

 **NOTE** the above double-invocation of terraform is correct. Two executions of `terraform apply` are necessary to achieve the desired state as of BFD-2558.

+## Blue/Green Workflow
+
+This Terraservice implements the logic and resources necessary to support a Blue/Green Deployment strategy for the BFD Server.
+
+Blue (`blue`) refers to the "active" or _production_ infrastructure that serves traffic to our consumers. Resources in `blue` are considered to be "known-good" resources. Green (`green`) refers to _incoming_, new infrastructure for a _new_ version of the BFD Server that needs to be verified as good before being promoted to `blue` and made available to serve traffic to our consumers.
+
+This Terraservice achieves a Blue/Green Deployment strategy by utilizing two AutoScaling Groups, two Target Groups, and two Load Balancer Listeners on ports `443` and `7443` that route to the aforementioned Target Groups. The Listener on port `443` (the default HTTPS port) is associated with the `blue` Target Group and the Listener on `7443` is associated with `green`. This way, clients using the default HTTPS port will reach the `blue` BFD Server Instances only, while our automation can reach the `green` Instances by using port `7443`.
+
+The Terraservice logic decides which AutoScaling Group is associated with the `blue`/`green` Target Group by looking at the oddness/evenness of the _latest_ Launch Template version number _iff_ the Launch Template is changing upon the `terraform apply`. Correspondingly, the ASGs are suffixed with `-odd` and `-even`. Given the latest Launch Template version number, if it is _odd_ the ASG suffixed `-odd` will be chosen as `green`, whereas if it is _even_ the ASG suffixed `-even` will be chosen as `green`. In this scenario, we expect no changes to the existing `blue` ASG nor its Target Group so that it continues to serve traffic uninterrupted.
+
 <!-- BEGIN_TF_DOCS -->
 <!-- GENERATED WITH `terraform-docs .`
 Manually updating the README.md will be overwritten.
@@ -61,13 +71,13 @@ terraform apply
 | [aws_caller_identity.current](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/data-sources/caller_identity) | data source |
 | [aws_ec2_managed_prefix_list.jenkins](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/data-sources/ec2_managed_prefix_list) | data source |
 | [aws_ec2_managed_prefix_list.vpn](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/data-sources/ec2_managed_prefix_list) | data source |
-| [aws_s3_bucket.logs](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/data-sources/s3_bucket) | data source |
 | [aws_security_group.remote](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/data-sources/security_group) | data source |
 | [aws_security_group.tools](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/data-sources/security_group) | data source |
 | [aws_security_group.vpn](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/data-sources/security_group) | data source |
 | [aws_security_groups.aurora_cluster](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/data-sources/security_groups) | data source |
 | [aws_ssm_parameters_by_path.nonsensitive_common](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/data-sources/ssm_parameters_by_path) | data source |
 | [aws_ssm_parameters_by_path.nonsensitive_service](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/data-sources/ssm_parameters_by_path) | data source |
+| [aws_ssm_parameters_by_path.sensitive_service](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/data-sources/ssm_parameters_by_path) | data source |
 | [aws_vpc.main](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/data-sources/vpc) | data source |
 | [aws_vpc.mgmt](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/data-sources/vpc) | data source |
 | [aws_vpc_peering_connection.peers](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/data-sources/vpc_peering_connection) | data source |
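The odd/even rule described in the README's Blue/Green Workflow section can be sketched as follows. This is an illustration of the rule as the README states it, not the Terraform logic from the `bfd_server_asg` module; the function names are hypothetical, and the rule only applies when the Launch Template is changing during the apply:

```python
def green_asg_suffix(latest_launch_template_version: int) -> str:
    """Pick the ASG suffix that becomes green for a new Launch Template version."""
    if latest_launch_template_version < 1:
        raise ValueError("Launch Template versions start at 1")
    return "odd" if latest_launch_template_version % 2 == 1 else "even"


def assign_roles(latest_launch_template_version: int) -> dict[str, str]:
    """Map blue/green to ASG suffixes; blue is whichever ASG is not green."""
    green = green_asg_suffix(latest_launch_template_version)
    blue = "even" if green == "odd" else "odd"
    return {"green": green, "blue": blue}


if __name__ == "__main__":
    for version in (7, 8):
        print(version, assign_roles(version))
```

Because consecutive versions alternate parity, each new Launch Template version lands on the ASG that is not currently serving traffic, which is why the existing `blue` ASG can stay untouched.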
10 changes: 5 additions & 5 deletions ops/terraform/services/server/data-sources.tf
@@ -47,11 +47,6 @@ data "aws_ami" "main" {
   }
 }

-# s3 buckets
-data "aws_s3_bucket" "logs" {
-  bucket = "bfd-${local.env}-logs-${data.aws_caller_identity.current.account_id}"
-}
-
 # aurora security group
 data "aws_security_groups" "aurora_cluster" {
   filter {
@@ -114,3 +109,8 @@ data "aws_ssm_parameters_by_path" "nonsensitive_common" {
 data "aws_ssm_parameters_by_path" "nonsensitive_service" {
   path = "/bfd/${local.env}/${local.service}/nonsensitive"
 }
+
+data "aws_ssm_parameters_by_path" "sensitive_service" {
+  path = "/bfd/${local.env}/${local.service}/sensitive"
+  with_decryption = true
+}
60 changes: 49 additions & 11 deletions ops/terraform/services/server/main.tf
@@ -34,6 +34,15 @@ locals {
     for key, value in local.nonsensitive_service_map
     : split("/", key)[5] => value
   }
+  sensitive_service_map = zipmap(
+    data.aws_ssm_parameters_by_path.sensitive_service.names,
+    nonsensitive(data.aws_ssm_parameters_by_path.sensitive_service.values)
+  )
+  sensitive_service_config = {
+    for key, value in local.sensitive_service_map
+    : split("/", key)[5] => value
+  }
+

   enterprise_tools_security_group = local.nonsensitive_common_config["enterprise_tools_security_group"]
   management_security_group = local.nonsensitive_common_config["management_security_group"]
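The `zipmap` plus `split("/", key)[5]` pattern above turns full SSM parameter names into short config keys. A minimal Python sketch of the same transformation, using hypothetical parameter names and values for a `test` environment:

```python
def service_config(names: list[str], values: list[str]) -> dict[str, str]:
    """Mirror zipmap(names, values) followed by keying on split("/", name)[5]."""
    full_map = dict(zip(names, values))  # equivalent of Terraform's zipmap
    # "/bfd/test/server/sensitive/service_port".split("/") yields
    # ["", "bfd", "test", "server", "sensitive", "service_port"], so index 5
    # is the first path segment after the sensitivity level.
    return {name.split("/")[5]: value for name, value in full_map.items()}


if __name__ == "__main__":
    # Hypothetical names and values for illustration only.
    names = [
        "/bfd/test/server/nonsensitive/lb_blue_ingress_port",
        "/bfd/test/server/nonsensitive/lb_green_ingress_port",
    ]
    values = ["443", "7443"]
    print(service_config(names, values))
```

Note that index 5 only captures one path segment, so a deeper parameter such as `/bfd/test/server/nonsensitive/pac/claim_source_types` would be keyed as `pac` under this scheme.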
@@ -43,10 +52,10 @@ locals {
   ssh_key_pair = local.nonsensitive_common_config["key_pair"]
   vpc_name = local.nonsensitive_common_config["vpc_name"]

-  lb_is_public = local.nonsensitive_service_config["lb_is_public"]
-  lb_ingress_port = local.nonsensitive_service_config["lb_ingress_port"]
-  lb_egress_port = local.nonsensitive_service_config["lb_egress_port"]
-  lb_vpc_peerings = jsondecode(local.nonsensitive_service_config["lb_vpc_peerings_json"])
+  lb_is_public = tobool(local.nonsensitive_service_config["lb_is_public"])
+  lb_blue_ingress_port = local.nonsensitive_service_config["lb_blue_ingress_port"]
+  lb_green_ingress_port = local.nonsensitive_service_config["lb_green_ingress_port"]
+  lb_vpc_peerings = jsondecode(local.nonsensitive_service_config["lb_vpc_peerings_json"])

   asg_min_instance_count = local.nonsensitive_service_config["asg_min_instance_count"]
   asg_max_instance_count = local.nonsensitive_service_config["asg_max_instance_count"]
@@ -60,6 +69,8 @@ locals {
   launch_template_volume_throughput = local.nonsensitive_service_config["launch_template_volume_throughput"]
   launch_template_volume_type = local.nonsensitive_service_config["launch_template_volume_type"]

+  service_port = local.sensitive_service_config["service_port"]
+
   env_config = {
     default_tags = local.default_tags,
     vpc_id = data.aws_vpc.main.id,
@@ -92,40 +103,49 @@ module "fhir_iam" {

 ## NLB for the FHIR server (SSL terminated by the FHIR server)
 #
+# TODO: Remove bfd_server_lb module in BFD-3878
+# TODO: Remove below code in BFD-3878
 module "fhir_lb" {
+  count = !local.is_ephemeral_env ? 1 : 0
   source = "./modules/bfd_server_lb"

   env_config = local.env_config
   role = local.legacy_service
   layer = "dmz"
-  log_bucket = data.aws_s3_bucket.logs.id
   is_public = local.lb_is_public

   ingress = local.lb_is_public ? {
     description = "Public Internet access"
-    port = local.lb_ingress_port
+    port = local.lb_blue_ingress_port
     cidr_blocks = ["0.0.0.0/0"]
     prefix_list_ids = []
   } : {
     description = "From VPN, VPC peerings, the MGMT VPC, and self"
-    port = local.lb_ingress_port
+    port = local.lb_blue_ingress_port
     cidr_blocks = concat(data.aws_vpc_peering_connection.peers[*].peer_cidr_block, [data.aws_vpc.mgmt.cidr_block, data.aws_vpc.main.cidr_block])
     prefix_list_ids = [data.aws_ec2_managed_prefix_list.vpn.id, data.aws_ec2_managed_prefix_list.jenkins.id]
   }

   egress = {
     description = "To VPC instances"
-    port = local.lb_egress_port
+    port = local.service_port
     cidr_blocks = [data.aws_vpc.main.cidr_block]
   }
 }

+moved {
+  from = module.fhir_lb
+  to = module.fhir_lb[0]
+}
+# TODO: Remove above code in BFD-3878
+
+# TODO: Update this module with new NLB metrics in BFD-3885
 module "lb_alarms" {
   count = local.create_server_lb_alarms ? 1 : 0

   source = "./modules/bfd_server_lb_alarms"

-  load_balancer_name = module.fhir_lb.name
+  load_balancer_name = one(module.fhir_lb[*].legacy_clb_name)
   app = "bfd"

   # NLBs only have this metric to alarm on
@@ -136,7 +156,6 @@ module "lb_alarms" {
   }
 }

-
 ## Autoscale group for the FHIR server
 #
 module "fhir_asg" {
@@ -146,9 +165,13 @@ module "fhir_asg" {
   env_config = local.env_config
   role = local.legacy_service
   layer = "app"
-  lb_config = module.fhir_lb.lb_config
   seed_env = local.seed_env

+  # TODO: Remove below code in BFD-3878
+  legacy_clb_name = one(module.fhir_lb[*].legacy_clb_name)
+  legacy_sg_id = one(module.fhir_lb[*].legacy_sg_id)
+  # TODO: Remove above code in BFD-3878
+
   # Initial size is one server per AZ
   asg_config = {
     min = local.asg_min_instance_count
@@ -186,6 +209,21 @@ module "fhir_asg" {
     remote_sg = data.aws_security_group.remote.id
     ci_cidrs = [data.aws_vpc.mgmt.cidr_block]
   }
+
+  lb_config = {
+    is_public = local.lb_is_public
+    enable_deletion_protection = !local.is_ephemeral_env
+    ingress = {
+      blue_port = local.lb_blue_ingress_port
+      green_port = local.lb_green_ingress_port
+      cidr_blocks = !local.lb_is_public ? concat(data.aws_vpc_peering_connection.peers[*].peer_cidr_block, [data.aws_vpc.mgmt.cidr_block, data.aws_vpc.main.cidr_block]) : ["0.0.0.0/0"]
+      prefix_list_ids = !local.lb_is_public ? [data.aws_ec2_managed_prefix_list.vpn.id, data.aws_ec2_managed_prefix_list.jenkins.id] : []
+    }
+    egress = {
+      cidr_blocks = [data.aws_vpc.main.cidr_block]
+    }
+    server_listen_port = local.service_port
+  }
 }

 ## FHIR server logs
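The `lb_config.ingress` block above selects its allowed sources from `lb_is_public`: a public load balancer accepts traffic from anywhere, while a private one is limited to the peered VPCs, the MGMT VPC, the main VPC, and the VPN and Jenkins prefix lists. A minimal sketch of that conditional, with hypothetical CIDRs and prefix list IDs standing in for the Terraform data sources:

```python
def ingress_sources(
    is_public: bool,
    peer_cidrs: list[str],
    mgmt_cidr: str,
    main_cidr: str,
    vpn_prefix_list_id: str,
    jenkins_prefix_list_id: str,
) -> dict[str, list[str]]:
    """Return the cidr_blocks and prefix_list_ids the two ternaries evaluate to."""
    if is_public:
        # Public NLB: open to the internet, no prefix lists needed.
        return {"cidr_blocks": ["0.0.0.0/0"], "prefix_list_ids": []}
    # Private NLB: peered VPCs first, then the MGMT and main VPC ranges.
    return {
        "cidr_blocks": [*peer_cidrs, mgmt_cidr, main_cidr],
        "prefix_list_ids": [vpn_prefix_list_id, jenkins_prefix_list_id],
    }


if __name__ == "__main__":
    # All values below are made up for illustration.
    print(ingress_sources(False, ["10.1.0.0/16"], "10.2.0.0/16", "10.3.0.0/16", "pl-vpn", "pl-jenkins"))
```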