Merge pull request #83 from Cameronsplaze/optimization/combine-hosted-zones

[optimization/combine hosted zones] Moved HostedZone from Leaf Stack, to Base Stack
Cameronsplaze authored Dec 8, 2024
2 parents 07838e2 + fdc4539 commit 172ce48
Showing 17 changed files with 310 additions and 254 deletions.
13 changes: 6 additions & 7 deletions ContainerManager/README.md
@@ -2,8 +2,10 @@

This is designed so you only need one base stack that you deploy first, then you can deploy any number of leaf stacks on it. This lets you modify one leaf/container stack, without affecting the rest, and still have shared resources to reduce cost/complexity where appropriate.

**Note**: The word `Stack` is overloaded here. Both the "base" and "leaf" stacks each contain two stacks inside them (in different regions). There's just no better word; "app" already refers to the entire project.

- The [./leaf_stack](./leaf_stack/README.md) is what runs a single container. One `leaf_stack` for one container.
- The [base_stack.py](./base_stack.py) is common architecture that different containers can share (i.e. VPC). Multiple "Leaf Stacks" can point to the same "Base Stack".
- The [./base_stack](./base_stack/README.md) is common architecture that different containers can share (i.e. VPC, HostedZone). Multiple "Leaf Stacks" can point to the same "Base Stack".
- The [./utils](./utils/README.md) are functions that don't fit in the other two. Mainly config readers/parsers.

Click here to jump to '[Base Stack Config Options](#base-stack-config-options)'. It's the last section, since it's the longest.
@@ -17,16 +19,13 @@ Click here to jump to '[Base Stack Config Options](#base-stack-config-options)'.

The system is designed all around the Auto Scaling Group (ASG). This way, if the ASG spins up in any way (a DNS query comes in, or you change the desired_count in the console), everything spins up around it. If an alarm triggers, it just has to spin the ASG back down and everything will naturally follow suit.

See the [leaf_stack README.md](./leaf_stack/README.md) for more info.
See the [leaf_stack's README.md](./leaf_stack/README.md) for more info.

## Base Stack Summary

The [base stack](./base_stack.py) is the common architecture that different containers can share. Most notably:
The [base stack](./base_stack/README.md) is broken into two components (or "stacks"). One *must* be in us-east-1 for Route53, and the other has to be in the same region you want to run the containers in.

- **VPC**: The overall network for all the containers and EFS. We used a public VPC, because a private one costs ~$32/month per subnet (because of the NAT). WITH EC2 costs, I want to shoot for less than $100/year with solid usage.
- **SSH Key Pair**: The key pair to SSH into the EC2 instances. Keeping it here lets you get into all the leaf_stacks without having to log into AWS each time you deploy a new leaf. If you destroy and re-build the leaf, this keeps the key consistent too.
- **SNS Notify Logic**: Designed for things an admin would care about. This tells you whenever the instance spins up or down, if it runs into errors, etc.
- **Route53**: The base domain name for all stacks to build from.
Anything that can be here instead of the leaf stacks, should be.
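
To make the two-region split concrete, here is a minimal sketch of how an app could wire the two halves together. The config shape, region choice, and construct IDs are illustrative assumptions, not the project's actual `app.py`:

```python
from aws_cdk import App, Environment
from ContainerManager.base_stack import BaseStackMain, BaseStackDomain

app = App()
# Assumed config shape, normally built by the ./utils parsers:
config = {
    "Domain": {"Name": "example.com", "HostedZoneId": None},
    "AlertSubscription": {},
}

# The domain half *must* live in us-east-1 (Route53 query logging):
domain_stack = BaseStackDomain(
    app, "BaseStackDomain", config=config,
    env=Environment(region="us-east-1"),
)
# The main half lives wherever the containers run:
main_stack = BaseStackMain(
    app, "BaseStackMain", config=config,
    env=Environment(region="us-west-2"),  # assumed container region
)
# ...then any number of leaf stacks build on both halves.
app.synth()
```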

## Base Stack Config Options

17 changes: 17 additions & 0 deletions ContainerManager/base_stack/README.md
@@ -0,0 +1,17 @@
# Base Stack Summary

This is common architecture between leaf-stacks, combined to reduce costs and complexity.

## Base Stack Main ([main.py](./main.py))

Deployed to the same region you want to run the containers in.

- **VPC**: The overall network for all the containers and EFS. We used a public VPC, because a private one costs ~$32/month per subnet (because of the NAT). WITH EC2 costs, I want to shoot for less than $100/year with solid usage.
- **SSH Key Pair**: The key pair to SSH into the EC2 instances. Keeping it here lets you get into all the leaf_stacks without having to log into AWS each time you deploy a new leaf. If you destroy and re-build the leaf, this keeps the key consistent too.
- **SNS Notify Logic**: Designed for things an admin would care about. This tells you whenever the instance spins up or down, if it runs into errors, etc.
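
As a concrete illustration of the notify wiring (the topic ID and email address are placeholders; the project presumably does this inside its `add_sns_subscriptions` helper):

```python
from aws_cdk import aws_sns as sns, aws_sns_subscriptions as subscriptions

# Placeholder names: one admin topic, one email endpoint.
notify_topic = sns.Topic(self, "SnsNotifyTopic")
notify_topic.add_subscription(
    subscriptions.EmailSubscription("admin@example.com"),
)
```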

## Base Stack Domain ([domain.py](./domain.py))

Deployed to `us-east-1`, since Route53 logs can only go there.

- **Route53 HostedZone**: The base domain for all the leaf stacks. Each leaf stack adds its DNS record to this zone, and watches the zone's query log group for when its specific record gets a query.
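
To show the consuming side, a leaf stack could watch the shared query log group for its own record with something roughly like this. The lambda, the `container_url` value, and the cross-region plumbing are assumptions here:

```python
from aws_cdk import (
    aws_logs as logs,
    aws_logs_destinations as logs_destinations,
)

# Fire a lambda whenever this leaf's DNS name shows up in the query logs.
# Spaces around the name avoid matching the paired "_tcp" query:
logs.SubscriptionFilter(
    self, "DnsQueryTrigger",
    log_group=base_stack_domain.route53_query_log_group,
    destination=logs_destinations.LambdaDestination(spin_up_lambda),  # assumed
    filter_pattern=logs.FilterPattern.literal(f" {container_url} "),
)
```
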
7 changes: 7 additions & 0 deletions ContainerManager/base_stack/__init__.py
@@ -0,0 +1,7 @@
"""
The different components of the base stack, broken
apart since they're in different regions.
"""

from .main import BaseStackMain
from .domain import BaseStackDomain
97 changes: 97 additions & 0 deletions ContainerManager/base_stack/domain.py
@@ -0,0 +1,97 @@

"""
This module contains the BaseStackDomain class.
"""

from constructs import Construct
from aws_cdk import (
Stack,
RemovalPolicy,
aws_route53 as route53,
aws_logs as logs,
aws_iam as iam,
)

# from cdk_nag import NagSuppressions


class BaseStackDomain(Stack):
"""
Contains shared resources for all leaf stacks.
Most importantly, the hosted zone.
"""
def __init__(
self,
scope: Construct,
construct_id: str,
config: dict,
**kwargs,
) -> None:
super().__init__(scope, construct_id, **kwargs)


#####################
### Route53 STUFF ###
#####################
### These are also imported to other stacks, so save them here:
self.domain_name = config["Domain"]["Name"]
## The instance isn't up, use the "unknown" ip address:
# https://www.lifewire.com/four-zero-ip-address-818384
self.unavailable_ip = "0.0.0.0"
## Never set TTL to 0, it's not defined in the standard
# (Since the container is constantly changing, update DNS asap)
self.dns_ttl = 1
self.record_type = route53.RecordType.A


## Log group for the Route53 DNS logs:
# https://docs.aws.amazon.com/cdk/api/v2/docs/aws-cdk-lib.aws_logs.LogGroup.html
self.route53_query_log_group = logs.LogGroup(
self,
"QueryLogGroup",
log_group_name=f"/aws/route53/{construct_id}-query-logs",
# Only need logs to trigger the lambda, don't need long-term:
retention=logs.RetentionDays.ONE_DAY,
removal_policy=RemovalPolicy.DESTROY,
)
## You can't grant direct access after creating the hosted_zone, since it needs to
# write to the log group when you create the zone. AND you can't use a wildcard arn, since the
# account number isn't in the arn.
self.route53_query_log_group.grant_write(iam.ServicePrincipal("route53.amazonaws.com"))

## The subdomain for the Hosted Zone:
# https://docs.aws.amazon.com/cdk/api/v2/docs/aws-cdk-lib.aws_route53.PublicHostedZone.html
self.hosted_zone = route53.PublicHostedZone(
self,
"HostedZone",
zone_name=self.domain_name,
query_logs_log_group_arn=self.route53_query_log_group.log_group_arn,
comment=f"{construct_id}: DNS query for all containers.",
)

## If you bought a domain through AWS, you'll have an existing Hosted Zone. We can't
# modify it, so we import it and tie ours to the existing one:
if config["Domain"]["HostedZoneId"]:
## Import the existing Route53 Hosted Zone:
# https://docs.aws.amazon.com/cdk/api/v2/docs/aws-cdk-lib.aws_route53.PublicHostedZoneAttributes.html
self.imported_hosted_zone = route53.PublicHostedZone.from_hosted_zone_attributes(
self,
"RootHostedZone",
hosted_zone_id=config["Domain"]["HostedZoneId"],
zone_name=self.domain_name,
)
else:
# This is checked in the leaf stack, to see if it needs to add
# an NS record to this hosted zone.
self.imported_hosted_zone = None

#####################
### Export Values ###
#####################
## To stop cdk from trying to delete the exports when this stack is deployed by
## itself, but still has leaf stacks attached to it.
# https://blogs.thedevs.co/aws-cdk-export-cannot-be-deleted-as-it-is-in-use-by-stack-5c205b8004b4
self.export_value(self.hosted_zone.hosted_zone_name_servers)
self.export_value(self.route53_query_log_group.log_group_arn)
self.export_value(self.hosted_zone.hosted_zone_id)
self.export_value(self.route53_query_log_group.log_group_name)
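
For illustration, the consuming side of these exports is just an attribute reference; CDK generates the matching import, and the `export_value()` calls above keep the exports alive even when no leaf stack needs them. A hedged sketch of a leaf stack using the shared zone (construct IDs assumed, cross-region wiring ignored):

```python
from aws_cdk import Duration, aws_route53 as route53

# Park the leaf's record on the "unavailable" IP until an instance is up:
route53.ARecord(
    self, "ContainerRecord",
    zone=base_stack_domain.hosted_zone,
    target=route53.RecordTarget.from_ip_addresses(base_stack_domain.unavailable_ip),
    ttl=Duration.seconds(base_stack_domain.dns_ttl),
)
```
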
ContainerManager/base_stack/main.py
@@ -1,24 +1,22 @@

"""
This module contains the ContainerManagerBaseStack class.
This module contains the BaseStackMain class.
"""

from constructs import Construct
from aws_cdk import (
Stack,
RemovalPolicy,
aws_ec2 as ec2,
aws_route53 as route53,
aws_sns as sns,
aws_iam as iam,
)

from cdk_nag import NagSuppressions

# from .utils.get_param import get_param
from .utils.sns_subscriptions import add_sns_subscriptions
from ContainerManager.utils.sns_subscriptions import add_sns_subscriptions

class ContainerManagerBaseStack(Stack):
class BaseStackMain(Stack):
"""
Contains shared resources for all leaf stacks.
Most importantly, the VPC and SNS.
@@ -34,13 +32,6 @@ def __init__(
) -> None:
super().__init__(scope, construct_id, **kwargs)

### Fact-check the maturity, and save it for leaf stacks:
# (Makefile defaults to prod if not set. We want to fail-fast
# here, so throw if it doesn't exist)
self.maturity = self.node.get_context("maturity")
supported_maturities = ["devel", "prod"]
assert self.maturity in supported_maturities, f"ERROR: Unknown maturity. Must be in {supported_maturities}"

#################
### VPC STUFF ###
#################
@@ -102,34 +93,6 @@
)
add_sns_subscriptions(self, self.sns_notify_topic, config["AlertSubscription"])


#####################
### Route53 STUFF ###
#####################
# domain_name is imported to other stacks, so save it to this one:
self.domain_name = config["Domain"]["Name"]
self.root_hosted_zone_id = config["Domain"].get("HostedZoneId")

if config["Domain"]["HostedZoneId"]:
## Import the existing Route53 Hosted Zone:
# https://docs.aws.amazon.com/cdk/api/v2/docs/aws-cdk-lib.aws_route53.PublicHostedZoneAttributes.html
self.root_hosted_zone = route53.PublicHostedZone.from_hosted_zone_attributes(
self,
"RootHostedZone",
hosted_zone_id=config["Domain"]["HostedZoneId"],
zone_name=self.domain_name,
)
else:
## Create a Route53 Hosted Zone:
# https://docs.aws.amazon.com/cdk/api/v2/docs/aws-cdk-lib.aws_route53.PublicHostedZone.html
self.root_hosted_zone = route53.PublicHostedZone(
self,
"RootHostedZone",
zone_name=self.domain_name,
comment=f"Hosted zone for {construct_id}: {self.domain_name}",
)
self.root_hosted_zone.apply_removal_policy(RemovalPolicy.DESTROY)

#####################
### Export Values ###
#####################
21 changes: 11 additions & 10 deletions ContainerManager/leaf_stack/NestedStacks/AsgStateChangeHook.py
@@ -20,7 +20,7 @@

from cdk_nag import NagSuppressions

from ContainerManager.leaf_stack.domain_stack import DomainStack
from ContainerManager.base_stack import BaseStackDomain

class AsgStateChangeHook(NestedStack):
"""
@@ -31,7 +31,8 @@ def __init__(
self,
scope: Construct,
container_id: str,
domain_stack: DomainStack,
container_url: str,
base_stack_domain: BaseStackDomain,
ecs_cluster: ecs.Cluster,
ec2_service: ecs.Ec2Service,
auto_scaling_group: autoscaling.AutoScalingGroup,
@@ -82,11 +83,11 @@ def __init__(
log_group=self.log_group_asg_statechange_hook,
role=self.asg_state_change_role,
environment={
"HOSTED_ZONE_ID": domain_stack.sub_hosted_zone.hosted_zone_id,
"DOMAIN_NAME": domain_stack.sub_domain_name,
"UNAVAILABLE_IP": domain_stack.unavailable_ip,
"DNS_TTL": str(domain_stack.dns_ttl),
"RECORD_TYPE": domain_stack.record_type.value,
"HOSTED_ZONE_ID": base_stack_domain.hosted_zone.hosted_zone_id,
"DOMAIN_NAME": container_url,
"UNAVAILABLE_IP": base_stack_domain.unavailable_ip,
"DNS_TTL": str(base_stack_domain.dns_ttl),
"RECORD_TYPE": base_stack_domain.record_type.value,
"ECS_CLUSTER_NAME": ecs_cluster.cluster_name,
"ECS_SERVICE_NAME": ec2_service.service_name,
},
@@ -120,20 +121,20 @@
resources=[ec2_service.service_arn],
)
)
## Let it update the DNS record of this stack:
## Let it update the DNS record of the base stack:
self.asg_state_change_policy.add_statements(
iam.PolicyStatement(
effect=iam.Effect.ALLOW,
actions=["route53:ChangeResourceRecordSets"],
resources=[domain_stack.sub_hosted_zone.hosted_zone_arn],
resources=[base_stack_domain.hosted_zone.hosted_zone_arn],
)
)

## EventBridge Rule: This is actually what hooks the Lambda to the ASG/Instance.
# Needed to keep the management in sync with whether a container is running.
# https://docs.aws.amazon.com/cdk/api/v2/docs/aws-cdk-lib.aws_events.Rule.html
message_up = events.RuleTargetInput.from_text(
f"Container for '{container_id}' is starting up! Connect to it at: '{domain_stack.sub_domain_name}'.",
f"Container for '{container_id}' is starting up! Connect to it at: '{container_url}'.",
)
self.rule_asg_state_change_trigger_up = events.Rule(
self,
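
To ground the environment variables above, here is a rough sketch of what the hook lambda plausibly does with them. The event parsing and the `get_instance_ip` helper are assumptions; the real handler lives elsewhere in the repo:

```python
import os
import boto3

route53 = boto3.client("route53")

def get_instance_ip(event: dict) -> str:
    # Hypothetical helper: look up the public IP of the launched instance.
    instance_id = event["detail"]["EC2InstanceId"]
    reply = boto3.client("ec2").describe_instances(InstanceIds=[instance_id])
    return reply["Reservations"][0]["Instances"][0]["PublicIpAddress"]

def lambda_handler(event: dict, _context) -> None:
    # On spin-up, point DNS at the new instance; on spin-down,
    # park it back on the "unavailable" IP:
    spinning_up = event["detail-type"] == "EC2 Instance Launch Successful"
    new_ip = get_instance_ip(event) if spinning_up else os.environ["UNAVAILABLE_IP"]
    route53.change_resource_record_sets(
        HostedZoneId=os.environ["HOSTED_ZONE_ID"],
        ChangeBatch={"Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": os.environ["DOMAIN_NAME"],
                "Type": os.environ["RECORD_TYPE"],
                "TTL": int(os.environ["DNS_TTL"]),
                "ResourceRecords": [{"Value": new_ip}],
            },
        }]},
    )
```
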
13 changes: 7 additions & 6 deletions ContainerManager/leaf_stack/NestedStacks/Dashboard.py
@@ -10,7 +10,7 @@
)
from constructs import Construct

from ContainerManager.leaf_stack.domain_stack import DomainStack
from ContainerManager.base_stack import BaseStackDomain
## Import the other Nested Stacks:
from . import Container, EcsAsg, Watchdog, AsgStateChangeHook

@@ -31,7 +31,8 @@ def __init__(
container_id: str,
main_config: dict,

domain_stack: DomainStack,
base_stack_domain: BaseStackDomain,
dns_log_query_filter: str,
container_nested_stack: Container,
ecs_asg_nested_stack: EcsAsg,
watchdog_nested_stack: Watchdog,
@@ -71,17 +72,17 @@ def __init__(
## Route53 DNS logs for spinning up the system:
# https://docs.aws.amazon.com/cdk/api/v2/docs/aws-cdk-lib.aws_cloudwatch.LogQueryWidget.html
cloudwatch.LogQueryWidget(
title=f"(DNS Traffic) Start's Up System - [{domain_stack.region}: {domain_stack.route53_query_log_group.log_group_name}]",
log_group_names=[domain_stack.route53_query_log_group.log_group_name],
region=domain_stack.region,
title=f"(DNS Traffic) Start's Up System - [{base_stack_domain.region}: {base_stack_domain.route53_query_log_group.log_group_name}]",
log_group_names=[base_stack_domain.route53_query_log_group.log_group_name],
region=base_stack_domain.region,
width=12,
height=4,
query_lines=[
# The message also contains the timestamp, remove it:
"fields @timestamp, substr(@message, 25) as message",
# Spaces on either side, just like SubscriptionFilter, to not
# trigger on the "_tcp" query that pairs with the normal one:
f"filter @message like /{domain_stack.log_dns_filter}/",
f"filter @message like /{dns_log_query_filter}/",
],
),

4 changes: 3 additions & 1 deletion ContainerManager/leaf_stack/NestedStacks/README.md
@@ -30,7 +30,7 @@ This creates the Ecs Cluster/Service, AutoScaling Group, and EC2 Launch Template

### Watchdog

This monitors the container, and will spin down the ASG if any of its alarms goes off. There are three alarms that trigger the scaling down of the ASG:
This monitors the container, and will spin down the ASG if any of its alarms goes off.

There are three alarms that trigger the scaling down of the ASG:

@@ -50,6 +50,8 @@ This alarm will detect if the container unexpectedly stops for whatever reason,

The reason we trigger SNS off the alarm, instead of off the event rule directly, is that the rule can fire ~4 times before the lambda call finally spins down the ASG. That'd be ~4 emails at once. Also, by having an alarm, we can add it to the dashboard for easy monitoring.

**NOTE:** The Mermaid graph shows this triggering by using the `Scale Down ASG Action`. I couldn't figure out how to make the lambda call an existing action, so instead it just spins down the ASG directly with a [boto3 set_desired_capacity](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/autoscaling/client/set_desired_capacity.html) call. It's easier to follow the graph if all three "scale down" actions are the same, and it's basically the same logic anyways. (I'm open to a PR if the logic ends up being simple. I think you might have to use a [put_scaling_policy](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/autoscaling/client/put_scaling_policy.html)? But I don't know how to actually trigger an existing one. What would be REALLY nice is if [Events Rule Target](https://docs.aws.amazon.com/cdk/api/v2/docs/aws-cdk-lib.aws_events.IRuleTarget.html) added support for ASG desired count, then we could remove the lambda function altogether.)
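
For reference, the scale-down call described above is tiny (the ASG name is assumed; the real lambda would read it from its environment):

```python
import boto3

autoscaling = boto3.client("autoscaling")
# Spin the single-instance ASG down to zero:
autoscaling.set_desired_capacity(
    AutoScalingGroupName="leaf-stack-asg",  # assumed name
    DesiredCapacity=0,
)
```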

### AsgStateChangeHook

This component will trigger whenever the ASG instance state changes (i.e. the one instance either spins up or down). This keeps the architecture simple, plus if you update the instance count in the console, everything will naturally update around it.