BFD-3313: Adjust autoscaling policies #2245

malessi · 2024-03-14T16:52:29Z

JIRA Ticket:
BFD-3313

User Story or Bug Summary:

As a US taxpayer, I want BFD to be as efficient as possible so that I can be confident that my tax dollars are not going to waste.

What Does This PR Do?

This PR updates the BFD Server's ASG scaling policies to scale on average CPU utilization across the ASG. The new policies are designed with the following in mind:

BFD Server typically sees large bulk loads, rather than consistent, sustained traffic
Average CPU utilization is inversely proportional to the number of instances in the ASG
- That is, average utilization goes down as the number of instances increases
Step Scaling policies are driven by CloudWatch Alarms, which can remain in ALARM (and thus keep triggering scaling) after scale-out begins but before the newly launched instances contribute to the average CPU utilization metric
- This creates a feedback loop: with traditional Step Scaling policies tuned to respond aggressively, the ASG keeps scaling out until it reaches the maximum
- This is largely because Step Scaling policies are designed for consistent, sustained traffic that may gradually increase in intensity, which BFD Server does not typically see
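
The feedback described above can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not BFD's actual configuration: it assumes a fixed total load spread evenly across instances, and every constant (`STEP`, `THRESHOLD`, `WARMUP_TICKS`) and function name is hypothetical.

```python
# Toy model of a lagging Step Scaling alarm. All names and numbers here are
# illustrative assumptions, not values taken from the BFD Terraform.
MIN_INSTANCES = 3
MAX_INSTANCES = 12
STEP = 3            # instances added each time the alarm is evaluated in ALARM
THRESHOLD = 50.0    # alarm threshold, percent average CPU
WARMUP_TICKS = 3    # evaluation periods before new instances report CPU


def average_cpu(total_load: float, reporting: int) -> float:
    """Average CPU when a fixed load is split evenly across instances.

    total_load is expressed in "percent of one instance" units.
    """
    return min(100.0, total_load / reporting)


def simulate_step_scaling(total_load: float, ticks: int) -> list[int]:
    """Return the ASG's desired capacity after each evaluation period."""
    desired = MIN_INSTANCES
    reporting = MIN_INSTANCES
    launching: list[tuple[int, int]] = []  # (tick when ready, batch size)
    history = []
    for tick in range(ticks):
        # Batches that finished warming up now count toward the average.
        reporting += sum(size for ready, size in launching if ready <= tick)
        launching = [(ready, size) for ready, size in launching if ready > tick]
        # The alarm only sees instances that already report metrics.
        alarm = average_cpu(total_load, reporting) > THRESHOLD
        if alarm and desired < MAX_INSTANCES:
            new_desired = min(MAX_INSTANCES, desired + STEP)
            launching.append((tick + WARMUP_TICKS, new_desired - desired))
            desired = new_desired
        history.append(desired)
    return history


if __name__ == "__main__":
    # A load that 6 instances would absorb at 30% average CPU.
    print(simulate_step_scaling(180.0, 6))
```

In this model a load that 6 instances could handle at 30% average CPU still drives the desired capacity to 12, because the alarm keeps firing while the first batch is warming up.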

With these assumptions in mind, this PR implements a bespoke CPU-based scaling solution that can scale aggressively without falling into the runaway scale-out behavior described above.

Specifically, scaling is now driven by ranges of average CPU utilization:

0% to 50% CPU: No scaling
50% to 75% CPU:
- At 3 instances, scale to 6
- At 6 instances, scale to 9
- At 9 instances, scale to 12
75% to 90% CPU:
- At 3 instances, scale to 9
- At 6 instances, scale to 12
- At 9 instances, scale to 12
90% CPU and above:
- Scale to 12, regardless of current instance count
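
The ranges above can be read as a lookup from (average CPU, current instance count) to a target capacity. The following is a minimal sketch of that mapping, not the Terraform in this PR. The names `SCALING_RANGES` and `target_capacity` are hypothetical, and two details are assumptions because the description does not state them: each range's lower bound is treated as inclusive, and instance counts other than 3, 6, and 9 are left unchanged.

```python
MAX_INSTANCES = 12

# (lower bound, upper bound, {current instances: target instances})
# Assumption: lower bounds are inclusive and upper bounds exclusive.
SCALING_RANGES = [
    (50.0, 75.0, {3: 6, 6: 9, 9: 12}),
    (75.0, 90.0, {3: 9, 6: 12, 9: 12}),
]


def target_capacity(avg_cpu: float, current: int) -> int:
    """Return the instance count to scale to for the given average CPU."""
    if avg_cpu >= 90.0:
        # 90% and above: go straight to the maximum.
        return MAX_INSTANCES
    for low, high, targets in SCALING_RANGES:
        if low <= avg_cpu < high:
            # Assumption: counts not listed in the table stay as they are.
            return targets.get(current, current)
    # Below 50%: no scaling.
    return current


if __name__ == "__main__":
    for cpu, count in [(40.0, 3), (60.0, 3), (80.0, 3), (95.0, 6)]:
        print(cpu, count, "->", target_capacity(cpu, count))
```

Because each target is an absolute instance count rather than a relative step, re-evaluating the same range while new instances are still launching requests the same capacity again instead of adding more.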

These ranges, and their corresponding scaling actions, are designed to let the Server respond aggressively to large incoming workloads while accounting for the inverse relationship between average CPU utilization and instance count.
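
As a rough illustration of that relationship, the average CPU expected after a scale-out can be estimated from the average before it. This is a back-of-the-envelope sketch that assumes the total work stays constant and is balanced evenly across instances, which real traffic will only approximate. The function name `projected_cpu` is hypothetical.

```python
def projected_cpu(avg_cpu: float, current: int, target: int) -> float:
    """Estimate average CPU after scaling from `current` to `target` instances.

    Assumes constant total work, evenly balanced: total = avg_cpu * current.
    """
    return avg_cpu * current / target


if __name__ == "__main__":
    # 3 instances at 74% average CPU, scaled to 6 instances.
    print(projected_cpu(74.0, 3, 6))
    # 6 instances at 60% average CPU, scaled to 9 instances.
    print(projected_cpu(60.0, 6, 9))
```

Note that an instance pinned at 100% CPU may have more queued work than the metric shows, so this estimate is best treated as a lower bound under heavy load.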

This PR has been verified by:

Running terraform apply against test and verifying that it applies properly and that all policies and Alarms are created
Using the Regression Suite to generate load with the policies configured with much lower bounds (to simulate higher load) and verifying that:
- the Server scales
- the Server does not erroneously scale beyond what is expected
- the Server scales in after load has ceased, as expected

What Should Reviewers Watch For?

If you're reviewing this PR, please check for these things in particular:

Verify all PR security questions and checklists have been completed and addressed.

What Security Implications Does This PR Have?

Submitters should complete the following questionnaire:

If the answer to any of the questions below is Yes, then you must supply a link to the associated Security Impact Assessment (SIA), security checklist, or other similar document in Confluence here: N/A
- Does this PR add any new software dependencies?
  - Yes
  - No
- Does this PR modify or invalidate any of our security controls?
  - Yes
  - No
- Does this PR store or transmit data that was not stored or transmitted before?
  - Yes
  - No
If the answer to any of the questions below is Yes, then please add @StewGoin as a reviewer, and note that this PR should not be merged unless/until he also approves it.
- Do you think this PR requires additional review of its security implications for other reasons?
  - Yes
  - No

What Needs to Be Merged and Deployed Before this PR?

This PR cannot be either merged or deployed until the following prerequisite changes have been fully deployed:

N/A

Submitter Checklist

I have gone through and verified that...:

I have named this PR and branch so they are automatically linked to the (most) relevant Jira issue, e.g. BFD-123: Adds foo
This PR is reasonably limited in scope, to help ensure that:
1. It doesn't unnecessarily tie a bunch of disparate features, fixes, refactorings, etc. together.
2. There isn't too much of a burden on reviewers.
3. Any problems it causes have a small "blast radius".
4. It'll be easier to rollback if that becomes necessary.
This PR includes any required documentation changes, including README updates and changelog / release notes entries.
The data dictionary has been updated with any field mapping changes, if any were made.
All new and modified code is appropriately commented, such that the what and why of its design would be reasonably clear to engineers, preferably ones unfamiliar with the project.
All tech debt and/or shortcomings introduced by this PR are detailed in TODO and/or FIXME comments, which include a JIRA ticket ID for any items that require urgent attention.
Reviews are requested from both:
- At least two other engineers on this project, at least one of whom is a senior engineer or owns the relevant component(s) here.
- Any relevant engineers on other projects (e.g. DC GEO, BB2, etc.).
Any deviations from the other policies in the DASG Engineering Standards are specifically called out in this PR, above.
- Please review the standards every few months to ensure you're familiar with them.

brandoncruz3

👍

mjburling

Indeed, this ought to allow us to run BFD more efficiently and I think the scaling criteria/thresholds seem reasonable. Looking forward to the results.

Thanks, Mitch!

malessi requested review from keithdadkins, mjburling, brandoncruz3, dondevun and meliGuzman as code owners March 14, 2024 16:52

malessi marked this pull request as draft March 14, 2024 16:52

Mitch Alessio added 4 commits March 15, 2024 15:10

Switch to RequestCount as scaling metric; reduce scale-in protection period to 10 minutes instead of 15

50b4ce2

Fix scaling range bounds for 12 and 9 instances

4ff41d3

Switch to specialized CPU-based scaling policies

b46819c

Fix various issues with scaling policy definitions

5030959

malessi force-pushed the alessio/BFD-3313__fix-server-scaling-policies branch from 0f460be to 5030959 Compare March 15, 2024 19:27

dondevun approved these changes Mar 15, 2024

malessi marked this pull request as ready for review March 15, 2024 19:42

brandoncruz3 approved these changes Mar 15, 2024

mjburling approved these changes Mar 18, 2024

mjburling merged commit 445e735 into master Mar 18, 2024
9 checks passed

mjburling deleted the alessio/BFD-3313__fix-server-scaling-policies branch March 18, 2024 13:56
