BFD-3313: Adjust autoscaling policies #2245

Merged: 4 commits, Mar 18, 2024

@malessi malessi commented Mar 14, 2024

JIRA Ticket:
BFD-3313

User Story or Bug Summary:

As a US taxpayer, I want BFD to be as efficient as possible so that I can be confident that my tax dollars are not going to waste.


What Does This PR Do?

This PR updates the BFD Server's ASG scaling policies to scale based upon average CPU utilization across the ASG. The new policies are designed with the following in mind:

  • BFD Server typically sees large bulk loads, rather than consistent, sustained traffic
  • Average CPU utilization is inversely proportional to the number of instances in the ASG
    • That is, for a given load, the average utilization goes down as the number of instances increases
  • Step Scaling policies are driven by CloudWatch Alarms, which can remain in ALARM (and thus keep triggering scaling) after scale-out begins but before the scaled-out instances contribute to the average CPU utilization metric
    • This creates a feedback loop: with traditional Step Scaling policies tuned to respond aggressively, the ASG keeps scaling out until it reaches the maximum
    • This is largely because Step Scaling policies are designed for consistent, sustained traffic that may gradually increase in intensity, which BFD Server does not typically see
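
To make the feedback concrete, here is a toy simulation of a naive step policy. It is only a sketch under assumptions: the function name, the fixed step of 3, the 50% threshold, and the idea that new instances stay out of the metric for a fixed number of evaluation periods are all illustrative, and it does not model CloudWatch or the actual Terraform in this PR.

```python
def simulate_naive_step_scaling(total_load, start=3, maximum=12, step=3,
                                threshold=50.0, warmup_periods=3, periods=10):
    """Return the desired capacity after each evaluation period.

    total_load is expressed in "instance-percent" (3 instances at 80% = 240).
    All parameters are hypothetical and chosen only for illustration.
    """
    in_service = start  # instances that already contribute to the metric
    pending = []        # [periods_remaining, count] for each launched batch
    history = []
    for _ in range(periods):
        # Batches that finished warming up start contributing to the average.
        for batch in pending:
            batch[0] -= 1
        in_service += sum(count for remaining, count in pending if remaining <= 0)
        pending = [batch for batch in pending if batch[0] > 0]

        # The metric only sees in-service instances, so it stays high during warm-up.
        measured = min(100.0, total_load / in_service)
        desired = in_service + sum(count for _, count in pending)

        # Naive policy: while the alarm condition holds, add another step.
        if measured > threshold and desired < maximum:
            added = min(step, maximum - desired)
            pending.append([warmup_periods, added])
            desired += added
        history.append(desired)
    return history


if __name__ == "__main__":
    # 3 instances at 80% average CPU; 6 instances would bring this down to 40%.
    print(simulate_naive_step_scaling(240.0))
    # With a short warm-up the metric catches up and scaling stops at 6.
    print(simulate_naive_step_scaling(240.0, warmup_periods=1))
```

In this model, the same load that 6 instances could absorb drives the ASG to 12 when the metric lags behind the launches, which is the behavior the points above describe.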

With the above assumptions in mind, this PR implements a bespoke CPU-based scaling solution that can scale aggressively without falling into the runaway scale-out described above.

Specifically, scaling is now driven by ranges of average CPU utilization:

  • 0% to 50% CPU: No scaling
  • 50% to 75% CPU:
    • At 3 instances, scale to 6
    • At 6 instances, scale to 9
    • At 9 instances, scale to 12
  • 75% to 90% CPU:
    • At 3 instances, scale to 9
    • At 6 instances, scale to 12
    • At 9 instances, scale to 12
  • 90% CPU and above:
    • Scale to 12, regardless of current instance count
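
The ranges above can be read as a lookup from (average CPU, current instance count) to a target capacity. The sketch below is illustrative only: the names `SCALING_TABLE` and `target_capacity` are hypothetical, the lower bound of each range is assumed to be inclusive, and any instance count not listed above is assumed to trigger no action. The actual policies are defined in Terraform, not in Python.

```python
MAX_INSTANCES = 12  # maximum capacity implied by the ranges above

# (lower bound inclusive, upper bound exclusive) -> {current count: target count}
# Boundary handling is an assumption; the PR description does not specify it.
SCALING_TABLE = [
    ((50.0, 75.0), {3: 6, 6: 9, 9: 12}),
    ((75.0, 90.0), {3: 9, 6: 12, 9: 12}),
]


def target_capacity(avg_cpu: float, current: int) -> int:
    """Return the instance count to scale to for a given average CPU."""
    if avg_cpu < 50.0:
        return current          # 0% to 50%: no scaling
    if avg_cpu >= 90.0:
        return MAX_INSTANCES    # 90% and above: scale to 12 regardless
    for (low, high), targets in SCALING_TABLE:
        if low <= avg_cpu < high:
            # Counts not listed in the table are left unchanged (assumption).
            return targets.get(current, current)
    return current


if __name__ == "__main__":
    for cpu, count in [(40, 3), (60, 3), (80, 3), (80, 6), (95, 6)]:
        print(cpu, count, "->", target_capacity(cpu, count))
```

Because each target is an absolute capacity rather than a relative step, re-evaluating the same range at the same starting count yields the same target instead of stacking further increments.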

These ranges, and corresponding scaling actions, are designed to allow the Server to respond aggressively to large incoming workloads while considering the inverse proportionality of CPU and instance count.
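
As a rough worked example of the inverse proportionality, the sketch below estimates the average CPU after a scaling action. It assumes the total load stays constant and spreads evenly across instances, which real traffic will only approximate; `projected_avg_cpu` is a hypothetical helper, not something in the BFD codebase.

```python
def projected_avg_cpu(avg_cpu: float, current: int, target: int) -> float:
    """Estimate average CPU after scaling from `current` to `target` instances.

    Assumes constant total load, evenly distributed (an idealization).
    """
    return avg_cpu * current / target


if __name__ == "__main__":
    print(projected_avg_cpu(60.0, 3, 6))   # 50-75% range, 3 -> 6 instances
    print(projected_avg_cpu(80.0, 3, 9))   # 75-90% range, 3 -> 9 instances
    print(projected_avg_cpu(90.0, 3, 12))  # 90%+ range, 3 -> 12 instances
```

Under these assumptions, each listed action lands the projected average below the 50% threshold, so the scaled-out ASG would sit in the "no scaling" range once the new instances are counted.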

This PR has been verified by:

  • Running terraform apply against test, verifying that it applies properly and that all policies and Alarms are created
  • Using the Regression Suite to generate load with policies configured with much lower thresholds (to simulate higher load), verifying that:
    • the Server scales
    • the Server does not erroneously scale beyond what is expected
    • the Server scales in after load has ceased, as expected

What Should Reviewers Watch For?

If you're reviewing this PR, please check for these things in particular:

  • Verify all PR security questions and checklists have been completed and addressed.

What Security Implications Does This PR Have?

Submitters should complete the following questionnaire:

  • If the answer to any of the questions below is Yes, then you must supply a link to the associated Security Impact Assessment (SIA), security checklist, or other similar document in Confluence here: N/A

    • Does this PR add any new software dependencies?
      • Yes
      • No
    • Does this PR modify or invalidate any of our security controls?
      • Yes
      • No
    • Does this PR store or transmit data that was not stored or transmitted before?
      • Yes
      • No
  • If the answer to any of the questions below is Yes, then please add @StewGoin as a reviewer, and note that this PR should not be merged unless/until he also approves it.

    • Do you think this PR requires additional review of its security implications for other reasons?
      • Yes
      • No

What Needs to Be Merged and Deployed Before this PR?

This PR cannot be either merged or deployed until the following prerequisite changes have been fully deployed:

  • N/A

Submitter Checklist

I have gone through and verified that...:

  • I have named this PR and branch so they are automatically linked to the (most) relevant Jira issue, e.g. BFD-123: Adds foo
  • This PR is reasonably limited in scope, to help ensure that:
    1. It doesn't unnecessarily tie a bunch of disparate features, fixes, refactorings, etc. together.
    2. There isn't too much of a burden on reviewers.
    3. Any problems it causes have a small "blast radius".
    4. It'll be easier to roll back if that becomes necessary.
  • This PR includes any required documentation changes, including README updates and changelog / release notes entries.
  • The data dictionary has been updated with any field mapping changes, if any were made.
  • All new and modified code is appropriately commented, such that the what and why of its design would be reasonably clear to engineers, preferably ones unfamiliar with the project.
  • All tech debt and/or shortcomings introduced by this PR are detailed in TODO and/or FIXME comments, which include a JIRA ticket ID for any items that require urgent attention.
  • Reviews are requested from both:
    • At least two other engineers on this project, at least one of whom is a senior engineer or owns the relevant component(s) here.
    • Any relevant engineers on other projects (e.g. DC GEO, BB2, etc.).
  • Any deviations from the other policies in the DASG Engineering Standards are specifically called out in this PR, above.
    • Please review the standards every few months to ensure you're familiar with them.

@malessi malessi force-pushed the alessio/BFD-3313__fix-server-scaling-policies branch from 0f460be to 5030959 March 15, 2024 19:27
@malessi malessi marked this pull request as ready for review March 15, 2024 19:42

@brandoncruz3 brandoncruz3 left a comment

👍

@mjburling mjburling left a comment

Indeed, this ought to allow us to run BFD more efficiently and I think the scaling criteria/thresholds seem reasonable. Looking forward to the results.

Thanks, Mitch!

@mjburling mjburling merged commit 445e735 into master Mar 18, 2024
9 checks passed
@mjburling mjburling deleted the alessio/BFD-3313__fix-server-scaling-policies branch March 18, 2024 13:56

4 participants