
Technical Details

Summary of supported features & options

| Option | Status | Override using ASG tags |
|--------|--------|-------------------------|
| Run on multiple regions | ✅ (default: all) | |
| Keep a fixed minimum percentage of on-demand | ✅ (default: 0%) | |
| Keep a fixed minimum number of on-demand | ✅ (default: 0) | |
| Bid at a certain percentage of the on-demand price | ✅ (default: 100%) | |
| Can bid the current spot price plus a certain percentage | | |
| Automatically determine the cheapest compatible instance type | ✅ (default) | |
| Can restrict to the same instance type only | ✅ - use current for the allowed instance types | ✅ - use current for the allowed instance types |
| Can restrict to only certain instance types | ✅ | ✅ |
| Blacklisting of certain instance types | ✅ | ✅ |
| Filter on multiple & custom group tags | ✅ (default: spot-enabled=true) | |
| Configurable filtering modes (opt-in and opt-out) | ✅ (default: opt-in) | |
| Set a desired spot product name | ❌ 🔧 - install multiple stacks, each with its own spot product | |
| Configurable spot termination notification action | ✅ (Only available when installed using CloudFormation) | ✅ (Only available when installed using CloudFormation) |

For the options not directly linked to any specific part of the doc, please check the configuration page.

| Feature | Status |
|---------|--------|
| Easy installation via CloudFormation | ✅ |
| Easy installation via Terraform | ✅ |
| Available as Docker container image | ✅ 🔧 |
| Installable as Kubernetes cron job | ✅ 🔧 |
| Works with CodeDeploy | ✅ 🔧 |
| Works with Elastic Beanstalk | ✅ |
| Support AWS VPC | ✅ |
| Support AWS EC2Classic | 🔧 - unsupported instance types need to be explicitly blacklisted |
| Support AWS DefaultVPC | ✅ |
| Automatically handles the spot termination signal | ✅ (Only available when installed using CloudFormation) |
| Do not process AutoScaling groups while the CloudFormation stack that created them is in progress | ✅ |

| Desired missing features | Status |
|--------------------------|--------|
| Lambda X-Ray support | ❌ |
| Graphing savings | ❌ 🔧 - use the Billing dashboard |
| Cleaner Windows support | 🔧 - set the proper Spot product on the stack |
| SNS notifications on success/failure | ❌ |

Meaning of the above icons

  • ✅ - supported and known to work well so far
  • ❌ - not supported but its implementation has been considered or is awaiting code contributions
  • ➖ - not applicable, or already part of the default behavior.
  • 🪲 - implemented but experimental or known to be buggy
  • 🔧 - may require some workarounds, for example it may be done with external tooling or may need additional configuration on your infrastructure
  • 📝 - a workaround or complete fix can be implemented in a custom/hardcoded fork with relatively little effort, but a proper fix ready to be upstreamed needs more work.

Some of them can be clicked for more information; you can tell which ones by hovering over them with your mouse pointer. If you have any questions, you can always get in touch on Gitter.

Features and Benefits

  • Significant cost savings compared to on-demand or reserved instances

    • up to 90% cost reduction compared to on-demand instances.
    • up to 75% cost reduction compared to reserved instances, without any down-payment or long term commitment.
  • Easy to install and set up on existing environments based on AutoScaling

    • you can literally get started within minutes.
    • only needs to be installed once, in a single region, and can handle all other regions without any additional configuration (but can also be restricted to just a few regions if desired).
    • easy to enable and disable based on resource tagging, reverting to the initial configuration if you decide you don't want to use it anymore.
    • easy to automate migration of multiple existing stacks, simply using scripts that set the expected tags on multiple AutoScaling groups (see the sketch after this list).
  • Designed for use against AutoScaling groups with relatively long-running instances

    • for use cases where it is acceptable to run on-demand instances from time to time.
    • for short-term batch processing use cases you should have a look into the spot blocks instead.
  • It doesn't interfere with the group's original launch configuration

    • any instance replacement or scaling done by AutoScaling would still launch your previously configured on-demand instances.
    • on-demand instances often launch faster than spot ones, so you don't need to wait for potentially slower spot instance fulfilment when you need to scale out or when you lose some of the spot capacity.
  • Supports any higher level AWS services internally backed by AutoScaling

    • services such as ECS or Elastic Beanstalk work out of the box with minimal configuration changes or tweaks.
  • Compatible out of the box with most AWS services that integrate with AutoScaling groups

    • services such as ELB, ALB, CodeDeploy, CloudWatch, etc. should work out of the box or at most require minimal configuration changes.
    • as long as they support instances attached later to existing groups.
    • any other 3rd party services that run on top of AutoScaling groups should work as well.
  • Can automatically replace any instance types with any instance types available on the spot market

    • as long as they are cheaper and at least as big as the original instances.
    • the original instance type doesn't even need to be available on the spot market: for example, it often replaces t2.medium with better m4.large instances, as long as the latter happen to be cheaper.
  • Self-hosted

    • has no runtime dependencies on external infrastructure except for the regional EC2 and AutoScaling API endpoints.
    • it's not a SaaS, it fully runs within your AWS account.
    • it doesn't gather/persist/export any information about the resources running in your AWS account.
  • Free and open source

    • there are no service fees at install time or run time.
    • you only pay for the small runtime costs it generates.
    • open source, so it is fully auditable and you can see the logs of everything it does.
    • the code is relatively small and simple so in case of bugs or missing features you may even be able to fix it yourself.
  • Negligible runtime costs

    • you only pay for the bandwidth consumed performing API calls against AWS services across different regions.
    • backed by Lambda, with typical monthly execution time well within the Lambda free tier plan.
  • Minimalist and simple implementation

    • currently about 1000 CLOC of relatively readable Golang code.
    • stateless, and without many moving parts.
    • leveraging and relying on battle-tested AWS services - namely AutoScaling - for most mission-critical things, such as instance health checks, horizontal scaling, replacement of terminated instances, and integration with ELB, ALB and CloudWatch.
  • Relatively safe and secure

    • most runtime failures or crashes (quite rare nowadays) tend to be harmless.
    • they often only result in failing to launch new spot instances, so your group will simply remain on, or fall back to, on-demand capacity, just as it was before.
    • in most cases they impact neither your running instances nor the ability to launch new ones.
    • only needs the minimum set of IAM permissions needed for it to do its job.
    • does not delegate any IAM permissions to resources outside of your AWS account.
    • execution scope can be limited to a certain set of regions.
  • Optimizes for high availability over cost whenever possible

    • it tries to diversify the instance types to reduce the chance of simultaneous failures across the entire group. When the group has enough desired capacity, instances are often spread over four different spot pricing zones (instance type/availability zone combinations).
    • supports keeping a configurable number of on-demand instances in the group, either an absolute number or a percentage of the instances from the group.
  • Automatically handles the spot termination notifications

    • see the dedicated section below for more details
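
As mentioned in the list above, enabling AutoSpotting on existing groups is just a matter of tagging them, which is easy to script. Below is a minimal sketch of such a migration script using the AWS SDK for Go; the group names are placeholders for your own.

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

func main() {
	svc := autoscaling.New(session.Must(session.NewSession()))

	// Hypothetical group names; replace with your own.
	groups := []string{"my-web-asg", "my-worker-asg"}

	var tags []*autoscaling.Tag
	for _, g := range groups {
		tags = append(tags, &autoscaling.Tag{
			ResourceId:        aws.String(g),
			ResourceType:      aws.String("auto-scaling-group"),
			Key:               aws.String("spot-enabled"), // the default opt-in tag
			Value:             aws.String("true"),
			PropagateAtLaunch: aws.Bool(false),
		})
	}

	if _, err := svc.CreateOrUpdateTags(
		&autoscaling.CreateOrUpdateTagsInput{Tags: tags}); err != nil {
		log.Fatal(err)
	}
}
```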

Replacement logic

Once enabled on an AutoScaling group, it gradually replaces all the on-demand instances belonging to the group with compatible and similarly configured but cheaper spot instances.

The replacements are done using the relatively new Attach/Detach actions supported by the AutoScaling API. A new compatible spot instance is launched, and it will be immediately attached to the group, while at the same time an on-demand instance is detached from the group and terminated in order to keep the group at constant capacity.
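
Sketched with the AWS SDK for Go, the swap described above looks roughly like this (an illustration, not AutoSpotting's actual code; it assumes the spot instance was already launched and is running):

```go
package sketch

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/autoscaling"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// swapOnDemandForSpot attaches a running spot instance to the group, then
// detaches and terminates one of its on-demand instances.
func swapOnDemandForSpot(asSvc *autoscaling.AutoScaling, ec2Svc *ec2.EC2,
	group, onDemandID, spotID string) error {

	// Attaching increments the group's desired capacity by one.
	if _, err := asSvc.AttachInstances(&autoscaling.AttachInstancesInput{
		AutoScalingGroupName: aws.String(group),
		InstanceIds:          []*string{aws.String(spotID)},
	}); err != nil {
		return err
	}

	// Detaching with the decrement flag brings the desired capacity back
	// down, so the group stays at constant capacity overall.
	if _, err := asSvc.DetachInstances(&autoscaling.DetachInstancesInput{
		AutoScalingGroupName:           aws.String(group),
		InstanceIds:                    []*string{aws.String(onDemandID)},
		ShouldDecrementDesiredCapacity: aws.Bool(true),
	}); err != nil {
		return err
	}

	// The detached on-demand instance is no longer needed, terminate it.
	_, err := ec2Svc.TerminateInstances(&ec2.TerminateInstancesInput{
		InstanceIds: []*string{aws.String(onDemandID)},
	})
	return err
}
```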

When assessing the compatibility, it takes into account the hardware specs, such as CPU cores, RAM size, attached instance store volumes and their type and size, as well as the supported virtualization types (HVM or PV) of both instance types. The new spot instance is usually a few times cheaper than the original instance, while also often providing more computing capacity.

The new spot instance is configured with the same roles, security groups and tags, and set to execute the same user data script as the original instance, so from a functionality perspective it should be indistinguishable from other instances in the group, although its hardware specs may be slightly different (again: at least as big, but often of bigger capacity).

When replacing multiple instances in a group, the algorithm tries to use a wide variety of instance types, in order to reduce the probability of simultaneous failures that may impact the availability of the entire group. It always tries to launch the cheapest available compatible instance type, but if the group already has a considerable number of instances of that type in the same availability zone (currently more than 20% of the group's capacity in that zone is of that instance type), it picks the second cheapest compatible instance type, and so on.
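
A rough sketch of that selection logic, with simplified stand-in types and an assumed pre-computed per-zone share for each instance type:

```go
package sketch

import "sort"

// instanceSpec is a simplified stand-in for the real instance type data.
type instanceSpec struct {
	Type     string
	CPUCores int
	RAMGiB   float64
	Price    float64 // spot price for candidates, on-demand price for the original
}

// cheapestDiversified returns the cheapest compatible spot instance type,
// skipping types that already hold more than 20% of the group's capacity
// in the current availability zone.
func cheapestDiversified(original instanceSpec, candidates []instanceSpec,
	shareInZone map[string]float64) *instanceSpec {

	sort.Slice(candidates, func(i, j int) bool {
		return candidates[i].Price < candidates[j].Price
	})

	for i := range candidates {
		c := candidates[i]
		compatible := c.CPUCores >= original.CPUCores && // at least as big
			c.RAMGiB >= original.RAMGiB &&
			c.Price < original.Price // and cheaper than on-demand
		if compatible && shareInZone[c.Type] <= 0.20 {
			return &c
		}
	}
	return nil // nothing suitable; the group keeps its on-demand capacity
}
```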

During multiple replacements performed on a given group, it only swaps one instance at a time per Lambda function invocation, in order not to change the group too quickly, but instances belonging to multiple groups can be replaced concurrently. If you find this slow, the Lambda function invocation frequency (defaulting to once every 30 minutes) can be changed by updating the stack, which has a parameter for it.

In the (so far unlikely) case in which spot prices are so high that no suitable spot instances can be launched (and also in case of software crashes, which may still rarely happen), the group is left unchanged and keeps running as it is, but AutoSpotting will continuously attempt replacements until prices eventually decrease and replacements can succeed again.

Internal components

When deployed, the software consists of a number of resources running in your AWS account, created automatically with CloudFormation or Terraform:

Event generator

CloudWatch event source used for triggering the Lambda function. The default frequency is every 30 minutes, but it is configurable using stack parameters.
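
For illustration only, here is roughly the same setup expressed imperatively with the AWS SDK for Go; in practice this resource is created by CloudFormation or Terraform, the rule name and target ID are assumptions, and the Lambda invoke permission wiring is omitted:

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/cloudwatchevents"
)

func setupSchedule(lambdaARN string) error {
	svc := cloudwatchevents.New(session.Must(session.NewSession()))

	// A scheduled rule firing at the default 30-minute frequency.
	if _, err := svc.PutRule(&cloudwatchevents.PutRuleInput{
		Name:               aws.String("autospotting-schedule"),
		ScheduleExpression: aws.String("rate(30 minutes)"),
	}); err != nil {
		return err
	}

	// Point the rule at the main Lambda function.
	_, err := svc.PutTargets(&cloudwatchevents.PutTargetsInput{
		Rule: aws.String("autospotting-schedule"),
		Targets: []*cloudwatchevents.Target{{
			Id:  aws.String("autospotting-lambda"),
			Arn: aws.String(lambdaARN),
		}},
	})
	return err
}

func main() {
	// Placeholder ARN for the main Lambda function.
	arn := "arn:aws:lambda:us-east-1:123456789012:function:autospotting"
	if err := setupSchedule(arn); err != nil {
		log.Fatal(err)
	}
}
```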

Lambda function

  • AWS Lambda function connected to the event generator, which triggers it periodically.
  • It is assigned an IAM role and policy with a set of permissions to call the APIs of various AWS services (currently EC2 and AutoScaling) within the user's account.
  • The permissions are the minimal set required for it to work without needing to pass any explicit AWS credentials or access keys.
  • Some algorithm parameters can be configured using Lambda environment variables, based on some of the stack parameters (see the sketch after this list).
  • Contains a handler written in Golang which implements all the instance replacement logic.
  • The spot instances are created by duplicating the configuration of the currently running on-demand instances as closely as possible (IAM roles, security groups, user_data script, etc.), possibly changing the instance type to a usually bigger but compatible one.
  • The spot price is set by default to the on-demand price of the instances initially configured on the AutoScaling group, but this is configurable.
  • The new launch configuration may also have a different instance type, determined based on compatibility with the original instance type, also considering how much redundancy is needed in the current availability zone in order to survive instance terminations when outbid for a certain instance type.
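
As referenced above, parameters such as the bid percentage could be consumed inside the handler along these lines; the environment variable name is illustrative, not necessarily the one AutoSpotting actually uses:

```go
package sketch

import (
	"os"
	"strconv"
)

// spotBidPrice derives the bid from the group's original on-demand price,
// defaulting to 100% of it.
func spotBidPrice(onDemandPrice float64) float64 {
	pct := 100.0
	// Hypothetical variable name, set from a stack parameter.
	if v := os.Getenv("SPOT_PRICE_PERCENTAGE"); v != "" {
		if p, err := strconv.ParseFloat(v, 64); err == nil {
			pct = p
		}
	}
	return onDemandPrice * pct / 100
}
```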

Regional spot termination stacks

  • Additional CloudFormation stacks automatically deployed in every region when installing the main CloudFormation stack (currently not supported when installing using Terraform)
  • Install a few regional components (SNS topic, CloudWatch event rule, regional Lambda function, etc.) configured to trigger the main Lambda function deployed in us-east-1 when instances in the current region are about to be terminated.
  • The main Lambda function will take action based on these events. By default it terminates the instance through AutoScaling (executing its termination lifecycle hooks) if such hooks are defined; otherwise it detaches the instance from its AutoScaling group, which also detaches it from the load balancer as early as possible.
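
Expressed with the AWS SDK for Go, the choice between those two draining actions could look like this sketch (the function and its parameters are illustrative):

```go
package sketch

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

// drain reacts to an imminent spot termination for the given instance.
func drain(svc *autoscaling.AutoScaling, group, instanceID string,
	hasLifecycleHooks bool) error {

	if hasLifecycleHooks {
		// Terminate through AutoScaling so the group's termination
		// lifecycle hooks get executed.
		_, err := svc.TerminateInstanceInAutoScalingGroup(
			&autoscaling.TerminateInstanceInAutoScalingGroupInput{
				InstanceId:                     aws.String(instanceID),
				ShouldDecrementDesiredCapacity: aws.Bool(false),
			})
		return err
	}

	// No hooks: detach the instance, which also removes it from the
	// group's load balancer as early as possible.
	_, err := svc.DetachInstances(&autoscaling.DetachInstancesInput{
		AutoScalingGroupName:           aws.String(group),
		InstanceIds:                    []*string{aws.String(instanceID)},
		ShouldDecrementDesiredCapacity: aws.Bool(false),
	})
	return err
}
```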

Running example

Workflow

In this case the initial instance type was quite expensive, so the algorithm chose a different type that had more computing capacity. In the end the group had 3x more CPU cores and 66% more RAM than in its initial state, all with 33% cost savings and without running entirely on spot instances, since some users find that a bit risky.

Nevertheless, AutoSpotting tends to be quite reliable even on all-spot configurations (has automated failover to on-demand nodes and spreads over multiple price zones), where it can often achieve savings up to 90% off the usual on-demand prices, much like in the 85% price reduction shown below. This was seen on a group of two m3.medium instances running in eu-west-1:

Savings Graph

Best Practices

These recommendations apply for most cloud environments, but they become especially important when using more volatile spot instances.

  • Set a non-zero grace period on the AutoScaling group

    • in order to attach spot instances only after they are fully configured (see the sketch after this list).
    • otherwise they may be attached prematurely before being ready.
    • they may also be terminated after failing load balancer health checks.
  • Check your instance storage and block device mapping configuration

    • this may become an issue if you use instances that have ephemeral instance storage, which is often the case on previous-generation instance types.
    • you should only specify ephemeral instance store volumes in the on-demand launch configuration if you actually make use of them by mounting them on the filesystem.
    • the replacement algorithm tries to give you instances with at least as much instance storage as your original instances, since it can't tell whether you mounted it.
    • this adds more constraints on the algorithm, so it reduces the number of compatible instance types it can use for launching spot instances.
    • this is fine if you actually use that instance storage, but it needlessly reduces your options if you don't, so the algorithm may more often fail to get spot instances and fall back to on-demand capacity.
  • Don't keep state on instances

    • You should delegate all your state to external services, AWS has a wide offering of stateful services which allow your instances to become stateless.
      • Databases: RDS, DynamoDB
      • Caches: ElastiCache
      • Storage: S3, EFS
      • Queues: SQS
    • Don't attach EBS volumes to individual instances, try to use EFS instead.
  • Handle the spot instance termination signal

    • See the next section for more detailed instructions.
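
As referenced in the first item above, the grace period can be adjusted with a one-off call like the following sketch; the group name and the 300-second value are placeholders:

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

func main() {
	svc := autoscaling.New(session.Must(session.NewSession()))

	// Give freshly attached instances five minutes to finish booting and
	// configuring themselves before health checks start counting.
	_, err := svc.UpdateAutoScalingGroup(&autoscaling.UpdateAutoScalingGroupInput{
		AutoScalingGroupName:   aws.String("my-web-asg"), // placeholder
		HealthCheckGracePeriod: aws.Int64(300),
	})
	if err != nil {
		log.Fatal(err)
	}
}
```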

Spot termination notifications

EC2 Metadata

AWS notifies your spot instances when they are about to be terminated by setting a dedicated metadata field, so you can make use of that information to save whatever temporary state you may still have on your running spot instances or to gracefully remove them from the group.

This information is only visible from within your instances, so AutoSpotting won't have any visibility on it to take any action.

Fortunately, there are existing third party tools such as seespot which you can run yourself, implementing such a termination notification handler.

This will need to be integrated into your user_data script; for more details, see the seespot tool's documentation.
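
At its core, such a handler is a small polling loop against the instance metadata service; the following is a bare-bones sketch of what tools like seespot implement more completely:

```go
package main

import (
	"log"
	"net/http"
	"time"
)

func main() {
	// The field returns HTTP 200 only once termination has been scheduled,
	// roughly two minutes in advance; until then it is a 404.
	const url = "http://169.254.169.254/latest/meta-data/spot/termination-time"

	for range time.Tick(5 * time.Second) {
		resp, err := http.Get(url)
		if err != nil {
			continue
		}
		resp.Body.Close()
		if resp.StatusCode == http.StatusOK {
			log.Println("spot termination imminent, draining...")
			// save state, deregister from the load balancer, etc.
			return
		}
	}
}
```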

Pros

  • you have full control over what the instance can execute before being terminated

Cons

  • requires some configuration changes on all your instances

CloudWatch events

In addition, AWS also generates CloudWatch events for these termination notifications. AutoSpotting automatically intercepts these events and proactively takes some draining actions immediately.

These actions consist of executing the termination lifecycle hooks, if present, or alternatively detaching the soon-to-be-terminated instances from their AutoScaling group, which in turn detaches them from the load balancer configured on the group. This should be relatively graceful if you use connection draining on the load balancer.
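
A sketch of how a Lambda handler can match and unpack these events using the aws-lambda-go event types; drainInstance is a hypothetical helper standing in for the actions described above:

```go
package sketch

import (
	"encoding/json"

	"github.com/aws/aws-lambda-go/events"
)

// spotInterruptionDetail mirrors the detail payload of the
// "EC2 Spot Instance Interruption Warning" CloudWatch event.
type spotInterruptionDetail struct {
	InstanceID     string `json:"instance-id"`
	InstanceAction string `json:"instance-action"`
}

func handle(evt events.CloudWatchEvent) error {
	if evt.DetailType != "EC2 Spot Instance Interruption Warning" {
		return nil // not a spot termination notification
	}

	var d spotInterruptionDetail
	if err := json.Unmarshal(evt.Detail, &d); err != nil {
		return err
	}

	// Run the lifecycle hooks or detach the instance, as described above.
	return drainInstance(d.InstanceID)
}

// drainInstance is a placeholder for the lifecycle-hook / detach logic
// sketched earlier in this document.
func drainInstance(id string) error { return nil }
```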

Pros

  • doesn't require any configuration changes
  • instances behind ELBs are detached automatically (or start to be drained) as soon as the imminent spot termination event is received.
  • if you already have lifecycle hooks they will be executed, but in this case we can't detach the instances, so you may need to do this from within the lifecycle hook logic.
  • this action can also be overridden on a per group basis using tags, if you need to.

Cons

  • Less flexible: you will need to define lifecycle hooks if you need to run some complex logic when terminating the instances.
  • Currently only supported when using the CloudFormation installation method.

Instances behind an ELB

Instances behind an ELB can be gracefully removed from the load balancer without losing connections. You should enable the connection draining feature.

As mentioned above, AutoSpotting will automatically detach them from the load balancer unless you have termination lifecycle hooks configured on your AutoScaling group. Note: this is currently only supported when AutoSpotting is installed using CloudFormation.
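
Connection draining can be enabled on a classic ELB with a call like the following sketch; the load balancer name and the 300-second timeout are placeholders:

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/elb"
)

func main() {
	svc := elb.New(session.Must(session.NewSession()))

	// Drain in-flight connections for up to 300 seconds before an
	// instance is removed from the load balancer.
	_, err := svc.ModifyLoadBalancerAttributes(&elb.ModifyLoadBalancerAttributesInput{
		LoadBalancerName: aws.String("my-load-balancer"), // placeholder
		LoadBalancerAttributes: &elb.LoadBalancerAttributes{
			ConnectionDraining: &elb.ConnectionDraining{
				Enabled: aws.Bool(true),
				Timeout: aws.Int64(300),
			},
		},
	})
	if err != nil {
		log.Fatal(err)
	}
}
```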

ECS container hosts

The container hosts can be drained in a similar way, by migrating all the Docker containers to the other hosts from your cluster before the spot instance is terminated. This blog post explains it in great detail, until AWS hopefully implements this out of the box.
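
The draining itself boils down to a call like the following sketch, which asks ECS to reschedule tasks away from the host; the cluster name and container instance ARN are placeholders:

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ecs"
)

func main() {
	svc := ecs.New(session.Must(session.NewSession()))

	// Setting the state to DRAINING makes ECS stop placing new tasks on
	// the host and reschedule its existing service tasks elsewhere.
	_, err := svc.UpdateContainerInstancesState(&ecs.UpdateContainerInstancesStateInput{
		Cluster: aws.String("my-cluster"), // placeholder
		ContainerInstances: []*string{
			// placeholder ARN
			aws.String("arn:aws:ecs:us-east-1:123456789012:container-instance/abc"),
		},
		Status: aws.String("DRAINING"),
	})
	if err != nil {
		log.Fatal(err)
	}
}
```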