Inconsistent scheduling & speed of compaction tasks on EC2 #3329

Open
patchwork01 opened this issue Sep 20, 2024 · 3 comments

patchwork01 commented Sep 20, 2024

Description

In the system test CompactionPerformanceST, we've noticed that sometimes compaction tasks are scheduled quite inconsistently between the EC2s that are created, and when several tasks are scheduled to the same one they run much slower than expected. More EC2s seem to be created than we expect, and some EC2s are set up to run just a single compaction task.

The test repeatedly runs the compaction task creator until it finds 10 tasks are up, then creates 10 compaction jobs to run on them.

It looks like the compaction task creator lambda fails halfway through a number of times, and it runs about 9 times overall. It starts off thinking it needs 1 task per instance, then changes its mind after a few failures trying to actually create the tasks, and for the rest of the lambda invocations it decides it should have 4 containers per instance instead.

In one example of the test, we ended up with 6 compaction jobs each on their own dedicated EC2, and 4 compaction jobs all on the same EC2. The 4 that ran on the same instance ran at about 180,000 records per second, whereas the other 6 ran at about 300,000 records per second. The EC2 that had 4 compaction jobs ran at over 99.8% CPU utilization the whole time, where the other 6 EC2s ran at just over 25%.

Steps to reproduce

  1. Run CompactionPerformanceST
  2. Check compaction jobs report at the end
  3. See some tasks & jobs run much slower than others
  4. Find EC2 instance IDs in the compaction task logs in CloudWatch
  5. See that if an EC2 had 4 tasks, the tasks & jobs ran a lot slower than ones on other EC2s
  6. Check CPU utilization metric for EC2s in CloudWatch
  7. Compare utilization between EC2s with different numbers of tasks

Expected behaviour

The compaction task starter could wait for the autoscaler to apply the desired number of instances before it starts trying to run tasks. It's not clear how worthwhile this is though, as it will run again in a minute anyway.

The compaction task starter should choose the same number of containers per instance given the same settings. This can be calculated from the instance details as it is now, but it should be consistent. It might also be worth configuring the number of containers per instance directly in an instance property.

We need to decide how much of a problem it is that we see this level of performance degradation at full utilization. If 3 tasks on one EC2 maintain the 300,000 records per second, that would be 900,000/s for the whole EC2, compared to the 720,000/s we saw for 4 tasks on one EC2. If we make this more configurable this may be less of a problem. It might be worth noting in the documentation or the property descriptions that high utilization can result in this degradation, at least if we can confirm that is the cause.

Since the CPU is the bottleneck, we could also consider moving to CPU-optimised EC2 instances. At time of writing it's running on t3.xlarge.

patchwork01 added the bug label Sep 20, 2024
patchwork01 added this to the 0.26.0 milestone Sep 20, 2024

m09526 commented Oct 3, 2024

Just read through all the involved code and the very detailed (thank you!) description above. There are a few things to unpack here.

The three classes we need to care about are:

sleeper.compaction.task.creation.RunCompactionTasksLambda
sleeper.task.common.RunCompactionTasks 
sleeper.task.common.EC2Scaler

Short and not entirely helpful answer to this issue

  • Code is working entirely as intended. No bugs found. Everything is behaving as designed.

More detailed answer

The scaler was designed (based on several earlier iterations and previous experience) around the idea that it's cheaper, both in code maintenance and $ cost, to do the simple action and fail sometimes rather than strive to do the optimal action all of the time and avoid failure at all costs.

It doesn't know how many containers per instance it can fit until some instances have been created. When there aren't any instances, it therefore makes the safe assumption of 1 container per instance: See here

The EC2Scaler algorithm (sketched in code after this list) is:

  1. How many containers per instance? No instances to calculate from? Assume 1.
  2. Calculate how many instances it needs.
  3. Tell EC2 Auto Scaler to scale to that size, whether lower or higher than current. *
  4. Launch any required tasks.
  5. If some launches from step 4 fail, don't worry; we will run again in 1 minute.

* If the scaler tries to scale down then SafeTerminationLambda prevents terminating instances with running containers.
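
A minimal sketch of that loop follows. The method names are hypothetical stand-ins for the real RunCompactionTasks/EC2Scaler code and its AWS calls, not the actual Sleeper API:

```java
import java.util.OptionalInt;

// Minimal sketch of the scaling loop described above. Method names and the
// AWS/ECS calls they stand in for are hypothetical, not the real Sleeper API.
public class CompactionScalerSketch {

    public void run(int tasksNeeded) {
        // 1. Containers per instance: no instances to read from yet? Assume 1.
        int containersPerInstance = readContainersPerInstanceFromRunningInstances().orElse(1);

        // 2. Work out how many instances that needs, rounding up.
        int desiredInstances = (tasksNeeded + containersPerInstance - 1) / containersPerInstance;

        // 3. Ask the EC2 Auto Scaling group for that capacity, whether lower or higher
        //    than current (scale-down is guarded elsewhere by SafeTerminationLambda).
        setAutoScalingGroupDesiredCapacity(desiredInstances);

        // 4. Try to launch the tasks now. Some launches may fail if the new
        //    instances haven't been commissioned yet.
        int launched = launchTasks(tasksNeeded);

        // 5. Don't worry about failures: the lambda runs again in a minute and retries.
        if (launched < tasksNeeded) {
            // no-op; the next invocation picks up the remainder
        }
    }

    // Stand-ins for the real EC2/ECS interactions.
    private OptionalInt readContainersPerInstanceFromRunningInstances() {
        return OptionalInt.empty(); // pretend no instances are running yet
    }

    private void setAutoScalingGroupDesiredCapacity(int desired) {
    }

    private int launchTasks(int count) {
        return 0;
    }
}
```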

It looks like the compaction task creator lambda fails halfway through a number of times

Yep, this will happen at the beginning, while it is waiting for the EC2 AutoScaler to commission the extra instances.

It starts off thinking it needs 1 task per instance, then changes its mind after a few failures trying to actually create the tasks, and for the rest of the lambda invocations it decides it should have 4 containers per instance instead.

See here. The "few failures" happen while it waits for the AutoScaler to finish creating/commissioning the EC2s it previously asked for. Once that happens, we can interrogate an instance and get a proper idea of how many containers fit per instance. See here. That's why it "changes its mind": it suddenly has proper information to work with.

In one example of the test, we ended up with 6 compaction jobs each on their own dedicated EC2, and 4 compaction jobs all on the same EC2. The 4 that ran on the same instance ran at about 180,000 records per second, whereas the other 6 ran at about 300,000 records per second. The EC2 that had 4 compaction jobs ran at over 99.8% CPU utilization the whole time, where the other 6 EC2s ran at just over 25%.

It aims for high resource utilisation on the instances to minimise the number we need. We already add a 5% "hedge" on the memory usage (Explanation here) to prevent ECS complaining it can't fit containers in.
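
To make the packing concrete, here's an illustrative version of that calculation. All of the figures are hypothetical stand-ins for the values the scaler reads from running instances and the task definitions:

```java
// Illustrative only: deriving a containers-per-instance figure from instance
// resources and per-task reservations. The numbers below are hypothetical, and
// the real EC2Scaler additionally applies the 5% memory hedge described above.
public class ContainersPerInstanceSketch {
    public static void main(String[] args) {
        int instanceCpuUnits = 4096;     // e.g. 4 vCPUs reported by a running instance
        long instanceMemoryMiB = 16384;  // e.g. 16 GiB reported by a running instance
        int taskCpuUnits = 1024;         // CPU reservation per compaction task
        long taskMemoryMiB = 3584;       // memory reservation per compaction task (hypothetical)

        int byCpu = instanceCpuUnits / taskCpuUnits;
        int byMemory = (int) (instanceMemoryMiB / taskMemoryMiB);
        int containersPerInstance = Math.min(byCpu, byMemory);

        System.out.println("CPU allows " + byCpu + ", memory allows " + byMemory
                + " -> pack " + containersPerInstance + " containers per instance");
    }
}
```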

This does appear to be the cause of the issue; it should try to pack fewer containers onto an EC2.

Analysis

The compaction task starter could wait for the autoscaler to apply the desired number of instances before it starts trying to run tasks. It's not clear how worthwhile this is though, as it will run again in a minute anyway.

I definitely wouldn't do this. As you say, it will run again in a minute anyhow. But there's more to it than this. Options:

  1. Have the Lambda wait around for the AutoScaler to make the requested changes. BAD. It will likely take longer than a minute -> should the lambda give up waiting? What if another instance of the task creation lambda starts? How do we handle failure conditions, etc.? This adds significant complexity for very little benefit.
  2. If a scale up has been requested, don't attempt to launch tasks; just quit and let them launch next time. BAD. One minute later, containers still probably won't exist, and the lambda won't know whether it recently scaled up or not: it has no state, so what should it do? Also, you MAY have space for SOME of your containers. You want these to launch as soon as possible, not wait for all the needed space to be available.

This can be calculated from the instance details as it is now, but it should be consistent.

I considered making it try to look up the CPU and memory from AWS based on the instance type. Not sure why I decided against this now; just looking it up from running instances seemed better at the time. This might be a viable alternative if having an inconsistent containers-per-instance figure is seen as bad enough to fix, but I'm not convinced it is.

It might also be worth configuring the number of containers per instance directly in an instance property.

It could be worth allowing a configurable "default" number of containers per instance to be set in the instance properties. The risk of this is that you have two dependent properties: the EC2 instance type (COMPACTION_EC2_TYPE) and this new "containers per instance". Just changed one? Hope you didn't forget to change the other! I can almost see the bug report "I changed the compaction instance type to a [super high spec. machine] and Sleeper is STILL only scheduling 2 tasks per instance!"

How about a "minimum containers per instance" to replace step 1 in the algorithm above? If it can't read the actual number, it assumes this minimum rather than 1. Then you avoid the risk above.

We need to decide how much of a problem it is that we see this level of performance degradation at full utilization. If 3 tasks on one EC2 maintain the 300,000 records per second, that would be 900,000/s for the whole EC2, compared to the 720,000/s we saw for 4 tasks on one EC2. If we make this more configurable this may be less of a problem.

This points to needing to relax how the scaler computes the number of containers per instance. Rather than trying to cram on as many as we have the CPU/RAM for, we should add in some more headroom.

Recommendation

  1. Introduce a "minimum assumed containers per EC2 instance" property. Change EC2Scaler to assume this number if it can't get an actual reading from the cluster.
  2. Add more headroom to the scaler to prevent it over-provisioning containers to an instance (see the sketch below).
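
A rough sketch of how those two recommendations could fit together. The "minimum assumed containers per instance" property and the headroom factor are hypothetical, not existing Sleeper configuration:

```java
import java.util.OptionalInt;

// Hypothetical sketch of the two recommendations above; neither the minimum
// property nor the headroom factor exists in Sleeper today, they only
// illustrate the idea.
public class ScalerRecommendationSketch {

    private final int minimumAssumedContainersPerInstance; // would come from a new instance property
    private final double packingHeadroom;                   // e.g. 0.75 = use at most 75% of capacity

    public ScalerRecommendationSketch(int minimumAssumedContainersPerInstance, double packingHeadroom) {
        this.minimumAssumedContainersPerInstance = minimumAssumedContainersPerInstance;
        this.packingHeadroom = packingHeadroom;
    }

    public int containersPerInstance(OptionalInt measuredCapacity) {
        if (measuredCapacity.isEmpty()) {
            // Recommendation 1: no running instances to read from yet,
            // so assume the configured minimum rather than 1.
            return minimumAssumedContainersPerInstance;
        }
        // Recommendation 2: don't pack to 100% of the measured CPU/RAM capacity,
        // leave some headroom so tasks aren't starved of CPU.
        return Math.max(1, (int) Math.floor(measuredCapacity.getAsInt() * packingHeadroom));
    }
}
```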

gaffer01 removed the bug label Oct 4, 2024
patchwork01 commented

If we make the estimated containers per EC2 configurable, that would add an extra configuration property someone would need to find, but it would also be misleading since it wouldn't actually determine how many containers AWS will assign to each instance.

We have two separate problems:

  1. When we start to scale up from zero, it requests more instances than are needed.
  2. When we start tasks on the correct number of instances, AWS over provisions each instance, so that all vCPUs are running tasks and we see slowdown.

These are related, but I think we need to solve both. A few things to note:

  • When we request more instances than are needed at first, tasks are actually started on those instances, so we can only ever scale back down if those tasks terminate.
  • I don't think we've tested yet whether we see any slowdown running 3 tasks on 4 vCPUs. I'm not sure how we would test that as we are now, but it would be useful to have more control over this to be able to test.
  • We need a solution to avoid the capacity provider over provisioning each instance, not just to request the right number of instances from the auto scaler.

patchwork01 modified the milestones: 0.26.0, 0.27.0 Oct 11, 2024

m09526 commented Oct 14, 2024

If we use the AWS EC2 describe-instance-types API that you mentioned to find out, at runtime, the CPU/memory available on our chosen instance type, then we get a deterministic value for the number of containers per instance. This should stop it requesting more instances than needed, and also eliminates point 1 above. I think this should be done at runtime to account for the instance type being changed post deployment.
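
For reference, a sketch of that lookup using the AWS SDK for Java v2; reading the type from COMPACTION_EC2_TYPE and the surrounding error handling are assumed rather than shown:

```java
import software.amazon.awssdk.services.ec2.Ec2Client;
import software.amazon.awssdk.services.ec2.model.DescribeInstanceTypesRequest;
import software.amazon.awssdk.services.ec2.model.DescribeInstanceTypesResponse;
import software.amazon.awssdk.services.ec2.model.InstanceType;
import software.amazon.awssdk.services.ec2.model.InstanceTypeInfo;

// Sketch only: look up vCPU/memory for the configured instance type at runtime,
// so containers per instance can be computed before any EC2 instance exists.
public class InstanceTypeLookupSketch {

    public static void main(String[] args) {
        String configuredType = "t3.xlarge"; // in practice read from COMPACTION_EC2_TYPE

        try (Ec2Client ec2 = Ec2Client.create()) {
            DescribeInstanceTypesResponse response = ec2.describeInstanceTypes(
                    DescribeInstanceTypesRequest.builder()
                            .instanceTypes(InstanceType.fromValue(configuredType))
                            .build());
            InstanceTypeInfo info = response.instanceTypes().get(0);
            int vcpus = info.vCpuInfo().defaultVCpus();
            long memoryMiB = info.memoryInfo().sizeInMiB();
            System.out.println(configuredType + ": " + vcpus + " vCPUs, " + memoryMiB + " MiB");
        }
    }
}
```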

To prevent over-provisioning, I think increasing the memory requirement for each task is the easiest way to achieve this; CPU is too coarse. E.g. for a task that gets 1 vCPU and 4 GiB of memory, if we increase the requirement to 2 vCPUs, then we've halved the number of tasks that can be provisioned on a single instance, which is probably too conservative. Whereas if we increase the memory to 5 or 6 GiB, we get finer control to prevent over-provisioning.
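
A quick worked example of that finer control, using the t3.xlarge figures (4 vCPUs, 16 GiB) and ignoring any memory the ECS agent reserves for itself:

```java
// Rough packing arithmetic for a t3.xlarge (4 vCPUs, 16 GiB). This ignores any
// memory reserved by the ECS agent; it only illustrates the granularity argument.
public class OverProvisioningExample {
    public static void main(String[] args) {
        long vcpus = 4;
        long memoryMiB = 16 * 1024;

        // 1 vCPU, 4 GiB per task: both CPU and memory allow 4 -> every vCPU is busy.
        System.out.println(Math.min(vcpus / 1, memoryMiB / 4096)); // 4

        // Doubling the CPU reservation to 2 vCPUs halves the packing outright.
        System.out.println(Math.min(vcpus / 2, memoryMiB / 4096)); // 2

        // Raising memory to 5 GiB instead drops it to 3, leaving one vCPU free.
        System.out.println(Math.min(vcpus / 1, memoryMiB / 5120)); // 3
    }
}
```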

This looks like 2 PRs:

  1. Replace the describeInstances API call with one that uses describe-instance-types to look up at runtime what the number of instances should be.
  2. Increase memory requirement on tasks to reduce over-provisioning.
