Add ability to prioritize GitHub Action runners #1665
Comments
Hi @gajus, |
@ruvceskistefan Do you have any update on this? This severely impacts our ability to scale GitHub Runners, costing literally tens of thousands monthly. |
Hi @gajus - can you tell me more about this? It sounds like you're rolling your own autoscaling solution? Does the ephemeral runner option not give you enough control here? |
The original issue text already includes a description of how we orchestrate runners. Whether we use the ephemeral option or not, the problem is that there is no way to prioritize which runners will be picked up first. This means that if we have 100 idle runners and we have 20 jobs, then we have no way to say which 20 idle runners should be used first. Without a weight or similar attribute for sorting, a large number of idle runners is just sitting waiting for jobs because they keep getting random jobs (which they would otherwise not get if there was an order assigned). |
What I don't understand yet is whether these 20 runners in your example are different in some meaningful way that you really want to route to these 20? In other words, does it matter which 20 runners are getting jobs, or do you just want to scale down to any 20 runners? You mention "more efficient resource packing" but I don't feel like I have the full picture yet. |
Chiming in that we have a pool of machines that we use as runners, though some machines run significantly faster than others (reducing build time significantly). Ideally we want to be able to prioritize allocating jobs to the faster machines to reduce build times, but want to keep the slower machines active so they can pick up jobs while the faster ones are busy. Prioritization would be very beneficial here. |
Thanks @al2114 - are you running static runners? I can understand the need for some more advanced routing there, but I'm trying to better understand the need for weights when auto scaling, or when using some sort of control plane. |
No. All machines are identical. Here is a simple task. You have 100 machines. You have 20 jobs that start every minute and complete in a minute. What happens in the current setup? Every minute a random 20 machines will get picked from the pool. Why is that bad? Machines that have not been used for 10 minutes are automatically removed from the pool. If resources are randomly assigned, then machines that otherwise would not need to have been used are being used. Therefore, you will always have 100 machines running even though 20 would suffice. What's desired? A way to prioritize which machines should get picked first. This way, the oldest machines (as an example) will always get used first and the rest will soon timeout and disconnect from the pool. |
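The scenario in the comment above can be tried out as a toy simulation. This is only a sketch under stated assumptions: the function name `simulate`, its parameters, and the "always pick the lowest-numbered machines" policy are illustrative and are not part of the runner or of any GitHub API. In this simplified model the random policy tends to keep more machines alive than the 20 that are strictly needed, while the prioritized policy lets the rest hit the idle timeout.

```python
import random


def simulate(num_machines=100, jobs_per_minute=20, idle_timeout=10,
             minutes=60, prioritized=False, seed=0):
    """Return how many machines are still in the pool after `minutes`."""
    rng = random.Random(seed)
    # idle[m] = number of consecutive minutes machine m has gone without a job
    idle = {m: 0 for m in range(num_machines)}
    for _ in range(minutes):
        pool = sorted(idle)
        k = min(jobs_per_minute, len(pool))
        if prioritized:
            # Hypothetical priority rule: always use the lowest-numbered machines.
            chosen = set(pool[:k])
        else:
            # Stand-in for "whichever runner happens to get the job".
            chosen = set(rng.sample(pool, k))
        for m in pool:
            idle[m] = 0 if m in chosen else idle[m] + 1
        # Machines that reached the idle timeout leave the pool.
        idle = {m: t for m, t in idle.items() if t < idle_timeout}
    return len(idle)


if __name__ == "__main__":
    print("random:     ", simulate(prioritized=False))
    print("prioritized:", simulate(prioritized=True))
```

With the prioritized policy the pool settles at exactly the number of machines needed per minute; how far the random policy stays above that depends on the pool size, the timeout, and the seed.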
your autoscaling solution is probably a bit too naive. your assumption that the jobs are assigned is also wrong (I'm pretty sure). jobs are pulled by the runners, and not pushed to them. each runner periodically queries to see if any work is available. the first runner to pull that info after it is available gets it, and does the work. in order for a priority system to be implemented, all runners you host would need to talk to each other to know who should poll next. the GitHub Actions Runner system, in its entirety, is making the assumption that the runner virtual machines are used a single time, then destroyed and replaced with a fresh VM when needed. solutions which do not keep that assumption in mind are going to have a difficult time adjusting to how GHA works. I wrote an orchestrator in Go which uses |
I would like something kind of relevant to this. I would like to be able to give the runners the ability to opt out of polling, based on some health check. In my instance, I have a few VMs, each with a few runners running, and I want a runner to be able to recognize that there is, let's say, 95% memory usage, and not pick up a job. This will allow a runner on a less congested VM to pick it up. Right now, sometimes a congested VM will still pick up the job, and then OOM. This would not require runners communicating with each other, but basically just polling some endpoint, either through something like curl or some file/socket, and if the number is 1 then pick up the job, if 0 then don't. Would be super helpful in distributing jobs. |
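A minimal sketch of the 1/0 health signal described above, assuming a Linux host where memory figures come from `/proc/meminfo`-style text. The runner has no such hook today; `memory_used_fraction`, `health_signal`, and the 0.95 threshold are hypothetical names and values used only for illustration.

```python
def memory_used_fraction(meminfo_text):
    """Parse /proc/meminfo-style text and return the fraction of memory in use."""
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            values[key.strip()] = int(parts[0])  # first field is the number (kB)
    total = values["MemTotal"]
    available = values["MemAvailable"]
    return (total - available) / total


def health_signal(meminfo_text, threshold=0.95):
    """Return 1 if the runner may pick up a job, 0 if it should skip this poll."""
    return 1 if memory_used_fraction(meminfo_text) < threshold else 0


if __name__ == "__main__":
    with open("/proc/meminfo") as f:  # Linux only
        print(health_signal(f.read()))
```

The same function could sit behind a small file or socket endpoint; a runner-side wrapper would then only need to read a single digit before deciding whether to poll.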
As another example for this, we have M1 and Intel self-hosted Mac runners. The M1s are so much faster that we'd love a way to give them priority over the Intel runners and only send jobs to the Intel runners if all the M1 runners are busy. The weight solution would work but really anything that allows us to set precedence would be great. |
I am also interested in this. This is my generic question which led me here: https://github.com/orgs/community/discussions/30693 This issue seems like a great place to add more context and make it specific so that you can better determine if this is a legit Our Test Universe workflow is highly variable, but it usually takes When we parallelise, it completes in less than If our own self-hosted GitHub runners are not available (busy, offline, etc.), free GitHub Runners should pick up those jobs. Currently, if we were to use I have two questions:
Thank you! |
The new larger GitHub Actions hosted runners make my previous comment a non-issue. This new feature has already made a huge positive difference for us: dagger/dagger#3277 (comment). Great job everyone! 🤘 |
Interested in this feature, too! Maybe some changes would also be needed in actions-runner-controller, like a scale-down hook or relocating runner pods. But I'm curious what changes this feature would bring. 🙂 Thanks to every contributor. 🤘 |
I'm also interested in this. We are using self-hosted runners to provide different testing hardware environments and therefore have some runners with few labels and some runners with many labels. Our issue is that it happens quite often that jobs with fewer labels get picked up by runners with many labels and therefore the jobs that need more labels have to be queued. |
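The problem above can be sketched with plain set logic: a job is eligible for a runner when the job's labels are a subset of the runner's labels, and a "least specific match" rule would hand the job to the eligible runner with the fewest surplus labels. This is not how GitHub assigns jobs today; `eligible`, `pick_least_specific`, and the runner names are hypothetical and exist only to illustrate the desired behaviour.

```python
def eligible(job_labels, runner_labels):
    """A runner can take a job only if it carries every label the job asks for."""
    return set(job_labels) <= set(runner_labels)


def pick_least_specific(job_labels, idle_runners):
    """Pick the eligible idle runner with the fewest labels the job did not ask for.

    idle_runners maps a runner name to its list of labels.
    Returns the runner name, or None if nothing matches.
    """
    wanted = set(job_labels)
    candidates = [
        (len(set(labels) - wanted), name)  # (surplus label count, tie-break by name)
        for name, labels in idle_runners.items()
        if eligible(wanted, labels)
    ]
    return min(candidates)[1] if candidates else None


if __name__ == "__main__":
    runners = {
        "plain": ["self-hosted", "linux"],
        "gpu-box": ["self-hosted", "linux", "gpu"],
    }
    print(pick_least_specific(["self-hosted", "linux"], runners))
    print(pick_least_specific(["self-hosted", "linux", "gpu"], runners))
```

Under this rule a job that only asks for `self-hosted, linux` leaves the GPU machine free for the job that actually needs it.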
I also support this use case. With a big heterogeneous runner pool (100+ runners with 2-64 core CPUs) with lots of label variations and attached HW it is very hard to utilize all HW efficiently both in high-load and low-load scenarios. Ideally, the load balancer should use the history of jobs and the history of runners to do dynamic scheduling. Doing hard-coded weighting, as proposed here, is going to be hard to do correctly at scale, and to maintain when jobs change characteristics or when new types of runners are added. For this solution to work, I think at least it must be exposed in an API so that it is possible to re-weight all runners programmatically on a schedule. Still, it would be a more impactful feature if GitHub could schedule for us automatically. |
Adding this enhancement would make a lot of sense and help users a lot. For example, we could prioritize the runners with lower latency first (we can label them based on the location of those VMs) and treat the other runners (VMs placed elsewhere) as second priority. |
Adding my 2 cents here. It would be great if we could remove the default labels ( For example, if you have a set of runners with custom labels like:
And another with:
And the workflow author defines just Runner groups are only available to enterprise users. Removing the default labels would be useful even for single repos with more collaborators and free tier orgs that a lot of open source projects use. Allowing us to remove the default labels will give us the ability to define unique label sets and thus schedule jobs more efficiently. It also allows us to better react to Hoping this makes sense 😄 . |
An unsupported way to remove the default labels is to delete them from the configuration function. runner/src/Runner.Listener/Configuration/ConfigurationManager.cs Lines 531 to 533 in 982784d
After you have deleted these 3 lines, compile actions/runner and use it to configure all your runners. Last time this worked just fine as long as you provided your own labels. You don't have to worry about auto updates as long as your runner is already configured: your label change has been stored online and won't change. |
Yup. I wanted to create a PR that adds a |
if a workflow author is not specific with their requirements via the labels, that is on them, in my mind. I would set up a webhook which sends all workflow runs to a tool which reads them and files a new issue on every repo which runs actions that only specify our guidelines are to always specify the OS and CPU architecture in labels at a bare minimum. |
Yes, it is on them, but in the meantime, they may end up needlessly consuming instances that are more expensive/scarce (like GPU enabled instances). It also makes it difficult to spin up the right instance types, on the right hierarchy level (repo vs org vs enterprise), on-demand. In any case, I opened a PR here: #2443 It makes the default labels optional (by default they are added), while still ensuring at least one label is added to the runner. It feels like better UX to add only the labels you want. If that includes the default labels, great. If not, also great. |
any news on the prioritization? seems like a core feature that's missing. |
yes, the PR #2443 is a good one. But prioritization is definitely a much-needed feature. A workflow author should be able to say "Use runners with these labels if they are available, if they are not available, use runners with another label". Currently that's not possible. As 'xucian' mentioned in a previous comment, either we have to wait for the high-resource runners without utilizing the priority 2 (low-resource) runners if we set this strictly to match one of them, or, if we set the labels to match both of them, we have to live with the compromise of not utilizing the high-resource runner 50% of the time, even though it might be available. |
I think it would be even more useful to define required labels when configuring/registering a new one, e.g.
|
good idea, finer-grained (optional) control is always welcome. before that, we can just have the positions of the labels to implicitly denote priority. I think 99% of the limitations will be solved this way |
As you already know, we use Dagger for CI/CD. By default, this runs on Fly.io (via Docker). In some cases, this can fail. The last failure was when DNS resolution stopped working after the Docker instance was auto-upgraded from apps v1 -> v2 (a.k.a. Fly.io machines), e.g. https://github.com/thechangelog/changelog.com/actions/runs/5673476702/attempts/1
As a temporary fix, we had to delete some secrets and re-run the job. The job ran on GHA free runners & failed for genuine reasons 6 mins later: https://github.com/thechangelog/changelog.com/actions/runs/5673476702/job/15395264391
While running on the free GHA runners can be 3x-8x slower, it's a good fall-back. You heard us mention on multiple occasions: "always have redundancies in place".
Since we already have multiple CI runtimes in place (Fly.io. K8s), let's make our GHA workflow resilient by:
- Run on our preferred back-end by default (Dagger on Fly.io)
- ✅ If it succeeds, we are done
- ❌ If it fails, fallback to running on the free GitHub runners
- In forks, use free GitHub runners by default (we cannot share `secrets`)
While this means that a workflow which fails for genuine reasons will fail twice for us (1. Dagger on Fly.io, 2. Dagger on GitHub), it seems like a better place to improve from.
This change goes one step further. We are using a third back-end: Dagger on K8s. This uses a self-hosted GitHub runner on K8s which is already integrated with Dagger. For now, we are using it just to see how the CI part compares to our primary setup (Dagger on Fly.io). We are not using Dagger on K8s to deploy the app. Let's see how this setup behaves over a few weeks/months before we consider taking it further.
Part of this, we also improved on how we check for Fly.io connectivity.
Things that could be improved in follow-ups:
- the workflow should succeed if the `dagger-on-github-fallback` job succeeds
- currently it fails if `dagger-on-fly-docker` fails
- add `dagger-on-k8s` job as secondary fallback
- GitHub Actions is currently missing actions/runner#1665
- maybe leverage a Dagger cache that works in forks too 😉
- Run Dagger Engine as a Fly Machine (no more Docker)
- thechangelog#471
Signed-off-by: Gerhard Lazu <[email protected]>
Any update on this ticket? I want to prioritize M2 Pro machines instead of M1 machines (for example). |
if there were updates on this issue you would see updates right here. |
I'm in the same boat. I'd like to prioritise certain servers over others as build times can vary as much as 5x depending on the server. |
The only way to do this currently is to label your larger runners with different labels than your smaller runners. And once you do this, your users will discover a new way to make you lose the fight, and everyone will choose the larger runners because they're faster. |
I really need this feature. |
Agree, that's not even a solution, just a rabbit hole. And, it's not prioritization, just hard coding and limiting the chance of saturation of runner resources :( One more viable solution is to implement a daemon to dynamically add and remove labels based on queue size for various labels and runner capabilities. Another is to use Kubernetes or similar to spin up and down runners on demand based on similar rules. I have not yet tried implementing any of those as cost/value is not there yet for our case, it is not trivial to implement. In the meantime, I'm crossing my fingers (and paying the HW cost of unused/redundant resources) and hoping GitHub implements something usable before we hit the wall. Another notable hack is to put some runners on the organization level and some on the repository level, as these have different priorities. However, it is a very limited hack with only one axis and only two levels of "prioritization". I'm really interested to hear if anyone implemented a real solution to this problem, and they want to share it here. |
GitHub have a solution for this called Actions-Runner-Controller. Apparently it works quite well. We're testing it at my employer. |
this seems to be more about scaling similar machines (which I think Kubernetes is generally good at), whereas prioritizing is a bit different. for ex., I have 3 windows runners, 2 mac runners, 6 linux runners, each with different specs, and I want to have them all running at the same time, and assign priorities at the job level (some are expected to eat more ram than others, but I still want to run them if a less-preferred runner is active. idling is the worst and sometimes causes CI/CD deadlocks) |
This would be fantastic |
We had this issue and just realised that something that works for us is to use a single space-separated label instead of a YAML list:
runs-on: python highmem gpu
or equivalently
runs-on: >-
  python
  highmem
  gpu
It works for us because we can parse these in any order and launch JIT runners, but maybe this could be part of a solution for others. |
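A small sketch of the "parse these in any order" step, assuming the launcher receives the `runs-on` value as one string. The helper names `capabilities` and `canonical_label` are hypothetical; the commenter's actual tooling is not shown in the thread.

```python
def capabilities(runs_on):
    """Split a space-separated runs-on label into an order-independent set."""
    return frozenset(runs_on.split())


def canonical_label(runs_on):
    """Return one normalized label string, so equal capability sets compare equal."""
    return " ".join(sorted(capabilities(runs_on)))


if __name__ == "__main__":
    print(canonical_label("python highmem gpu"))
    print(canonical_label("gpu  python highmem"))
```

A launcher built this way could key its runner templates on the capability set rather than on the exact string a workflow author typed.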
can you expand on this? is this making github look for "python highmem gpu" (and thus hang), and meanwhile you somehow intercept this and launch runners on-demand? thanks! |
|
I am really sorry for the length of this post.
You can do this with an "autoscaler" (there are several out there - ARC, for example. We also wrote one, but I don't want to do any shameless plugs). It's fairly easy to roll your own if you don't need anything too generic, by relying on github webhooks to let you know when a new job is queued.
Workflow job payloads that you get from github have a bunch of fields set and a few headers of interest. In the header you have the signature of the payload in case your webhook uses a secret (which it should), and the entity that the webhook is meant for (repo, organization or enterprise). In the payload itself, you get the repo that the workflow originated from, and details about the workflow that triggered it. As part of the payload you also get the labels that were set in
Now, a few things to know about how runners pick up jobs: Runners can be registered at the The The runners from all hierarchy levels can have the exact same labels set, and in most cases, the default labels ( You can't predict which runner will pick up the job. And this is the challenge when dealing with workflows that use label sets that match multiple types of runners an entity might have.
To get the most predictable results, workflow authors should always use unique and non ambiguous label sets to target runners. For example, let's say you have the following sets of runners:
enterprise:
  labels: [enterprise, linux, self-hosted, gpu]
org:
  labels: [org, linux, self-hosted, sr-iov]
repo:
  labels: [repo, linux, self-hosted, gpu]
If you use only You can even have multiple types of runners at the same hierarchy level. So you can have two types of runners at the repo, both of which have the The point is to use a set of labels that uniquely identifies the runners you're interested in.
What @lordmauve suggests is to create a label that is a space delimited string. In effect, that is only one label, but it's unique to only that set of runners. If you register your runners with that label and craft your workflow to target that unique label, you should get a predictable result every time. This is not a bad approach, but depending on what you want to do, it may not be practical. You may want to be able to target runners at both the repo level and the org level. Or you may have different types of runners that share a set of labels and you'd be ok with either type picking up the job.
The problem here is that we can't set a priority on which runners pick up the job. The only control any autoscaler has is what type of runner it spins up when a new job comes in. But this can never be 100% reliable. For example, let's say we have an org with 2 repos. Each repo has a workflow that uses Now in this case, you might be inclined to say: "okay, but you can use a runner group to limit repos that can use GPUs". Yes, you could do that, but there are 2 potential issues with that approach:
This makes the job of the autoscaler almost impossible. If you only have the So, currently, the best way to make sure your jobs run on the runners you want, is to use unique sets of labels to register your runners and inside your workflows. The problem with this approach is that in large organizations with many teams, in most cases, individual teams won't use anything more than Ideally, we could have a set of "rules" or "filters" that we can set up in github, which will dictate to which runner a job will be routed. But we don't have that now, and I am not sure if it's something that the amazing folks at GitHub want to have as a feature. In the meantime, a unique set of labels should do. |
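The webhook handling described in this comment can be sketched roughly as follows. It assumes the webhook was created with a secret, that the signature arrives in the `X-Hub-Signature-256` header as `sha256=<hex digest>`, and that the body is a `workflow_job` event whose `workflow_job.labels` field carries the requested labels. The function names are illustrative, and the HTTP server around them is left out.

```python
import hashlib
import hmac
import json


def verify_signature(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Check the HMAC-SHA256 signature GitHub sends with a webhook delivery."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking timing information about the expected value
    return hmac.compare_digest(expected, signature_header)


def queued_job_labels(body: bytes):
    """Return the labels requested by a newly queued job, or None for other actions."""
    event = json.loads(body)
    if event.get("action") != "queued":
        return None
    return event["workflow_job"]["labels"]


if __name__ == "__main__":
    secret = b"example-secret"  # placeholder, not a real secret
    body = json.dumps(
        {"action": "queued", "workflow_job": {"labels": ["self-hosted", "linux", "gpu"]}}
    ).encode()
    header = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    if verify_signature(secret, body, header):
        print(queued_job_labels(body))
```

An autoscaler would then decide what kind of runner to start from the returned labels, which is exactly the step that becomes unreliable when several runner types share the same label set.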
I wrote something like this. It listened to webhooks and spawned containers whose runners were configured to match the labels requested. If a workflow asked for a larger runner, I created a larger container (more than default CPU and RAM) and registered it. My problem was that I have enough simultaneous jobs being run at any moment that less specific jobs would be taken by the more specific runners, then the more specific job would be left high and dry, because the runner I spawned for it was taken by some job that only asked for 'Linux', for example. It was murder to communicate to all 12k developers on our GHES instance about how to use labels, and we eventually gave up. There is currently no way to guarantee that a specific runner takes a specific job and that is a minor source of annoyance. All I can do is have so many runners of each specific configuration ready to go at all times. |
Can we please get this feature? |
I would like this feature too. Some runners just don't have enough CPU power to be as quick as others due to cost limitations. The preferred runner would make the best of this situation by prioritizing high CPU machines when they are readily available. |
We faced a similar issue and found a workaround. In my experience, the runner selection mechanism of GitHub Actions is not entirely random. It tends to pick runners alphabetically. 😄 We have M1 and M2 machines, which we named To address this, we changed the names to This might be a naive workaround, but it solved our issue. |
Describe the enhancement
Self-hosted GitHub Actions runners should have an attribute (weight) that allows prioritizing them, i.e. if there are multiple idle runners with matching labels, then the weight attribute would determine which runner to use first, e.g. prioritized in ascending order.
Additional information
For context, the reason this is needed is that the current implementation randomly picks an available runner. However, imagine that you are scaling runners up and down depending on how long they have been idle. With a random allocation mechanism, there is no way to determine (efficiently) how long a runner has not been in use. As a result, we have a large portion of VMs running that are not in use most of the time.
Prioritization would allow more efficient resource packing.
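As a rough illustration of the proposal, the selection rule could look like the sketch below: among idle runners whose labels cover the job's labels, take the one with the lowest weight. The `Runner` type, the `weight` field, and `pick_runner` are hypothetical; no such attribute exists in the runner today.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Runner:
    name: str
    labels: frozenset
    weight: int          # hypothetical attribute: lower weight is preferred
    busy: bool = False


def pick_runner(job_labels, runners):
    """Return the idle, label-matching runner with the lowest weight, or None."""
    wanted = frozenset(job_labels)
    candidates = [r for r in runners if not r.busy and wanted <= r.labels]
    # Ascending weight; the name breaks ties so the choice is deterministic.
    return min(candidates, key=lambda r: (r.weight, r.name), default=None)


if __name__ == "__main__":
    pool = [
        Runner("intel-1", frozenset({"self-hosted", "macos"}), weight=2),
        Runner("m1-1", frozenset({"self-hosted", "macos"}), weight=1),
    ]
    print(pick_runner(["self-hosted", "macos"], pool).name)
```

Because the order is deterministic, the least-preferred runners stay idle whenever load is low, which is what lets an idle-timeout policy scale them down.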