Autoscaler #999
Comments
Some questions we have to answer in general:
|
I'm not sure how relevant this is, but BuildKite [1] has something similar for AWS, though it leverages external tooling to do so. The BuildKite service has an API that an AWS Lambda polls every minute, asking for pending jobs. If there are pending jobs, it increases the count on an autoscaling group. A new instance is fired up, the agent connects, and it picks up the pending job. That handles the autoscaling increment operation. The other part is that if an agent is idle for X minutes it shuts itself down, and as part of shutting down it decrements the autoscaling group. I think this design is pretty nice as it has the benefits of:
|
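A minimal sketch of that poll-and-scale loop, assuming a hypothetical `pendingJobs` helper that asks the CI server for its queue depth and an existing auto-scaling group named `ci-agents`; the AWS calls are from `aws-sdk-go-v2`, everything else is made up for illustration:

```go
// Poll the CI server for pending jobs once a minute and bump an AWS
// auto-scaling group by that amount (the scale-up half of the design above).
// Idle agents later shut down and decrement the group themselves.
package main

import (
	"context"
	"log"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/autoscaling"
)

const groupName = "ci-agents" // assumed auto-scaling group for the agents

// pendingJobs stands in for a call to the CI server's queue API.
func pendingJobs(ctx context.Context) (int32, error) {
	// ...query the server (e.g. an endpoint reporting queue depth) here...
	return 0, nil
}

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	asg := autoscaling.NewFromConfig(cfg)

	for range time.Tick(time.Minute) {
		pending, err := pendingJobs(ctx)
		if err != nil || pending == 0 {
			continue
		}

		// Read the current desired capacity and increase it by the
		// number of pending jobs.
		out, err := asg.DescribeAutoScalingGroups(ctx, &autoscaling.DescribeAutoScalingGroupsInput{
			AutoScalingGroupNames: []string{groupName},
		})
		if err != nil || len(out.AutoScalingGroups) == 0 {
			continue
		}
		current := aws.ToInt32(out.AutoScalingGroups[0].DesiredCapacity)

		_, err = asg.SetDesiredCapacity(ctx, &autoscaling.SetDesiredCapacityInput{
			AutoScalingGroupName: aws.String(groupName),
			DesiredCapacity:      aws.Int32(current + pending),
		})
		if err != nil {
			log.Println("scale up failed:", err)
		}
	}
}
```

The decrement half stays on the agents themselves: each one shuts down after being idle for X minutes and reduces the group's desired capacity on the way out, as described above.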
I'm not sure about a unified package, but I don't think wrapping the various Go packages from the different providers should be terribly difficult (edit: maybe using Terraform could help: https://github.com/hashicorp/go-tfe).
From the Woodpecker host it would need:
For setup configs it would need:
From the provider, it will need to poll/sync:
Hopefully this helps (I'm probably missing a lot of interfaces, but maybe we can update this as we learn more) |
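A rough sketch of the kind of provider abstraction described above: one small interface that each cloud provider's Go SDK gets wrapped behind. All names here are hypothetical, not an existing Woodpecker package.

```go
// Package provider sketches a common abstraction over cloud provider SDKs.
package provider

import "context"

// Agent describes a single cloud instance running a Woodpecker agent.
type Agent struct {
	ID     string
	Name   string
	Labels map[string]string
}

// Provider is implemented once per cloud (Hetzner, AWS, Linode, ...).
type Provider interface {
	// DeployAgent creates a new instance and boots an agent on it.
	DeployAgent(ctx context.Context, name string) (*Agent, error)
	// RemoveAgent tears the instance down again.
	RemoveAgent(ctx context.Context, agent *Agent) error
	// ListDeployedAgents polls the provider for the agents it has created.
	ListDeployedAgents(ctx context.Context) ([]*Agent, error)
}
```

Each concrete provider would then only have to implement these three methods, which keeps provider-specific dependencies out of the core.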
This should probably be created as a separate service/repo to make it easier to maintain and to avoid bringing too many dependencies into the core |
how to do it externally :) -> https://github.com/windsource/picus |
I would say we should wait for #1189 ... and once we have it we can calculate if and when we would need to start or stop new agent instances ... |
In order to have an external autoscaler service, two things are currently missing in Woodpecker from my point of view:
1. Detect agents that are idle: if the external autoscaler has spun up several agents and there are later fewer pending jobs, the autoscaler should stop one or more of them. But for that, the agents that currently do not have a job running need to be detected. As the agents do not have an API an external service could use (I think), the Woodpecker server is the only point of contact. I have already checked the API.
2. Mark an agent to not schedule jobs anymore: before an agent is stopped, the server should be told not to schedule any more jobs on it. Otherwise there is a race condition: no jobs are running on the agent, the external autoscaler stops it, but the server has already scheduled a job on it in the meantime. That job would then hang until it runs into a timeout (?).
Could a corresponding API be added to the Woodpecker server (or agent)? Note: I am the author of Picus and am currently thinking about extending Picus into a full autoscaler for Woodpecker. |
@windsource Great suggestions. I started to add an agents list in #1189, which would be the first step in that direction, as it allows the server to identify an agent. (I mainly need another maintainer to review this.) After that PR is merged, adding a link between agents and queue entries should be easily doable. The do-not-schedule flag is a great idea as well and would also be possible to add after #1189. |
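A sketch of the resulting drain-before-stop sequence, written against hypothetical endpoints (`PATCH /api/agents/{id}` setting a `no_schedule` flag, and `GET /api/agents/{id}/tasks`); neither existed at the time of this discussion, so the paths and payloads are assumptions rather than the final Woodpecker API:

```go
package autoscaler

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"strings"
)

// drainAgent first marks an agent as non-schedulable and then reports whether
// it is idle, so the autoscaler can terminate the instance without racing the
// scheduler.
func drainAgent(ctx context.Context, server, token string, agentID int64) (bool, error) {
	client := &http.Client{}

	// 1. Tell the server to stop assigning new jobs to this agent first,
	//    which closes the race window described above.
	patch, err := http.NewRequestWithContext(ctx, http.MethodPatch,
		fmt.Sprintf("%s/api/agents/%d", server, agentID),
		strings.NewReader(`{"no_schedule": true}`))
	if err != nil {
		return false, err
	}
	patch.Header.Set("Authorization", "Bearer "+token)
	patch.Header.Set("Content-Type", "application/json")
	resp, err := client.Do(patch)
	if err != nil {
		return false, err
	}
	resp.Body.Close()

	// 2. Only afterwards check whether the agent still has running tasks.
	get, err := http.NewRequestWithContext(ctx, http.MethodGet,
		fmt.Sprintf("%s/api/agents/%d/tasks", server, agentID), nil)
	if err != nil {
		return false, err
	}
	get.Header.Set("Authorization", "Bearer "+token)
	resp, err = client.Do(get)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()

	var tasks []json.RawMessage
	if err := json.NewDecoder(resp.Body).Decode(&tasks); err != nil {
		return false, err
	}
	return len(tasks) == 0, nil
}
```

Setting the flag before checking for running tasks is what closes the race window: once the flag is set, any task found afterwards is one the agent must finish before the instance can be removed.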
Another feature might be useful here as well: assigning priorities to agents. I wonder how the Woodpecker server schedules jobs on agents when there is more than one agent available. Let's assume the following situation: there are 3 static agents (maybe on premise) already connected, and the external autoscaler is able to dynamically start more agents in the cloud in high-load situations. When the load decreases, the remaining jobs should be scheduled on the static agents with priority, so that the cloud agents can be stopped again. If the user could assign a 'priority' parameter to each agent, that would be possible. @anbraten what do you think about that? |
@windsource Not sure if that is really needed. At least for now I would put it on the long ideas list 😉 |
Hi! Since woodpecker is already using containers for everything, why not simply use containers for autoscaling too? Call the autoscaling container for every pending and finished job and provide some info, e.g. total number of pending jobs etc. as env variables. The container can then decide by itself what to do. I would like to call my terraform scripts for example. Others might just call some web api. But this way there could be autoscaling plugins. Just an idea.. |
FYI with #1631 we should soon be able to properly know which agent has nothing to do and can be removed. |
I will start creating an external autoscaler service written in golang (to be able to share some code) over at https://github.com/woodpecker-ci/autoscaler. If anyone is interested in helping pls reach out to me. @windsource I really like what you started with picus ❤️ , maybe you are interested in working on the go implementation as well. |
Save which agent is running a task. This is now visible in the admin UI in the queue and in the agent details screen.
# changes
- [x] save id of agent executing a task
- [x] add endpoint to get tasks of an agent for #999
- [x] show assigned agent-id in queue
- [x] (offtopic) use same colors for queue stats and icons (similar to the ones used by pipelines)
- [x] (offtopic) use badges for queue labels & dependencies
![image](https://user-images.githubusercontent.com/6918444/226541271-23f3b7b2-7a08-45c2-a2e6-1c7fc31b6f1d.png)
well I was going to implement it as a project for my bachelor thesis this semester ... |
Hi @anbraten, that's good news. Currently Picus is only able to scale a single agent up and down, but I have already thought about how to handle more agents; I just did not have the time to implement that yet. Maybe we can exchange some ideas about the autoscaler in golang.
The tricky part is starting the agent. In Picus I used different methods depending on the cloud provider. In AWS you do not pay for stopped instances (except for the block storage), so I created one instance at the beginning which is then started and stopped by the autoscaler. The advantage of that solution is that build images are already present in the block storage and do not need to be pulled again. Also, the agent starts very quickly. The disadvantage is that it does not scale to more than one agent (unless you have prepared multiple agents like that). I am not sure how important it is to have all images from the last build already present on the newly started agent, but it would be useful, I think.
For Hetzner cloud the autoscaler starts a new instance, and I use cloud-init to set up the instance with docker compose and the Woodpecker agent image. In that case all images (including build images) need to be pulled when the agent is started. That also works quite well but puts some load on the container registries. For AWS that method could be used as well to scale to more than one agent. One could pre-generate an AMI and use an autoscaling group which is then configured by the Woodpecker autoscaler. An alternative for AWS would be instance templates; not sure if this would be better.
Regarding build images already being available on the agent: if there is more than one agent, we probably cannot guarantee that a single project is always assigned to the same agent, or can we, and how? Of course there are labels, but what if the agents all have the same config? |
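For the Hetzner path described above, a minimal sketch (not the Picus code) using `hcloud-go` could look like this; the server type, image name, and the cloud-init user data — including the `${AGENT_SECRET}` placeholder — are assumptions:

```go
// Create a fresh Hetzner Cloud server and hand it a cloud-init script that
// installs Docker and starts the Woodpecker agent container.
package main

import (
	"context"
	"log"
	"os"

	"github.com/hetznercloud/hcloud-go/v2/hcloud"
)

// userData is an illustrative cloud-init script; ${AGENT_SECRET} and the
// server address are placeholders.
const userData = `#cloud-config
runcmd:
  - apt-get update && apt-get install -y docker.io
  - docker run -d --restart always
      -e WOODPECKER_SERVER=woodpecker.example.com:9000
      -e WOODPECKER_AGENT_SECRET=${AGENT_SECRET}
      -v /var/run/docker.sock:/var/run/docker.sock
      woodpeckerci/woodpecker-agent:latest
`

func main() {
	ctx := context.Background()
	client := hcloud.NewClient(hcloud.WithToken(os.Getenv("HCLOUD_TOKEN")))

	serverType, _, err := client.ServerType.GetByName(ctx, "cx22")
	if err != nil {
		log.Fatal(err)
	}
	image, _, err := client.Image.GetByName(ctx, "ubuntu-22.04")
	if err != nil {
		log.Fatal(err)
	}

	result, _, err := client.Server.Create(ctx, hcloud.ServerCreateOpts{
		Name:       "woodpecker-agent-1",
		ServerType: serverType,
		Image:      image,
		UserData:   userData,
	})
	if err != nil {
		log.Fatal(err)
	}
	log.Println("created agent server", result.Server.Name)
}
```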
I've already started on the basic scaling logic a bit. Most interesting is probably the code where I calculate the diff for more / fewer agents at the moment (not sure if it is the best approach): https://github.com/woodpecker-ci/autoscaler/blob/457c0d0545157c03a91829ef844e9d6b322685d2/main.go#L146-L174
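Not the code from the linked file, just a sketch of the kind of calculation such a diff boils down to; the field and parameter names are assumptions:

```go
package scaler

// QueueInfo mirrors the numbers the Woodpecker server reports about its queue.
type QueueInfo struct {
	Pending int // tasks waiting for a free agent
	Running int // tasks currently being executed
}

// diffAgents returns a positive number of agents to start, a negative number
// of agents to stop, or zero if the pool already matches the load.
func diffAgents(q QueueInfo, availableAgents, workflowsPerAgent, minAgents, maxAgents int) int {
	if workflowsPerAgent < 1 {
		workflowsPerAgent = 1
	}
	load := q.Pending + q.Running
	// Round up: 3 workflows on agents that run 2 each still need 2 agents.
	needed := (load + workflowsPerAgent - 1) / workflowsPerAgent

	if needed < minAgents {
		needed = minAgents
	}
	if needed > maxAgents {
		needed = maxAgents
	}
	return needed - availableAgents
}
```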
It is "only" some kind of cache then, so it should definitely work without at least in the beginning. But especially in view of a conscious and sustainable use of computing resources we should optimize this flow later on. For Hetzner I saw that this kubernetes autoscaler is creating an initial snapshot and creates new nodes using that snapshot: https://github.com/kube-hetzner/terraform-hcloud-kube-hetzner#autoscaling-node-pools I guess in general the kubernetes autoscaler project could provide us some nice insights how to do scaling and how the cloud-providers could be used: https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler/cloudprovider CC @windsource |
Only having skimmed this thread: what about forking https://github.com/drone/autoscaler? |
Biggest problem is its license 🙈 |
Can we get Linode on the list of cloud providers? 😋 |
The first container image for the autoscaler (atm just with Hetzner being added as the first cloud provider) was just released: woodpecker-ci/autoscaler#1 https://hub.docker.com/r/woodpeckerci/autoscaler Would be nice if some of you can test it and provide feedback or maybe even start on adding new cloud providers 😉 |
@anbraten Love the autoscaler. The core and calculations look sound and do what is needed. Now running a self-built image including the Linode driver I created. It is doing its thing without issue. |
@guisea Awesome. Thanks for the feedback. |
Closing this one as we will track further development in the autoscaler repo: https://github.com/woodpecker-ci/autoscaler |
Clear and concise description of the problem
As a potential user of Woodpecker I would really like to be able to provision my agent servers on an "as-needed" basis, without having to support Kubernetes.
Suggested solution
Support a Woodpecker-Autoscaler image, which will accept user credentials to spin-up and shut down agent instances as necessary.
Roadmap:
Support major cloud providers:
Alternative
I believe Kubernetes is currently proposed as the alternative to an autoscaler for Woodpecker, but it requires all of the infrastructure (and cost) associated with it.
Additional context
I know this has been discussed at many points before on discord, but I'm not sure if it's been determined that it's out of scope or not. It would be helpful to track that discussion here.