
Autoscaler #999

Closed · 8 of 15 tasks
kdumontnu opened this issue Jun 22, 2022 · 23 comments
Labels: feature (add new functionality), summary (it's a summary for lot of issues)

Comments

@kdumontnu commented Jun 22, 2022

Clear and concise description of the problem

As a potential user of Woodpecker I would really like to be able to provision my agent servers on an "as-needed" basis, without having to support Kubernetes.

Suggested solution

Support a Woodpecker-Autoscaler image, which will accept user credentials to spin up and shut down agent instances as necessary.

Roadmap:

Support major cloud providers:

Alternative

I believe Kubernetes is currently proposed as the alternative to an autoscaler for Woodpecker, but that requires all of the infrastructure (and cost) associated with it.

Additional context

I know this has been discussed at many points before on Discord, but I'm not sure whether it's been determined to be out of scope or not. It would be helpful to track that discussion here.


@anbraten (Member)

Some questions we have to answer in general:

  • Is there maybe a Go package Woodpecker could use, or would it require writing some extensible system to support all kinds of hosters?
  • Which metrics does it need to decide about rescaling?

@mr337 commented Jun 30, 2022

I'm not sure how relevant this is, but BuildKite [1] has something similar for AWS that leverages external tooling. The BuildKite service exposes an API that an AWS Lambda polls every minute, asking for pending jobs. If there are pending jobs, it increases the count on an autoscale group. A new instance is fired up, the agent connects, and it picks up the pending job. That handles the autoscaling increment operation.

The other part is that if the agent is idle for X minutes it will shut itself down; as part of shutting down, it decrements the autoscale group.

I think this design is pretty nice as it has the benefits of:

  • The Woodpecker CI project doesn't need to make modifications to talk to all the different clouds; it exposes a pending "steps" endpoint and nothing more.
  • The per-cloud implementation lives outside the codebase and could be supported in a Lambda for AWS, an Azure Function, or whichever mechanism the relevant cloud offers, or even as another container on the same machine as the server.

[1] - https://github.com/buildkite/buildkite-agent-scaler
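
For illustration, a minimal Go sketch of that polling pattern. The pending-jobs endpoint, its response shape, and the env variable names are all assumptions; the AWS calls use the aws-sdk-go autoscaling client.

```go
// Sketch of the BuildKite-style poller described above: run it periodically
// (e.g. as a Lambda), ask the CI server how many jobs are pending, and bump
// the autoscale group. Endpoint shape and env variable names are assumptions.
package main

import (
	"encoding/json"
	"net/http"
	"os"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

// queueInfo models a hypothetical "pending steps" response.
type queueInfo struct {
	PendingCount int64 `json:"pending_count"` // assumed field name
}

func main() {
	// Poll the server for pending work (URL and auth scheme are assumptions).
	req, _ := http.NewRequest("GET", os.Getenv("CI_SERVER")+"/api/queue/info", nil)
	req.Header.Set("Authorization", "Bearer "+os.Getenv("CI_TOKEN"))
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var info queueInfo
	if err := json.NewDecoder(resp.Body).Decode(&info); err != nil {
		panic(err)
	}
	if info.PendingCount == 0 {
		return // idle agents shut themselves down and decrement the group
	}

	// More pending jobs -> raise the desired capacity; new instances boot,
	// agents connect and pick up the queued jobs.
	svc := autoscaling.New(session.Must(session.NewSession()))
	_, err = svc.SetDesiredCapacity(&autoscaling.SetDesiredCapacityInput{
		AutoScalingGroupName: aws.String(os.Getenv("ASG_NAME")),
		DesiredCapacity:      aws.Int64(info.PendingCount),
	})
	if err != nil {
		panic(err)
	}
}
```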

@kdumontnu (Author) commented Jul 1, 2022

Some questions we have to answer in general:

  • Is there maybe a Go package Woodpecker could use, or would it require writing some extensible system to support all kinds of hosters?

I'm not sure about a unified package, but I don't think wrapping the various Go packages from different providers should be terribly difficult (e.g. google.golang.org/api/googleapi, github.com/aws/aws-sdk-go/aws, ...). I think if we build an autoscaler for one or two services, the community could very easily add support for others as needed.

edit: maybe using terraform could help https://github.com/hashicorp/go-tfe

  • Which metrics does it need to decide about rescaling?

From the Woodpecker host it would need:

  • number of pending runs

For setup configs it would need:

  • min TTL for agents
  • min number of agents
  • max number of agents
  • instance type
  • instance image (we can set a default here)
  • authentication information for provider
  • tags for provider? (this is optional, but most providers allow you to tag instances)

From the provider, it will need to poll/sync:

  • number of available agents
  • number of running agents

Hopefully this helps. (I'm probably missing a lot of interfaces, but maybe we can update this as we learn more.)
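
Collected into rough Go types, the inputs above might look like the following; this is an illustrative sketch only, and none of these names come from Woodpecker.

```go
// Illustrative sketch of the inputs listed above; all names are invented.
package autoscaler

import "time"

// Config holds the operator-provided setup values.
type Config struct {
	MinAgentTTL  time.Duration     // keep a new agent alive at least this long
	MinAgents    int               // never scale below this
	MaxAgents    int               // never scale above this
	InstanceType string            // provider machine type
	Image        string            // instance image (a default can be set)
	Auth         string            // authentication information for the provider
	Tags         map[string]string // optional provider tags
}

// Server is what the autoscaler needs from the Woodpecker host.
type Server interface {
	PendingRuns() (int, error)
}

// Provider is polled/synced per cloud vendor; the community could add
// implementations (AWS, GCP, ...) behind this interface.
type Provider interface {
	AvailableAgents() (int, error)
	RunningAgents() (int, error)
	CreateAgent(cfg Config) (id string, err error)
	DestroyAgent(id string) error
}
```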

@lafriks (Contributor) commented Jul 15, 2022

This should probably be created as a separate service/repo to make it easier to maintain and to avoid bringing too many dependencies into the core.

@6543 added the feature (add new functionality) label and removed the pending:feature label Sep 24, 2022
@6543 (Member) commented Sep 24, 2022

how to do it externally :) -> https://github.com/windsource/picus

@6543 (Member) commented Oct 20, 2022

I would say we should wait for #1189 ... once we have it, we can calculate if and when we need to start or stop agent instances ...

@6543 added the summary (it's a summary for lot of issues) label Oct 25, 2022
@anbraten mentioned this issue Nov 26, 2022
@windsource commented Jan 22, 2023

In order to have an external autoscaler service, two things are currently missing in Woodpecker from my point of view:

Detect those agents that are idle

In case the external autoscaler has spun up several agents and there are later fewer pending jobs, the autoscaler should stop one or more agents. But in this case it needs to detect which agents currently do not have a job running. As the agents do not have an API usable from an external service (I think), the Woodpecker server is the only point of contact. I have already checked the API /api/queue/info, but it does not provide any information about the agent host where a pipeline is running. Could that information be retrieved externally in any other way?

Mark agent to not schedule jobs anymore

Before an agent is stopped, the server should be told not to schedule any more jobs on that agent. If this is not done, there could be a race condition: there are no jobs running on the agent, the external autoscaler stops it, but the server has already scheduled a job on it in the meantime. That would cause the job to hang until it hits a timeout (?). Could a corresponding API be added to the Woodpecker server (or agent)?

Note: I am the author of Picus and am currently thinking about extending Picus into a real autoscaler for Woodpecker.
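
To make the gap concrete, here is one hypothetical shape such an API could take. Neither endpoint existed at the time of writing; all paths and fields are invented for illustration.

```go
// Hypothetical client calls for the two missing pieces above; the paths and
// payloads are invented and are not a real Woodpecker API.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

type agent struct {
	ID         int64 `json:"id"`
	Busy       bool  `json:"busy"`        // assumed: agent has a running job
	NoSchedule bool  `json:"no_schedule"` // assumed: drained, gets no new jobs
}

// idleAgents answers "which agents are idle?" (missing feature 1).
func idleAgents(server, token string) ([]agent, error) {
	req, _ := http.NewRequest("GET", server+"/api/agents", nil)
	req.Header.Set("Authorization", "Bearer "+token)
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var all, idle []agent
	if err := json.NewDecoder(resp.Body).Decode(&all); err != nil {
		return nil, err
	}
	for _, a := range all {
		if !a.Busy {
			idle = append(idle, a)
		}
	}
	return idle, nil
}

// drainAgent marks an agent as not schedulable (missing feature 2), closing
// the race between "agent looks idle" and "server assigns it a new job".
func drainAgent(server, token string, id int64) error {
	body, _ := json.Marshal(map[string]bool{"no_schedule": true})
	req, _ := http.NewRequest("PATCH",
		fmt.Sprintf("%s/api/agents/%d", server, id), bytes.NewReader(body))
	req.Header.Set("Authorization", "Bearer "+token)
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	return resp.Body.Close()
}
```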

@anbraten (Member)

@windsource Great suggestions. I started to add an agents list in #1189, which would be the first step in that direction, allowing the server to identify an agent. (I am mainly needing another maintainer to review this.)

After that PR is merged, adding a link between agents and queue entries should be easily doable. The do-not-schedule flag is a great idea as well and would also be possible to add after #1189.

@windsource

Another feature might be useful here as well:

Assign priorities to agents

I wonder how the Woodpecker server schedules jobs on agents when more than one agent is available. Let's assume the following situation: there are 3 static agents (maybe on premise) already connected, and the external autoscaler is able to dynamically start more agents in the cloud under high load. When load decreases, the remaining jobs should be scheduled on the static agents with priority, such that the cloud agents can be stopped again. If the user could assign a 'priority' parameter to each agent, that would be possible.

@anbraten what do you think about that?
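
For illustration only, the selection rule could be as simple as the sketch below; this is hypothetical and not how the Woodpecker scheduler works today.

```go
// Hypothetical rule: among idle agents, prefer the highest operator-assigned
// priority (e.g. static on-premise agents) so cloud agents drain first.
package main

type agent struct {
	Name     string
	Priority int // set per agent by the operator; higher wins
	Idle     bool
}

// pickAgent returns the idle agent with the highest priority, or nil.
func pickAgent(agents []agent) *agent {
	var best *agent
	for i := range agents {
		a := &agents[i]
		if !a.Idle {
			continue
		}
		if best == nil || a.Priority > best.Priority {
			best = a
		}
	}
	return best
}
```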

@anbraten (Member)

@windsource Not sure if that is really needed. At least for now I would put it on the long ideas list 😉

@gmuellerinform

Hi! Since Woodpecker is already using containers for everything, why not simply use containers for autoscaling too? Call the autoscaling container for every pending and finished job and provide some info, e.g. the total number of pending jobs, as env variables. The container can then decide by itself what to do. I would like to call my terraform scripts, for example; others might just call some web API. This way there could be autoscaling plugins. Just an idea.
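
A toy Go sketch of what such a plugin container could look like; the PLUGIN_* variable names and the terraform invocation are invented for illustration, and no such contract exists in Woodpecker.

```go
// Toy autoscaling "plugin": the server would run this container on queue
// changes and pass state via env variables. All names here are invented.
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strconv"
)

func main() {
	pending, _ := strconv.Atoi(os.Getenv("PLUGIN_PENDING_JOBS"))
	agents, _ := strconv.Atoi(os.Getenv("PLUGIN_RUNNING_AGENTS"))

	// The container decides by itself what to do; here it shells out to
	// terraform, but it could just as well call some web API.
	if pending > agents {
		cmd := exec.Command("terraform", "apply", "-auto-approve",
			"-var", fmt.Sprintf("agent_count=%d", pending))
		cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
		if err := cmd.Run(); err != nil {
			os.Exit(1)
		}
	}
}
```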

@anbraten (Member)

FYI with #1631 we should soon be able to properly know which agent has nothing to do and can be removed.

@anbraten (Member)

I will start creating an external autoscaler service written in Go (to be able to share some code) over at https://github.com/woodpecker-ci/autoscaler. If anyone is interested in helping, please reach out to me.

@windsource I really like what you started with picus ❤️ , maybe you are interested in working on the go implementation as well.

lafriks pushed a commit that referenced this issue Mar 21, 2023
Save which agent is running a task. This is now visible in the admin UI
in the queue and in the agent details screen.

# changes
- [x] save id of agent executing a task
- [x] add endpoint to get tasks of an agent for #999 
- [x] show assigned agent-id in queue
- [x] (offtopic) use same colors for queue stats and icons (similar to
the ones used by pipelines)
- [x] (offtopic) use badges for queue labels & dependencies


![image](https://user-images.githubusercontent.com/6918444/226541271-23f3b7b2-7a08-45c2-a2e6-1c7fc31b6f1d.png)
@6543 (Member) commented Mar 22, 2023

Well, I was going to implement it as a project for my bachelor thesis this semester ...

@windsource

Hi @anbraten, that's good news. Currently Picus is only able to scale a single agent up and down, but I have already thought about how to handle more agents; I just did not have the time to implement it yet. Maybe we can exchange some ideas about the autoscaler in Go.

The tricky part is starting the agent. In Picus I used different methods depending on the cloud provider. In AWS you do not pay for stopped instances (except for the block storage), so I created one instance at the beginning which is then started and stopped by the autoscaler. The advantage of that solution is that build images are already present in the block storage and do not need to be pulled again. Also, the agent starts very quickly. The disadvantage is that it does not scale to more than one agent (unless you have prepared multiple agents like that). I am not sure how important it is to have all images from the last build already present on the newly started agent, but I think it would be useful.

For Hetzner Cloud the autoscaler starts a new instance, and I use cloud-init to set up the instance with Docker Compose and the Woodpecker agent image. In that case all images (including build images) need to be pulled when the agent is started. That also works quite well but puts some load on the container registries.
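
Roughly what that looks like with the hcloud-go client, as a sketch; the server type, image, and the abbreviated cloud-init payload are placeholders.

```go
// Sketch: start a Hetzner Cloud instance that boots a Woodpecker agent via
// cloud-init. Server type, image, and user data are placeholders.
package main

import (
	"context"
	"fmt"
	"os"

	"github.com/hetznercloud/hcloud-go/hcloud"
)

// Abbreviated cloud-init; a real payload would configure tokens, a compose
// file, etc. Every image pull happens on first boot, as described above.
const userData = `#cloud-config
runcmd:
  - docker run -d --restart=always woodpeckerci/woodpecker-agent:latest
`

func main() {
	client := hcloud.NewClient(hcloud.WithToken(os.Getenv("HCLOUD_TOKEN")))

	res, _, err := client.Server.Create(context.Background(), hcloud.ServerCreateOpts{
		Name:       "woodpecker-agent-1",
		ServerType: &hcloud.ServerType{Name: "cx21"}, // placeholder size
		Image:      &hcloud.Image{Name: "docker-ce"}, // placeholder image
		UserData:   userData,
	})
	if err != nil {
		panic(err)
	}
	fmt.Println("created server", res.Server.ID)
}
```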

For AWS that method could be used as well to scale to more than one agent. One could pre-generate an AMI and use an autoscaling group which is then configured by the Woodpecker autoscaler. An alternative for AWS would be instance templates. Not sure if this would be better.

Regarding build images already being available on the agent: if there is more than one agent, we probably cannot guarantee that a single project is always assigned to the same agent (or can we, and how?). Of course there are labels, but what if the agents all have the same config?

@anbraten (Member)

I've already started on the basic scaling logic a bit. Most interesting is probably the code where I calculate the diff for new / fewer agents at the moment (not sure if it is the best approach): https://github.com/woodpecker-ci/autoscaler/blob/457c0d0545157c03a91829ef844e9d6b322685d2/main.go#L146-L174
It basically gets the number of free workers (agents * WOODPECKER_MAX_WORKFLOWS) and the number of pending tasks, limits the result by the min and max agent counts, and then tries to add / remove agents to get closer to that amount. Removing agents is currently implemented by setting the do-not-schedule-new-tasks flag. At the end, a kind of garbage collection removes all agents which are disabled and are not executing any workflows.
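
Paraphrased (this is not the linked code, just the described idea), the calculation reads roughly like:

```go
// Paraphrase of the described diff logic: compare capacity against pending
// load, clamp to the configured bounds, and return how many agents to add
// (positive) or drain (negative).
package main

func calcAgentDiff(currentAgents, maxWorkflows, pendingTasks, minAgents, maxAgents int) int {
	// Agents needed to cover the load; each agent runs up to
	// WOODPECKER_MAX_WORKFLOWS workflows in parallel (ceil division).
	needed := (pendingTasks + maxWorkflows - 1) / maxWorkflows

	if needed < minAgents {
		needed = minAgents
	}
	if needed > maxAgents {
		needed = maxAgents
	}

	// Draining sets the do-not-schedule flag; garbage collection later
	// removes agents that are disabled and run no workflows.
	return needed - currentAgents
}
```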

I am not sure how important it is to have all images from the last build already present on the newly started agent but it would be useful I think.

It is "only" some kind of cache then, so it should definitely work without it, at least in the beginning. But especially in view of a conscious and sustainable use of computing resources, we should optimize this flow later on.

For Hetzner I saw that this Kubernetes autoscaler creates an initial snapshot and creates new nodes using that snapshot: https://github.com/kube-hetzner/terraform-hcloud-kube-hetzner#autoscaling-node-pools

I guess in general the Kubernetes autoscaler project could give us some nice insights into how to do scaling and how the cloud providers could be used: https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler/cloudprovider

CC @windsource

@kolaente

Only having skimmed this thread: what about forking https://github.com/drone/autoscaler?

@anbraten (Member)

Biggest problem is its license 🙈

@mvdkleijn (Contributor)

Can we get Linode on the list of cloud providers? 😋

@anbraten (Member)

The first container image for the autoscaler (atm just with Hetzner being added as the first cloud provider) was just released: woodpecker-ci/autoscaler#1

https://hub.docker.com/r/woodpeckerci/autoscaler

Would be nice if some of you could test it and provide feedback, or maybe even start adding new cloud providers 😉

@guisea commented Aug 30, 2023

@anbraten Love the autoscaler. The core and calculations look sound and do what is needed. I'm now running a self-built image including the Linode driver I created. It is doing its thing without issue.

@anbraten (Member)

@guisea Awesome. Thanks for the feedback.

@raskyld mentioned this issue Nov 11, 2023
@anbraten (Member)

Closing this one as we will track further development in the autoscaler repo: https://github.com/woodpecker-ci/autoscaler

@woodpecker-ci locked as resolved and limited conversation to collaborators Feb 12, 2024