
Autoscaler #999

Closed · 8 of 15 tasks
kdumontnu opened this issue Jun 22, 2022 · 23 comments
Labels: feature (add new functionality), summary (it's a summary for lot of issues)

Comments

@kdumontnu commented Jun 22, 2022

Clear and concise description of the problem

As a potential user of Woodpecker I would really like to be able to provision my agent servers on an "as-needed" basis, without having to support Kubernetes.

Suggested solution

Support a Woodpecker-Autoscaler image, which will accept user credentials to spin up and shut down agent instances as necessary.

Roadmap:

Support major cloud providers:

Alternative

I believe Kubernetes is currently proposed as the alternative to an autoscaler for Woodpecker, but that requires all of the infrastructure (and cost) associated with it.

Additional context

I know this has been discussed at many points before on Discord, but I'm not sure whether it's been determined to be out of scope or not. It would be helpful to track that discussion here.


@anbraten (Member)

Some questions we have to answer in general:

  • Is there maybe a Go package Woodpecker could use, or would it require writing some extensible system to support all kinds of hosters?
  • Which metrics does it need to decide about rescaling?

@mr337 commented Jun 30, 2022

I'm not sure how relevant this is, but BuildKite [1] has something similar for AWS that leverages external tooling. The BuildKite service exposes an API that an AWS Lambda polls every minute, asking for pending jobs. If there are pending jobs, it increases the count on an autoscale group. A new instance is fired up, the agent connects, and it picks up the pending job. That handles the autoscaling increment operation.

The other part is that if the agent is idle for X minutes it will shut itself down; as part of shutting down, it decrements the autoscale group.

I think this design is pretty nice as it has the benefits of:

  • The Woodpecker CI project doesn't need to make modifications to talk to all the different clouds; it exposes a pending "steps" endpoint and nothing more.
  • The per-cloud implementation lives outside the codebase and could be supported in a Lambda for AWS, an Azure Function, or whichever mechanism the relevant cloud offers, or even as another container on the same machine as the server.

[1] - https://github.com/buildkite/buildkite-agent-scaler
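
For illustration, a minimal Go sketch of that polling pattern. The pending-jobs endpoint, its response shape, and the env variable names are all assumptions; the AWS calls use the aws-sdk-go autoscaling client.

```go
// Sketch of the BuildKite-style poller described above: run it periodically
// (e.g. as a Lambda), ask the CI server how many jobs are pending, and bump
// the autoscale group. Endpoint shape and env variable names are assumptions.
package main

import (
	"encoding/json"
	"net/http"
	"os"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

// queueInfo models a hypothetical "pending steps" response.
type queueInfo struct {
	PendingCount int64 `json:"pending_count"` // assumed field name
}

func main() {
	// Poll the server for pending work (URL and auth scheme are assumptions).
	req, _ := http.NewRequest("GET", os.Getenv("CI_SERVER")+"/api/queue/info", nil)
	req.Header.Set("Authorization", "Bearer "+os.Getenv("CI_TOKEN"))
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var info queueInfo
	if err := json.NewDecoder(resp.Body).Decode(&info); err != nil {
		panic(err)
	}
	if info.PendingCount == 0 {
		return // idle agents shut themselves down and decrement the group
	}

	// More pending jobs -> raise the desired capacity; new instances boot,
	// agents connect and pick up the queued jobs.
	svc := autoscaling.New(session.Must(session.NewSession()))
	_, err = svc.SetDesiredCapacity(&autoscaling.SetDesiredCapacityInput{
		AutoScalingGroupName: aws.String(os.Getenv("ASG_NAME")),
		DesiredCapacity:      aws.Int64(info.PendingCount),
	})
	if err != nil {
		panic(err)
	}
}
```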

@kdumontnu (Author) commented Jul 1, 2022

Some questions we have to answer in general:

  • Is there maybe a Go package Woodpecker could use, or would it require writing some extensible system to support all kinds of hosters?

I'm not sure about a unified package, but I don't think wrapping the various Go packages from different providers should be terribly difficult (e.g. google.golang.org/api/googleapi, github.com/aws/aws-sdk-go/aws, ...). I think if we build an autoscaler for one or two services, the community could very easily add support for others as needed.

edit: maybe using terraform could help https://github.com/hashicorp/go-tfe

  • Which metrics does it need to decide about rescaling?

From the Woodpecker host it would need:

  • number of pending runs

For setup configs it would need:

  • min TTL for agents
  • min number of agents
  • max number of agents
  • instance type
  • instance image (we can set a default here)
  • authentication information for provider
  • tags for provider? (this is optional, but most providers allow you to tag instances)

From the provider, it will need to poll/sync:

  • number of available agents
  • number of running agents

Hopefully this helps. (I'm probably missing a lot of interfaces, but maybe we can update this as we learn more.)
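
Collected into rough Go types, the inputs above might look like the following; this is an illustrative sketch only, and none of these names come from Woodpecker.

```go
// Illustrative sketch of the inputs listed above; all names are invented.
package autoscaler

import "time"

// Config holds the operator-provided setup values.
type Config struct {
	MinAgentTTL  time.Duration     // keep a new agent alive at least this long
	MinAgents    int               // never scale below this
	MaxAgents    int               // never scale above this
	InstanceType string            // provider machine type
	Image        string            // instance image (a default can be set)
	Auth         string            // authentication information for the provider
	Tags         map[string]string // optional provider tags
}

// Server is what the autoscaler needs from the Woodpecker host.
type Server interface {
	PendingRuns() (int, error)
}

// Provider is polled/synced per cloud vendor; the community could add
// implementations (AWS, GCP, ...) behind this interface.
type Provider interface {
	AvailableAgents() (int, error)
	RunningAgents() (int, error)
	CreateAgent(cfg Config) (id string, err error)
	DestroyAgent(id string) error
}
```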

@lafriks (Contributor) commented Jul 15, 2022

This should probably be created as a separate service/repo to make it easier to maintain and to avoid bringing too many dependencies into the core.

@6543 added the feature (add new functionality) label and removed the pending:feature label Sep 24, 2022
@6543 (Member) commented Sep 24, 2022

how to do it externally :) -> https://github.com/windsource/picus

@6543 (Member) commented Oct 20, 2022

I would say we should wait for #1189 ... once we have it, we can calculate if and when we need to start or stop agent instances ...

@6543 added the summary (it's a summary for lot of issues) label Oct 25, 2022
@anbraten mentioned this issue Nov 26, 2022
@windsource commented Jan 22, 2023

In order to have an external autoscaler service, two things are currently missing in Woodpecker from my point of view:

Detect those agents that are idle

In case the external autoscaler has spun up several agents and there are later fewer pending jobs, the autoscaler should stop one or more agents. But in this case it needs to detect which agents currently do not have a job running. As the agents do not have an API usable from an external service (I think), the Woodpecker server is the only point of contact. I have already checked the API /api/queue/info, but it does not provide any information about the agent host where a pipeline is running. Could that information be retrieved externally in any other way?

Mark agent to not schedule jobs anymore

Before an agent is stopped, the server should be told not to schedule any more jobs on that agent. If this is not done, there could be a race condition: there are no jobs running on the agent, the external autoscaler stops it, but the server has already scheduled a job on it in the meantime. That would cause the job to hang until it hits a timeout (?). Could a corresponding API be added to the Woodpecker server (or agent)?

Note: I am the author of Picus and am currently thinking about extending Picus into a real autoscaler for Woodpecker.
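
To make the gap concrete, here is one hypothetical shape such an API could take. Neither endpoint existed at the time of writing; all paths and fields are invented for illustration.

```go
// Hypothetical client calls for the two missing pieces above; the paths and
// payloads are invented and are not a real Woodpecker API.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

type agent struct {
	ID         int64 `json:"id"`
	Busy       bool  `json:"busy"`        // assumed: agent has a running job
	NoSchedule bool  `json:"no_schedule"` // assumed: drained, gets no new jobs
}

// idleAgents answers "which agents are idle?" (missing feature 1).
func idleAgents(server, token string) ([]agent, error) {
	req, _ := http.NewRequest("GET", server+"/api/agents", nil)
	req.Header.Set("Authorization", "Bearer "+token)
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var all, idle []agent
	if err := json.NewDecoder(resp.Body).Decode(&all); err != nil {
		return nil, err
	}
	for _, a := range all {
		if !a.Busy {
			idle = append(idle, a)
		}
	}
	return idle, nil
}

// drainAgent marks an agent as not schedulable (missing feature 2), closing
// the race between "agent looks idle" and "server assigns it a new job".
func drainAgent(server, token string, id int64) error {
	body, _ := json.Marshal(map[string]bool{"no_schedule": true})
	req, _ := http.NewRequest("PATCH",
		fmt.Sprintf("%s/api/agents/%d", server, id), bytes.NewReader(body))
	req.Header.Set("Authorization", "Bearer "+token)
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	return resp.Body.Close()
}
```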

@anbraten (Member)

@windsource Great suggestions. I started to add an agents list in #1189, which would be the first step in that direction, allowing the server to identify an agent. (I am mainly needing another maintainer to review this.)

After that PR is merged, adding a link between agents and queue entries should be easily doable. The do-not-schedule flag is a great idea as well and would also be possible to add after #1189.

@windsource

Another feature might be useful here as well:

Assign priorities to agents

I wonder how the Woodpecker server schedules jobs on agents when more than one agent is available. Let's assume the following situation: there are 3 static agents (maybe on premise) already connected, and the external autoscaler is able to dynamically start more agents in the cloud under high load. When load decreases, the remaining jobs should be scheduled on the static agents with priority, such that the cloud agents can be stopped again. If the user could assign a 'priority' parameter to each agent, that would be possible.

@anbraten what do you think about that?
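
For illustration only, the selection rule could be as simple as the sketch below; this is hypothetical and not how the Woodpecker scheduler works today.

```go
// Hypothetical rule: among idle agents, prefer the highest operator-assigned
// priority (e.g. static on-premise agents) so cloud agents drain first.
package main

type agent struct {
	Name     string
	Priority int // set per agent by the operator; higher wins
	Idle     bool
}

// pickAgent returns the idle agent with the highest priority, or nil.
func pickAgent(agents []agent) *agent {
	var best *agent
	for i := range agents {
		a := &agents[i]
		if !a.Idle {
			continue
		}
		if best == nil || a.Priority > best.Priority {
			best = a
		}
	}
	return best
}
```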

@anbraten (Member)

@windsource Not sure if that is really needed. At least for now I would put it on the long ideas list 😉

@gmuellerinform

Hi! Since Woodpecker is already using containers for everything, why not simply use containers for autoscaling too? Call the autoscaling container for every pending and finished job and provide some info, e.g. the total number of pending jobs, as env variables. The container can then decide by itself what to do. I would like to call my terraform scripts, for example; others might just call some web API. This way there could be autoscaling plugins. Just an idea.
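
A toy Go sketch of what such a plugin container could look like; the PLUGIN_* variable names and the terraform invocation are invented for illustration, and no such contract exists in Woodpecker.

```go
// Toy autoscaling "plugin": the server would run this container on queue
// changes and pass state via env variables. All names here are invented.
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strconv"
)

func main() {
	pending, _ := strconv.Atoi(os.Getenv("PLUGIN_PENDING_JOBS"))
	agents, _ := strconv.Atoi(os.Getenv("PLUGIN_RUNNING_AGENTS"))

	// The container decides by itself what to do; here it shells out to
	// terraform, but it could just as well call some web API.
	if pending > agents {
		cmd := exec.Command("terraform", "apply", "-auto-approve",
			"-var", fmt.Sprintf("agent_count=%d", pending))
		cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
		if err := cmd.Run(); err != nil {
			os.Exit(1)
		}
	}
}
```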

@anbraten (Member)

FYI with #1631 we should soon be able to properly know which agent has nothing to do and can be removed.

@anbraten (Member)

I will start creating an external autoscaler service written in Go (to be able to share some code) over at https://github.com/woodpecker-ci/autoscaler. If anyone is interested in helping, please reach out to me.

@windsource I really like what you started with picus ❤️ , maybe you are interested in working on the go implementation as well.

lafriks pushed a commit that referenced this issue Mar 21, 2023
Save which agent is running a task. This is now visible in the admin UI
in the queue and in the agent details screen.

# changes
- [x] save id of agent executing a task
- [x] add endpoint to get tasks of an agent for #999 
- [x] show assigned agent-id in queue
- [x] (offtopic) use same colors for queue stats and icons (similar to
the ones used by pipelines)
- [x] (offtopic) use badges for queue labels & dependencies


![image](https://user-images.githubusercontent.com/6918444/226541271-23f3b7b2-7a08-45c2-a2e6-1c7fc31b6f1d.png)
@6543 (Member) commented Mar 22, 2023

Well, I was going to implement it as a project for my bachelor thesis this semester ...

@windsource

Hi @anbraten, that's good news. Currently Picus is only able to scale a single agent up and down, but I have already thought about how to handle more agents; I just did not have the time to implement it yet. Maybe we can exchange some ideas about the autoscaler in Go.

The tricky part is starting the agent. In Picus I used different methods depending on the cloud provider. In AWS you do not pay for stopped instances (except for the block storage), so I created one instance at the beginning which is then started and stopped by the autoscaler. The advantage of that solution is that build images are already present in the block storage and do not need to be pulled again. Also, the agent starts very quickly. The disadvantage is that it does not scale to more than one agent (unless you have prepared multiple agents like that). I am not sure how important it is to have all images from the last build already present on the newly started agent, but I think it would be useful.

For Hetzner Cloud the autoscaler starts a new instance, and I use cloud-init to set up the instance with Docker Compose and the Woodpecker agent image. In that case all images (including build images) need to be pulled when the agent is started. That also works quite well but puts some load on the container registries.
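
Roughly what that looks like with the hcloud-go client, as a sketch; the server type, image, and the abbreviated cloud-init payload are placeholders.

```go
// Sketch: start a Hetzner Cloud instance that boots a Woodpecker agent via
// cloud-init. Server type, image, and user data are placeholders.
package main

import (
	"context"
	"fmt"
	"os"

	"github.com/hetznercloud/hcloud-go/hcloud"
)

// Abbreviated cloud-init; a real payload would configure tokens, a compose
// file, etc. Every image pull happens on first boot, as described above.
const userData = `#cloud-config
runcmd:
  - docker run -d --restart=always woodpeckerci/woodpecker-agent:latest
`

func main() {
	client := hcloud.NewClient(hcloud.WithToken(os.Getenv("HCLOUD_TOKEN")))

	res, _, err := client.Server.Create(context.Background(), hcloud.ServerCreateOpts{
		Name:       "woodpecker-agent-1",
		ServerType: &hcloud.ServerType{Name: "cx21"}, // placeholder size
		Image:      &hcloud.Image{Name: "docker-ce"}, // placeholder image
		UserData:   userData,
	})
	if err != nil {
		panic(err)
	}
	fmt.Println("created server", res.Server.ID)
}
```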

For AWS that method could be used as well to scale to more than one agent. One could pre-generate an AMI and use an autoscaling group which is then configured by the Woodpecker autoscaler. An alternative for AWS would be instance templates. Not sure if this would be better.

Regarding build images already being available on the agent: if there is more than one agent, we probably cannot guarantee that a single project is always assigned to the same agent (or can we, and how?). Of course there are labels, but what if the agents all have the same config?

@anbraten (Member)

I've already started on the basic scaling logic a bit. Most interesting is probably the code where I calculate the diff for new / fewer agents at the moment (not sure if it is the best approach): https://github.com/woodpecker-ci/autoscaler/blob/457c0d0545157c03a91829ef844e9d6b322685d2/main.go#L146-L174
It basically gets the number of free workers (agents * WOODPECKER_MAX_WORKFLOWS) and the number of pending tasks, limits the result by the min and max agent counts, and then tries to add / remove agents to get closer to that amount. Removing agents is currently implemented by setting the do-not-schedule-new-tasks flag. At the end, a kind of garbage collection removes all agents which are disabled and are not executing any workflows.
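
Paraphrased (this is not the linked code, just the described idea), the calculation reads roughly like:

```go
// Paraphrase of the described diff logic: compare capacity against pending
// load, clamp to the configured bounds, and return how many agents to add
// (positive) or drain (negative).
package main

func calcAgentDiff(currentAgents, maxWorkflows, pendingTasks, minAgents, maxAgents int) int {
	// Agents needed to cover the load; each agent runs up to
	// WOODPECKER_MAX_WORKFLOWS workflows in parallel (ceil division).
	needed := (pendingTasks + maxWorkflows - 1) / maxWorkflows

	if needed < minAgents {
		needed = minAgents
	}
	if needed > maxAgents {
		needed = maxAgents
	}

	// Draining sets the do-not-schedule flag; garbage collection later
	// removes agents that are disabled and run no workflows.
	return needed - currentAgents
}
```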

I am not sure how important it is to have all images from the last build already present on the newly started agent but it would be useful I think.

It is "only" some kind of cache then, so it should definitely work without it, at least in the beginning. But especially in view of a conscious and sustainable use of computing resources, we should optimize this flow later on.

For Hetzner I saw that this Kubernetes autoscaler creates an initial snapshot and creates new nodes using that snapshot: https://github.com/kube-hetzner/terraform-hcloud-kube-hetzner#autoscaling-node-pools

I guess in general the Kubernetes autoscaler project could give us some nice insights into how to do scaling and how the cloud providers could be used: https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler/cloudprovider

CC @windsource

@kolaente

Only having skimmed this thread: what about forking https://github.com/drone/autoscaler?

@anbraten (Member)

Biggest problem is its license 🙈

@mvdkleijn (Contributor)

Can we get Linode on the list of cloud providers? 😋

@anbraten (Member)

The first container image for the autoscaler (atm just with Hetzner being added as the first cloud provider) was just released: woodpecker-ci/autoscaler#1

https://hub.docker.com/r/woodpeckerci/autoscaler

Would be nice if some of you could test it and provide feedback, or maybe even start adding new cloud providers 😉

@guisea commented Aug 30, 2023

@anbraten Love the autoscaler. The core and calculations look sound and do what is needed. I'm now running a self-built image including the Linode driver I created. It is doing its thing without issue.

@anbraten (Member)

@guisea Awesome. Thanks for the feedback.

@raskyld mentioned this issue Nov 11, 2023
@anbraten (Member)

Closing this one as we will track further development in the autoscaler repo: https://github.com/woodpecker-ci/autoscaler

@woodpecker-ci locked as resolved and limited conversation to collaborators Feb 12, 2024