Improve architecture for horizontal scaling #515
Comments
cc @rquitales @EronWright for your awareness.
How is concurrency limited when handling stacks that all change at the same time? If OP has 40+ stacks and they're all being refreshed/updated at the same time, would some simple concurrency controls smooth the spike out over a longer time?
I set the MAX_CONCURRENT_RECONCILES variable on the operator pod to 4. If I set a higher value, e.g. 10, the operator consumes far more resources and gets OOM-killed unless I dedicate even more memory to the pod. That burns money, because most of the time the pod is doing nothing since there are no changes in the stacks. But with MAX_CONCURRENT_RECONCILES=4, updates are too slow when all stacks receive a change at once.
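For reference, here is a minimal sketch (using @pulumi/kubernetes in TypeScript) of where this variable lives on the operator Deployment; the namespace, image tag, and resource numbers are placeholders rather than my exact configuration, and other fields of the upstream manifest are omitted:

```typescript
import * as k8s from "@pulumi/kubernetes";

// Sketch of the operator Deployment with the concurrency and memory trade-off
// discussed above. Namespace, image tag, and resource values are placeholders.
const operator = new k8s.apps.v1.Deployment("pulumi-kubernetes-operator", {
    metadata: { namespace: "pulumi-operator" },
    spec: {
        replicas: 1, // leader election means only one replica does work anyway
        selector: { matchLabels: { name: "pulumi-kubernetes-operator" } },
        template: {
            metadata: { labels: { name: "pulumi-kubernetes-operator" } },
            spec: {
                serviceAccountName: "pulumi-kubernetes-operator",
                containers: [{
                    name: "operator",
                    image: "pulumi/pulumi-kubernetes-operator:v1.16.0", // placeholder tag
                    env: [
                        // Upper bound on how many Stack resources reconcile in parallel.
                        // Raising it raises peak CPU/memory on this single pod.
                        { name: "MAX_CONCURRENT_RECONCILES", value: "4" },
                    ],
                    resources: {
                        // Sized for the reconcile spike, so mostly idle otherwise.
                        requests: { cpu: "1", memory: "2Gi" },
                        limits: { memory: "4Gi" },
                    },
                }],
            },
        },
    },
});
```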
Good to know; as someone new to the operator I was just thinking out loud about the concurrency, but it makes sense given that the update is also too slow. With those requirements it does feel like pushing those sessions out to Job pods, so they can be distributed on demand across the wider cluster, makes sense.
Added to epic #586
Good news everyone, we just released a preview of Pulumi Kubernetes Operator v2. This new release has a whole new architecture that provides much better horizontal scalability. Please read the announcement blog post for more information. We would love to hear your feedback! Feel free to engage with us on the #kubernetes channel of the Pulumi Slack workspace.
Issue details
Hello Pulumi team! I've been using Pulumi for years and recently started using the Pulumi Kubernetes Operator.
With 40+ stacks based on the same TypeScript npm project and managed by a single Pulumi operator installation, I have run into design problems in the operator.
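For a concrete picture, each of those stacks is a Stack custom resource pointing at the same repository, roughly like the sketch below (stack names, repo URL, and secret are placeholders, not my real configuration, and only a subset of the Stack spec is shown):

```typescript
import * as k8s from "@pulumi/kubernetes";

// Hypothetical example: many Stack CRs referencing the same TypeScript project,
// all reconciled by the single operator pod.
const environments = ["team-a", "team-b", "team-c"]; // 40+ in practice

for (const env of environments) {
    new k8s.apiextensions.CustomResource(`stack-${env}`, {
        apiVersion: "pulumi.com/v1",
        kind: "Stack",
        metadata: { namespace: "pulumi-operator" },
        spec: {
            stack: `my-org/platform/${env}`,                    // placeholder stack name
            projectRepo: "https://github.com/my-org/platform",  // placeholder repo
            branch: "refs/heads/main",
            envRefs: {
                // Pulumi backend token from a Secret (placeholder name).
                PULUMI_ACCESS_TOKEN: {
                    type: "Secret",
                    secret: { name: "pulumi-api-secret", key: "accessToken" },
                },
            },
        },
    });
}
```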
When the operator runs npm install and executes the Pulumi program for several stacks, it consumes a lot of CPU and memory, but this only happens after git changes. Most of the time the operator pod is doing nothing because there are no git changes, yet I still have to set CPU and memory requests high enough to avoid OOM kills during those spikes. As a result, the pod's resources are underutilized and it is burning money most of the time.
The problem is partly related to #368
When I set the resources too low, the operator gets OOM-killed during infrastructure provisioning and the stack state file is left locked by the interrupted update.
In addition to the resource problem, it is not possible to scale the operator deployment horizontally to speed up syncing a large number of stacks. Because of the Kubernetes lease-based leader election, only one pod can work on stacks at any given moment.
As a solution, I would decouple the "npm install" and "pulumi up" work from the operator pod into worker pods: the operator would assign a worker pod to an individual stack to provision it, and once the stack is done the worker would terminate to save costs. The operator pod itself would act only as a controller for Stacks and worker pods. This would make the Pulumi Operator scalable enough for big platforms with hundreds or thousands of stacks. A rough sketch of what such a worker could look like follows below.
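This is not an existing operator feature, just an illustration of the idea: a short-lived Kubernetes Job per stack that checks out the repo, runs npm install and pulumi up, and then exits. The image, repo URL, and secret names are placeholders.

```typescript
import * as k8s from "@pulumi/kubernetes";

// Hypothetical per-stack worker: a short-lived Job the controller would create
// for each Stack that needs reconciling, instead of doing the work in-process.
function stackWorkerJob(stackName: string): k8s.batch.v1.Job {
    return new k8s.batch.v1.Job(`pulumi-worker-${stackName}`, {
        metadata: { namespace: "pulumi-operator" },
        spec: {
            backoffLimit: 0,
            ttlSecondsAfterFinished: 300, // clean up the finished pod to save costs
            template: {
                spec: {
                    restartPolicy: "Never",
                    containers: [{
                        name: "worker",
                        // Placeholder image that ships both Node.js and the Pulumi CLI.
                        image: "pulumi/pulumi:latest",
                        command: ["/bin/sh", "-c"],
                        args: [
                            [
                                "git clone https://github.com/my-org/platform /work", // placeholder repo
                                "cd /work",
                                "npm install",
                                `pulumi stack select my-org/platform/${stackName}`,
                                "pulumi up --yes",
                            ].join(" && "),
                        ],
                        env: [{
                            name: "PULUMI_ACCESS_TOKEN",
                            valueFrom: { secretKeyRef: { name: "pulumi-api-secret", key: "accessToken" } },
                        }],
                        resources: {
                            // Each worker carries its own requests, so the spike is spread
                            // across the cluster instead of sized into one long-lived pod.
                            requests: { cpu: "500m", memory: "1Gi" },
                            limits: { memory: "2Gi" },
                        },
                    }],
                },
            },
        },
    });
}

// Example: the controller would call this for each stack that needs an update.
// stackWorkerJob("team-a");
```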
I would be glad to provide additional information, just let me know.
Affected area/feature