Restart of pulumi operator leader causes unreleased stack locks #368
Comments
Thanks for opening the issue @YuriiVarvynets. We do use the lock manager in the controller to make sure only one controller is running at a time: https://github.com/pulumi/pulumi-kubernetes-operator/blob/master/cmd/manager/main.go#L123-L130. Perhaps you will need to bump up the graceful shutdown timeout on your pods if your stack updates take a long time? Would you be able to dump some logs from the operator around the upgrade? You might also want to enable debug logging (--zap-level=debug) and provide the logs for the replacement pod.
Thank you, @viveklak for taking a look at this. I enabled debug, but it takes time to reproduce the issue.
How can I increase leaseDurationSeconds for this lock? The default of 15 seconds is too low for my use case.
I have 5 stacks with basic S3 apps: https://github.com/pulumi/examples/tree/master/aws-py-s3-folder
From the log with debug log level:
When this happens, the Pulumi backend lock may not be released.
Thanks for the details @YuriiVarvynets. Adding @squaremo for thoughts.
Yes, that is correct. I wouldn't consider myself a controller-runtime locking expert, but it appears that failing to renew might result in the graceful shutdown period being skipped: https://github.com/kubernetes-sigs/controller-runtime/blob/8da9760581ed4f43eee9c2f63764c1cbe0cd3104/pkg/manager/internal.go#L632 I can think of a couple of things we should probably do:
I see these factors:
The first item is a problem by itself, and the second makes it worse. But I am missing why the lease fails to be renewed. @YuriiVarvynets can you explain why "default of 15 seconds is too low for my use case", and if that's the cause of the failed lease renewal -- e.g., are you running the operator in an environment where making an update every 15 seconds is unrealistic?
To be clear, I think it would be OK to have a longer
Updating every 15 seconds is okay from an infrastructure/load point of view. Right now, I see the issue several times a week, which does not allow me to use the Pulumi operator in production.
Are there any estimates on resolving this issue?
Added to epic #586
Good news everyone, we just released a preview of Pulumi Kubernetes Operator v2. This new release has a whole-new architecture that uses pods as the execution environment, and effectively decouples the lifecycle of the operator from the stack's pod. Also, when a stack's pod is terminated (for whatever reason), we proactively unlock the stack. Please read the announcement blog post for more information. Would love to hear your feedback! Feel free to engage with us on the #kubernetes channel of the Pulumi Slack workspace.
What happened?
If the Pulumi operator pod is restarted for any reason (for example, during the operator upgrade process), it does not unlock a stack that was being processed during the restart.
Where
pulumi-kubernetes-operator-6677a05e-5955496654-8x2mh (pod does not exist anymore)
Configuration:
Steps to reproduce
Expected Behavior
Locks are released before pod restart.
Actual Behavior
Locks are NOT released and the next operator leader cannot perform any actions with Stacks.
Output of pulumi about
No response
Additional context
No response