Restart of pulumi operator leader causes unreleased stack locks #368
Comments
Thanks for opening the issue @YuriiVarvynets. We do use the lock manager in the controller to make sure only one controller is running at a time: https://github.com/pulumi/pulumi-kubernetes-operator/blob/master/cmd/manager/main.go#L123-L130. Perhaps you will need to bump up the graceful shutdown timeout on your pods if your stack updates take a long time? Would you be able to dump some logs from the operator around the upgrade? You might also want to enable debug logging (--zap-level=debug) and provide the logs for the replacement pod.
Thank you, @viveklak for taking a look at this. I enabled debug, but it takes time to reproduce the issue.
How can I increase leaseDurationSeconds for this lock? The default of 15 seconds is too low for my use case.
I have 5 stacks with basic S3 apps: https://github.com/pulumi/examples/tree/master/aws-py-s3-folder
From the log with debug log level:
When this happens, the Pulumi backend lock may not be released.
Thanks for the details @YuriiVarvynets. Adding @squaremo for thoughts.
Yes, that is correct. I wouldn't consider myself a controller-runtime locking expert, but it appears that failing to renew might result in the graceful shutdown period being skipped: https://github.com/kubernetes-sigs/controller-runtime/blob/8da9760581ed4f43eee9c2f63764c1cbe0cd3104/pkg/manager/internal.go#L632 I can think of a couple of things we should probably do:
I see these factors:
The first item is a problem by itself, and the second makes it worse. But I am missing why the lease fails to be renewed. @YuriiVarvynets can you explain why "default of 15 seconds is too low for my use case", and if that's the cause of the failed lease renewal -- e.g., are you running the operator in an environment where making an update every 15 seconds is unrealistic?
To be clear, I think it would be OK to have a longer
Updating every 15 seconds is okay from an infrastructure/load point of view. Right now, I see the issue several times a week, which does not allow me to use the Pulumi operator in production.
Are there any estimates on resolving this issue?
Added to epic #586
Good news everyone, we just released a preview of Pulumi Kubernetes Operator v2. This new release has a whole-new architecture that uses pods as the execution environment, and effectively decouples the lifecycle of the operator from the stack's pod. Also, when a stack's pod is terminated (for whatever reason), we proactively unlock the stack. Please read the announcement blog post for more information. Would love to hear your feedback! Feel free to engage with us on the #kubernetes channel of the Pulumi Slack workspace.
What happened?
If the Pulumi operator pod is restarted for any reason (for example, during the operator upgrade process), it does not unlock a stack that was being processed during the restart.
Where
pulumi-kubernetes-operator-6677a05e-5955496654-8x2mh (pod does not exist anymore)
Configuration:
Steps to reproduce
Expected Behavior
Locks are released before pod restart.
Actual Behavior
Locks are NOT released and the next operator leader cannot perform any actions with Stacks.
Output of pulumi about
No response
Additional context
No response