Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restart of pulumi operator leader causes unreleased stack locks #368

Closed
YuriiVarvynets opened this issue Nov 5, 2022 · 12 comments
Closed
Assignees
Labels
kind/bug Some behavior is incorrect or out of spec resolution/fixed This issue was fixed

Comments

@YuriiVarvynets
Copy link

YuriiVarvynets commented Nov 5, 2022

What happened?

If pulumi operator pod is restarted by any reason (for example, during pulumi operator upgrade process) it does not unlock the stack if it was processed during the restart.

error: the stack is currently locked by 1 lock(s). Either wait for the other process(es) to end or delete the lock file with `pulumi cancel`.
  azblob://state/.pulumi/locks/pulumi-operator.dev.global/cb307d24-6243-4d1e-9470-165b0aac381e.json: created by pulumi-kubernetes-operator@pulumi-kubernetes-operator-6677a05e-5955496654-8x2mh (pid 30553) at 2022-11-04T22:06:59Z

Where pulumi-kubernetes-operator-6677a05e-5955496654-8x2mh pod does not exist anymore.

Configuration:

  1. Custom state backend on Azure.
  2. Operator configuration same as deploy-operator-py
  3. Stack configuration:
def createStack(name, metadata, stackProjectRepo, repoDir, envRefs):
    k8s.apiextensions.CustomResource(
        resource_name = name,
        api_version = 'pulumi.com/v1',
        kind = 'Stack',
        metadata=metadata,
        spec = {
            "stack": name,
            "backend": "azblob://state",
            "projectRepo": stackProjectRepo,
            "repoDir": repoDir,
            "gitAuth": {
                "accessToken": {
                    "type": "Secret",
                    "secret": {
                            "name": "git-tkn",
                            "key": "accessToken",
                        },
                }
            },
            "branch": "refs/heads/master",
            "refresh": True,
            "continueResyncOnCommitMatch": True,
            "resyncFrequencySeconds": 120,
            "envRefs": envRefs,
            "destroyOnFinalize": False,
        },
    );

Steps to reproduce

  1. Deploy multiple Pulumi operator Stacks with:
           "refresh": True,
            "continueResyncOnCommitMatch": True,
            "resyncFrequencySeconds": 120,
  1. Restart leader pod (or upgrade version of operator) when stacks are processed and there is a lock in the backend.

Expected Behavior

Locks are released before pod restart.

Actual Behavior

Locks are NOT released and next operator leader can not perform any actions with Stacks.

Output of pulumi about

No response

Additional context

No response

Contributing

Vote on this issue by adding a 👍 reaction.
To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).

@YuriiVarvynets YuriiVarvynets added kind/bug Some behavior is incorrect or out of spec needs-triage Needs attention from the triage team labels Nov 5, 2022
@viveklak
Copy link
Contributor

viveklak commented Nov 7, 2022

Thanks for opening the issue @YuriiVarvynets. We do use the lock manager in the controller to make sure only one controller is running at a time: https://github.com/pulumi/pulumi-kubernetes-operator/blob/master/cmd/manager/main.go#L123-L130.

Perhaps you will need to bump up the graceful shutdown timeout on your pods if your stack updates take a long time?

Would you be able to dump some logs from the operator around the upgrade? You might also want to enable debug logging (--zap-level=debug) and provide the logs for the replacement pod.
cc @squaremo in case I am missing something obvious.

@viveklak viveklak added awaiting-feedback Blocked on input from the author and removed needs-triage Needs attention from the triage team labels Nov 7, 2022
@YuriiVarvynets
Copy link
Author

Thank you, @viveklak for taking a look at this. I enabled debug, but it takes time to reproduce the issue.

We do use the lock manager in the controller to make sure only one controller is running at a time: https://github.com/pulumi/pulumi-kubernetes-operator/blob/master/cmd/manager/main.go#L123-L130.

How can I increase leaseDurationSeconds for this lock. As default of 15 seconds is too low for my use case.

# kubectl get leases.coordination.k8s.io pulumi-kubernetes-operator-lock -o yaml -n pulumi
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  ...
spec:
  leaseDurationSeconds: 15
  ...

Perhaps you will need to bump up the graceful shutdown timeout on your pods if your stack updates take a long time?

I have 5 stacks with basic S3 apps: https://github.com/pulumi/examples/tree/master/aws-py-s3-folder
Graceful shutdown is set to default 5 minutes that should be enough for such a simple app to complete. However, if stack process time is higher than 5 minutes and with possible operator restart during this time we will get lock not released, am I right? What are other possible options to avoid such situation?

@YuriiVarvynets
Copy link
Author

From the log with debug log level:

E1114 11:27:21.821634       1 leaderelection.go:325] error retrieving resource lock pulumi/pulumi-kubernetes-operator-lock: Get "https://172.16.0.1:443/apis/coordination.k8s.io/v1/namespaces/pulumi/leases/pulumi-kubernetes-operator-lock": context deadline exceeded
I1114 11:27:21.821759       1 leaderelection.go:278] failed to renew lease pulumi/pulumi-kubernetes-operator-lock: timed out waiting for the condition
{"level":"debug","ts":"2022-11-14T11:27:21.821Z","logger":"controller-runtime.manager.events","msg":"Normal","object":{"kind":"ConfigMap","namespace":"pulumi","name":"pulumi-kubernetes-operator-lock","uid":"3df0dd70-932c-4a6e-ab8d-3ba09c372966","apiVersion":"v1","resourceVersion":"327159499"},"reason":"LeaderElection","message":"pulumi-kubernetes-operator-d5cea0e0-69d96cd79b-wtrvm_5a56dd4a-385d-4276-89f3-305db266afa6 stopped leading"}
{"level":"error","ts":"2022-11-14T11:27:21.821Z","logger":"cmd","msg":"Manager exited non-zero","error":"leader election lost","stacktrace":"main.main\n\t/home/runner/work/pulumi-kubernetes-operator/pulumi-kubernetes-operator/cmd/manager/main.go:179\nruntime.main\n\t/opt/hostedtoolcache/go/1.19.2/x64/src/runtime/proc.go:250"}
{"level":"debug","ts":"2022-11-14T11:27:21.821Z","logger":"controller-runtime.manager.events","msg":"Normal","object":{"kind":"Lease","apiVersion":"coordination.k8s.io/v1"},"reason":"LeaderElection","message":"pulumi-kubernetes-operator-d5cea0e0-69d96cd79b-wtrvm_5a56dd4a-385d-4276-89f3-305db266afa6 stopped leading"}
{"level":"info","ts":"2022-11-14T11:27:21.821Z","logger":"controller-runtime.manager.controller.stack-controller","msg":"Shutdown signal received, waiting for all workers to finish"}

When this happens pulumi backend lock may not be released.

@viveklak
Copy link
Contributor

viveklak commented Nov 15, 2022

Thanks for the details @YuriiVarvynets. Adding @squaremo for thoughts.

However, if stack process time is higher than 5 minutes and with possible operator restart during this time we will get lock not released, am I right?

Yes that is correct. I won't consider myself a controller-runtime locking expert, but it appears that failing to renew might result in the graceful shutdown period to be skipped? https://github.com/kubernetes-sigs/controller-runtime/blob/8da9760581ed4f43eee9c2f63764c1cbe0cd3104/pkg/manager/internal.go#L632

I can think of a couple of things we should probably do:

  1. add more control to lock manager parameters, e.g. leaseDurationSeconds
  2. Some sort of a control mechanism to run a pulumi cancel as a prerequisite for a stack if necessary? I do think this could be risky and we would need to consider some safety valves here.

@viveklak viveklak removed the awaiting-feedback Blocked on input from the author label Nov 15, 2022
@YuriiVarvynets
Copy link
Author

pulumi cancel is risky and should be optional. However, it make sense for me and for others as I was thinking of implementing some automatic tool that would clean up locked stacks anyway.

@squaremo
Copy link
Contributor

squaremo commented Nov 15, 2022

I see these factors:

  1. if the controller is shutting down, but a stack is still in progress after the graceful shutdown period has passed, the (Pulumi) backend keeps it locked; which means when it's revisited, it can't make progress
  2. failing to renew the lease will cause the operator to exit with no graceful shutdown period

The first item is a problem by itself, and the second makes it worse.

But I am missing why the lease fails to be renewed. @YuriiVarvynets can you explain why "default of 15 seconds is too low for my use case", and if that's the cause of the failed lease renewal -- e.g., are you running the operator in an environment where making an update every 15 seconds is unrealistic?

@squaremo
Copy link
Contributor

To be clear, I think it would be OK to have a longer leaseDurationSeconds. It means less sensitivity to failures (it takes longer for a failure to be detected and for another leader to take over), but since stacks typically deploy in O(minutes), it would not be unreasonable to also take minutes to recover.

@YuriiVarvynets
Copy link
Author

Update every 15 seconds is okay from infrastructure/load point of view. Right now, I see the issue several times during the week, which does not allow me to use pulumi operator in production.
If adding leaseDurationSeconds and pulumi cancel as configurable option will minimize the possibility of the issue it will be great.

@YuriiVarvynets
Copy link
Author

Are there any estimates on resolving this issue ?

@sundeepbhatia1989

This comment has been minimized.

@cleverguy25
Copy link

Added to epic #586

@EronWright
Copy link
Contributor

Good news everyone, we just release a preview of Pulumi Kubernetes Operator v2. This new release has a whole-new architecture that uses pods as the execution environment, and effectively decouples the lifecycle of the operator from the stack's pod. Also, when a stack's pod is terminated (for whatever reason), we proactively unlock the stack.

Please read the announcement blog post for more information:
https://www.pulumi.com/blog/pulumi-kubernetes-operator-2-0/

Would love to hear your feedback! Feel free to engage with us on the #kubernetes channel of the Pulumi Slack workspace.
cc @YuriiVarvynets @fsismondi

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Some behavior is incorrect or out of spec resolution/fixed This issue was fixed
Projects
No open projects
Status: Done
Development

No branches or pull requests

7 participants