Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Graceful termination of the operator #2662

Merged

Conversation

caseydavenport
Copy link
Member

@caseydavenport caseydavenport commented May 24, 2023

Description

Allow for graceful termination of the operator to occur when both the operator and the Installation resource have been deleted simultaneously.

This was causing issues such as the following:

I am not sure this is a perfect solution, but in very limited testing it seems to be doing the trick for me!

For PR author

  • Tests for change.
  • If changing pkg/apis/, run make gen-files
  • If changing versions, run make gen-versions

For PR reviewers

A note for code reviewers - all pull requests must have the following:

  • Milestone set according to targeted release.
  • Appropriate labels:
    • kind/bug if this is a bugfix.
    • kind/enhancement if this is a a new feature.
    • enterprise if this PR applies to Calico Enterprise only.

main.go Outdated Show resolved Hide resolved
Copy link
Contributor

@stevehipwell stevehipwell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a good solution. Would it make sense to add a customisable timeout?

@caseydavenport
Copy link
Member Author

Unfortunately, in testing of this it proves it's not a 100% solution.

The operator waits as expected, but helm happily goes ahead and deletes the clusterrole and binding, preventing the operator from actually doing its work to clean up.

I still need to try helm with --wait and --cascade=foreground though - those flags sound promising when combined with this.

@caseydavenport
Copy link
Member Author

Looks like even with foreground deletion, tigera-operator loses RBAC for the cluster before it can clean up after itself. So, we'll need to make sure the ClusterRole and Binding don't get deleted until the operator deployment itself is gone. I can't seem to find a way to do that in helm itself, so we probably need to resort to even more finalizers 😭

@caseydavenport
Copy link
Member Author

Probably what we should have done from the beginning is manage RBAC outside of helm (like the namespace, and CRDs). But that starts to mean a lot of yaml is managed outside of the chart.

@stevehipwell
Copy link
Contributor

@caseydavenport would finalizers work in this context and if so what would be the increased RBAC surface area?

Alternatively a pre delete Helm hook to delete the installation should work.

@caseydavenport
Copy link
Member Author

Alternatively a pre delete Helm hook to delete the installation should work.

Right, so this would be something along the lines of executing a Job (+ClusterRole +Binding) to delete the Installation prior to actually uninstalling the chart. That seems reasonable, I'll give that a try next.

@stevehipwell
Copy link
Contributor

Right, so this would be something along the lines of executing a Job (+ClusterRole +Binding) to delete the Installation prior to actually uninstalling the chart. That seems reasonable, I'll give that a try next.

You can probably use the RBAC from the chart as a pre-delete hook would be run before they're removed.

@caseydavenport
Copy link
Member Author

It wasn't quite as trivial as I thought it might be, but I did get a pre-delete hook working for a basic Calico + apiserver install, in combination with this PR: projectcalico/calico#7859

One thing with this is that helm won't proceed cleaning up its CRs until the pre-delete hook is complete, and we don't want the pre-delete hook to completed until we're sure the Installation object has been dealt with.

So, my solution is to:

  • have the pre-delete hook delete both the Installation and APIServer CRs, and wait for the Installation to disappear.
  • have the finalizer on the Installation gated on the cleanup of the apiserver pods

@caseydavenport
Copy link
Member Author

have the finalizer on the Installation gated on the cleanup of the apiserver pods

Perhaps a smarter approach would be to have the apiserver controller put its own finalizer on the Installation (and any other controllers that depend on the Installation existing in order to perform their cleanup)

@danudey danudey modified the milestones: v1.31.0, v1.31.1, v1.31.2 Aug 31, 2023

// We need to wait for termination to complete. We can do this by checking if the Installation
// resource has been cleaned up or not.
to := 60 * time.Second
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we set terminationGracePeriodSeconds to 60s to match this since the default is 30s?

Copy link
Member

@tmjd tmjd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. One question about a change we might want to make to the operator manifest.

Copy link
Member

@tmjd tmjd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@caseydavenport caseydavenport merged commit c3d7b3a into tigera:master Oct 9, 2023
5 checks passed
@caseydavenport caseydavenport deleted the casey-graceful-termination branch October 9, 2023 18:42
@joaobettu
Copy link

Any news about this issue?

@tmjd
Copy link
Member

tmjd commented Apr 26, 2024

@joaobettu hopefully you've seen the the messages on the issues this is addressing. If you're looking for some info specific to this PR please be more specific about what you're looking for.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants