
Fix goroutine leak when reconciling #539

Merged
merged 1 commit into from
Aug 22, 2024

Conversation

IxDay
Contributor

@IxDay IxDay commented Jun 10, 2024

This patch fixes the propagation of context cancellation through the call stack. It prevents leaks of channels and goroutines from the Terraform provider.

Description of your changes

To fix this bug, we tracked down the leak to the underlying Terraform provider and, using pprof, managed to isolate this function: provider code.
By adding pprof to our deployment, we noticed the creation of two channels and two goroutines per resource every time reconciliation kicks in. All of the never-closing goroutines had the same stack trace:

[screenshot: goroutine stack trace, blocked waiting on the parent context's Done channel]

As we can see, the goroutine is waiting for the Done channel of the parent "process" to close. However, in the controller bootstrapping we override the parent context with a context.WithoutCancel context. The implementation in the source code shows that the Done channel is then nil, and a nil channel never closes, so the goroutine blocks forever, as showcased in this playground demo.
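
To make the failure mode concrete, here is a minimal, standalone sketch (illustrative only, not the provider's code) of why a goroutine waiting on the Done channel of a context.WithoutCancel context can never be unblocked:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	parent, cancel := context.WithCancel(context.Background())
	detached := context.WithoutCancel(parent) // Done() returns nil for this context

	go func() {
		<-detached.Done() // receiving from a nil channel blocks forever
		fmt.Println("never reached")
	}()

	cancel() // cancelling the parent does not propagate to the detached context
	time.Sleep(100 * time.Millisecond)
	fmt.Println("the goroutine above is still blocked and has leaked")
}
```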

Fixes #538
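
For anyone wanting to reproduce this kind of observation, below is a minimal sketch of pprof instrumentation (the port and setup are illustrative, not the provider's actual deployment); with it exposed, the goroutine profile shows the accumulating stacks:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// Inspect live goroutine stacks with:
	//   go tool pprof http://localhost:6060/debug/pprof/goroutine
	// or dump them in full with:
	//   curl 'http://localhost:6060/debug/pprof/goroutine?debug=2'
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```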

I have:

  • Read and followed Crossplane's contribution process.
  • Run make reviewable to ensure this PR is ready for review.
  • Added backport release-x.y labels to auto-backport this PR if necessary.

How has this code been tested

This patch fixes the propagation of context cancellation through the
call stack. It prevents leaks of channels and goroutines from the
[terraform provider][provider_code].

Fixes: crossplane-contrib#538

[provider_code]: https://github.com/hashicorp/terraform-provider-google/blob/1d1a50adf64af60815b7a08ffc5e9d3e856d2e9c/google/transport/batcher.go#L117-L123

Signed-off-by: Maxime Vidori <[email protected]>
@mergenci
Collaborator

Thanks for the fix @IxDay 🙏 @erhancagirici reminded me that we had used context.WithoutCancel() because propagating ctx would cause early cancellation issues. Have you observed such issues? I'm not sure if they were easily observable, by the way.
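
To illustrate the tradeoff being discussed (a hypothetical sketch, not the provider's actual code): propagating the reconcile context can abort in-flight API calls when reconciliation gives up early, while context.WithoutCancel lets them finish but leaves anything blocked on its nil Done channel stuck forever:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// apiCall stands in for a cloud API request that respects context cancellation.
func apiCall(ctx context.Context) error {
	select {
	case <-time.After(200 * time.Millisecond): // simulated network latency
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	reconcileCtx, cancel := context.WithCancel(context.Background())

	// Simulate the reconcile loop giving up early (e.g. requeue or shutdown).
	go func() {
		time.Sleep(50 * time.Millisecond)
		cancel()
	}()

	// Propagating the reconcile context: the in-flight call is aborted early.
	if err := apiCall(reconcileCtx); errors.Is(err, context.Canceled) {
		fmt.Println("propagated ctx: call canceled early")
	}

	// Detaching with WithoutCancel: the call completes, but any goroutine
	// blocked on detached.Done() (a nil channel) can never be released.
	detached := context.WithoutCancel(reconcileCtx)
	if err := apiCall(detached); err == nil {
		fmt.Println("detached ctx: call completed despite cancellation")
	}
}
```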

@IxDay
Contributor Author

IxDay commented Jun 12, 2024

Yes, I was editing my message to ask whether there was a reason for choosing this; it was not obvious from the code, history, or comments.
Do you have any idea how those early cancellation events would materialize? For the moment we are monitoring our claims via their "Readiness" status, which was flaky when memory was reaching the limit (the underlying provider was not able to issue HTTP calls due to garbage-collection pressure). However, since rolling out the patch, we have not observed any de-sync yet.
To give you an idea of the timeframe, we rolled the change out on Monday to ~150 claims (each composed of ~2 Crossplane GCP provider objects), then to ~170 more yesterday. So far, we have not noticed any unexpected behavior.
But once again, we might not be tracking the right thing here, so if you have more details, I would be happy to investigate.

@IxDay
Contributor Author

IxDay commented Jun 24, 2024

We are starting to roll this out across our entire infrastructure and still have not noticed any issues. I am bumping this thread to move things forward.
Regarding my previous message, do you have any insight to share on how to reproduce the earlier bug that motivated the introduction of WithoutCancel?

@mergenci
Collaborator

mergenci commented Jul 5, 2024

@IxDay, I've been working on this issue from time to time. The fact that you haven't had any issues is great news. I've scheduled a meeting with a team member next week. They have more experience in the problem. Having a memory leak greatly disturbs me. Now that I've addressed some of my urgent tasks, this issue will be at the top of my priority list.

@mergenci
Collaborator

Thank you @ulucinar for providing background information off-channel. We hypothesized that the implementation was ported from the Azure provider, which had context cancellation issues when the context was propagated. It is likely that the GCP provider would never have had any issues in the first place, even if the context had been propagated.

We will run some simple tests on our side. If all goes well, we will merge.
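
As an illustration of the sort of check that can catch regressions like this one (not necessarily the tests the maintainers ran), a goroutine-leak assertion with go.uber.org/goleak might look like:

```go
package leakcheck_test

import (
	"context"
	"testing"

	"go.uber.org/goleak"
)

// startWorker is a stand-in for code that spawns a goroutine blocked on ctx.Done().
func startWorker(ctx context.Context) {
	go func() {
		<-ctx.Done() // never returns if Done() is nil (e.g. a context.WithoutCancel context)
	}()
}

func TestNoGoroutineLeak(t *testing.T) {
	defer goleak.VerifyNone(t) // fails the test if extra goroutines are still running

	ctx, cancel := context.WithCancel(context.Background())
	startWorker(ctx)
	cancel() // with a propagated, cancellable context the worker exits and the test passes
}
```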

@IxDay
Contributor Author

IxDay commented Jul 30, 2024

Let me know if you need anything. We are really looking forward to this PR being merged.

@turkenf
Collaborator

turkenf commented Jul 30, 2024

/test-examples="examples/container/v1beta2/cluster.yaml"

@turkenf
Collaborator

turkenf commented Jul 30, 2024

/test-examples="examples/cloudplatform/v1beta1/serviceaccount.yaml"

@turkenf
Collaborator

turkenf commented Jul 30, 2024

/test-examples="examples/storage/v1beta2/bucket.yaml"

Collaborator

@turkenf turkenf left a comment


Many thanks for your effort in this PR @IxDay, LGTM.

Collaborator

@mergenci mergenci left a comment


Thanks @IxDay for your work. Your well-prepared reports made our jobs a lot easier. Most importantly, thank you for your patience 🙏 We wish we had been able to merge this important PR sooner.

@mergenci mergenci merged commit cbc7f26 into crossplane-contrib:main Aug 22, 2024
12 checks passed
@turkenf
Collaborator

turkenf commented Sep 9, 2024

Hi @IxDay,

After this change, we encountered a regression when creating and deleting some resources, so we reverted it. Please see the related issue and PR.

We would be pleased if you could work on resolving this issue and create a new PR to address the regression.

@IxDay
Contributor Author

IxDay commented Sep 10, 2024

I will take a look, but I will need a few days first, since I have other priorities at the moment.

Development

Successfully merging this pull request may close these issues.

[Bug]: Memory leak on all providers
3 participants