
Add watch request timeout to prevent watch request hang #5732

Open
xigang wants to merge 1 commit into master from bugfix/watch
Conversation

@xigang (Member) commented Oct 23, 2024

What type of PR is this?
/kind bug

What this PR does / why we need it:
When the federated apiserver's watch request to a member cluster gets stuck, the watch request from the federated client gets stuck as well.

Which issue(s) this PR fixes:
Fixes #5672

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE

@karmada-bot karmada-bot added the kind/bug Categorizes issue or PR as related to a bug. label Oct 23, 2024
@karmada-bot (Collaborator):

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign ikaven1024 for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@karmada-bot karmada-bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Oct 23, 2024
@xigang changed the title from "Add timeout for watch requests to member clusters to prevent request …" to "Add watch request timeout to prevent watch request hang" on Oct 23, 2024
@codecov-commenter commented Oct 23, 2024

⚠️ Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

Attention: Patch coverage is 48.14815% with 14 lines in your changes missing coverage. Please review.

Project coverage is 41.58%. Comparing base (331145f) to head (b47f1d6).
Report is 35 commits behind head on master.

Files with missing lines                        Patch %   Lines
pkg/search/proxy/store/multi_cluster_cache.go   48.14%    13 Missing and 1 partial ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5732      +/-   ##
==========================================
+ Coverage   40.90%   41.58%   +0.67%     
==========================================
  Files         650      655       +5     
  Lines       55182    55773     +591     
==========================================
+ Hits        22573    23191     +618     
+ Misses      31171    31076      -95     
- Partials     1438     1506      +68     
Flag        Coverage Δ
unittests   41.58% <48.14%> (+0.67%) ⬆️

Flags with carried forward coverage won't be shown.


@xigang (Member, Author) commented Oct 23, 2024

    return nil, err
case <-time.After(30 * time.Second):
    // If the watch request times out, return an error, and the client will retry.
    return nil, fmt.Errorf("timeout waiting for watch for resource %v in cluster %q", gvr.String(), cluster)
@zhzhuang-zju (Contributor):

@xigang Hi, if a watch request hangs and causes a timeout, will the hanging watch request continue to exist in its goroutine?

@xigang (Member, Author):

@zhzhuang-zju Yes, there is this issue. When a watch request times out, the goroutine needs to be terminated.

Member:

Good point! In that case we have to cancel the context passed to cache.Watch().

Member:

So this patch intends to terminate the hang by returning an error after a period of time. Is that the idea?

@RainbowMango (Member):

Another question: before starting the watch, we try to get the cache of that cluster. I'm curious why this cache still exists even after the cluster is gone. Do we have a chance to clean up the cache?

cache := c.cacheForClusterResource(cluster, gvr)
if cache == nil {
    continue
}

@xigang (Member, Author):

> Another question: before starting the watch, we try to get the cache of that cluster. I'm curious why this cache still exists even after the cluster is gone. Do we have a chance to clean up the cache?

@RainbowMango When the member cluster goes offline but its Cluster resource in the control plane is not deleted, the offline cluster cannot be removed from the ResourceRegistry, so its resource cache is retained for a short time.

@xigang (Member, Author) commented Oct 27, 2024:

> @xigang Hi, if a watch request hangs and causes a timeout, will the hanging watch request continue to exist in its goroutine?

@RainbowMango @zhzhuang-zju Fixed, please take a look.

@xigang force-pushed the bugfix/watch branch 2 times, most recently from 21c10d3 to 5a55ca2, on October 27, 2024 02:50
@xigang (Member, Author) commented Oct 28, 2024

/retest

return nil, err
case <-time.After(30 * time.Second):
@ikaven1024 (Member):

It seems we wait 30s for each cluster. Should we wait for all clusters in parallel?

@xigang (Member, Author) commented Oct 29, 2024:

> It seems we wait 30s for each cluster. Should we wait for all clusters in parallel?

@ikaven1024 There's no issue here; as soon as a single cache.Watch times out, the Watch request returns an error and ends. 😄

@ikaven1024 (Member):

> It seems we wait 30s for each cluster. Should we wait for all clusters in parallel?
>
> @ikaven1024 There's no issue here; as soon as a single cache.Watch times out, the Watch request returns an error and ends. 😄

But if creating the watch for every cluster takes 20s (not a timeout), the total time spent is 20s * N.

@xigang (Member, Author):

> It seems we wait 30s for each cluster. Should we wait for all clusters in parallel?
>
> @ikaven1024 There's no issue here; as soon as a single cache.Watch times out, the Watch request returns an error and ends. 😄
>
> But if creating the watch for every cluster takes 20s (not a timeout), the total time spent is 20s * N.

@ikaven1024 I understand, but the code in the current release also traverses all the clusters to create watches, which takes N*M time.
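
For illustration of the parallel setup being discussed, a rough sketch of starting the per-cluster watches concurrently; this is not part of the PR, and the helper names are hypothetical:

package sketch

import (
    "context"
    "sync"

    "k8s.io/apimachinery/pkg/watch"
)

// startWatch stands in for the per-cluster "create a watcher with a timeout" step.
type startWatch func(ctx context.Context, cluster string) (watch.Interface, error)

// watchAllClusters starts the per-cluster watches concurrently, so total setup time
// is bounded by the slowest cluster rather than the sum over all clusters.
func watchAllClusters(ctx context.Context, clusters []string, start startWatch) (map[string]watch.Interface, error) {
    var (
        mu       sync.Mutex
        wg       sync.WaitGroup
        firstErr error
    )
    watchers := make(map[string]watch.Interface, len(clusters))

    for _, cluster := range clusters {
        wg.Add(1)
        go func(cluster string) {
            defer wg.Done()
            w, err := start(ctx, cluster)
            mu.Lock()
            defer mu.Unlock()
            if err != nil {
                if firstErr == nil {
                    firstErr = err
                }
                return
            }
            watchers[cluster] = w
        }(cluster)
    }
    wg.Wait()

    if firstErr != nil {
        // Stop any watchers that did start before reporting the failure.
        for _, w := range watchers {
            w.Stop()
        }
        return nil, firstErr
    }
    return watchers, nil
}

Whether this is worth the extra complexity is exactly the trade-off discussed above, since with the PR's per-cluster timeout a single stuck cluster already turns into an error within 30s.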

errChan := make(chan error, 1)

go func(cluster string) {
    w, err := cache.Watch(ctx, options)
Member:

If this watcher is only created after the 30s timeout, there seems to be no way to stop it. Is that a leak?

@xigang (Member, Author) commented Oct 28, 2024:

If creating the watcher takes more than 30 seconds, the time.After branch fires, an error is returned, and cancel is called to stop the watcher goroutine.
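
For reference, a minimal sketch of this timeout-plus-cancel pattern; the function and interface names below are illustrative assumptions, not the exact PR code:

package sketch

import (
    "context"
    "fmt"
    "time"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/runtime/schema"
    "k8s.io/apimachinery/pkg/watch"
)

// cacheWatcher stands in for the per-cluster cache; only the Watch call matters here.
type cacheWatcher interface {
    Watch(ctx context.Context, options *metav1.ListOptions) (watch.Interface, error)
}

// watchWithTimeout starts cache.Watch in a goroutine and gives up after 30s,
// cancelling the derived context so the goroutine's Watch call can return
// instead of hanging forever.
func watchWithTimeout(ctx context.Context, cache cacheWatcher, gvr schema.GroupVersionResource, cluster string, options *metav1.ListOptions) (watch.Interface, context.CancelFunc, error) {
    ctx, cancel := context.WithCancel(ctx)

    type result struct {
        w   watch.Interface
        err error
    }
    // Buffered so the goroutine never blocks on send, even after a timeout.
    resultChan := make(chan result, 1)

    go func() {
        w, err := cache.Watch(ctx, options)
        resultChan <- result{w: w, err: err}
    }()

    select {
    case r := <-resultChan:
        if r.err != nil {
            cancel()
            return nil, nil, r.err
        }
        // Success: the caller keeps cancel and calls it when the watcher is stopped.
        return r.w, cancel, nil
    case <-time.After(30 * time.Second):
        // The watch did not start in time: cancel so the goroutine can unwind,
        // and return an error so the client retries.
        cancel()
        return nil, nil, fmt.Errorf("timeout waiting for watch for resource %v in cluster %q", gvr.String(), cluster)
    }
}

The buffered result channel lets the goroutine always hand back its result; the remaining requirement, as discussed above, is that cache.Watch itself honors the cancelled context.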

Member:

ok

@xigang (Member, Author) commented Nov 6, 2024

@RainbowMango @ikaven1024 @zhzhuang-zju Our production environment is running without issues; could you take a look to see if the code can be merged?

@zhzhuang-zju (Contributor) commented Nov 7, 2024

ctx, cancel := context.WithTimeout(ctx, timeout)
defer func() { cancel() }()
watcher, err := rw.Watch(ctx, &opts)
if err != nil {
    scope.err(err, w, req)
    return
}

@xigang The context used when establishing a watch request on the cache also has a timeout. Can we directly utilize this context to prevent the watch request from hanging?

timeout := time.Duration(0)
if opts.TimeoutSeconds != nil {
    timeout = time.Duration(*opts.TimeoutSeconds) * time.Second
}
if timeout == 0 && minRequestTimeout > 0 {
    timeout = time.Duration(float64(minRequestTimeout) * (rand.Float64() + 1.0))
}

proxyCtl, err = proxy.NewController(proxy.NewControllerOption{
    RestConfig:        serverConfig.ClientConfig,
    RestMapper:        restMapper,
    KubeFactory:       serverConfig.SharedInformerFactory,
    KarmadaFactory:    factory,
    MinRequestTimeout: time.Second * time.Duration(serverConfig.Config.MinRequestTimeout),

The timeout for the context is actually initialized when setting up search.Config, with a value of 1800 seconds. It might be worth considering setting this to 30 seconds instead.

This context covers the execution time of multiple functions, not only the watch.
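
For reference, a small sketch restating the quoted derivation: when the client does not set TimeoutSeconds and minRequestTimeout keeps its 1800-second default, the request context deadline lands somewhere between 30 and 60 minutes, which is why it does not quickly bound a hung per-cluster watch.

package sketch

import (
    "math/rand"
    "time"
)

// effectiveWatchTimeout mirrors the quoted apiserver logic: a client-supplied
// TimeoutSeconds wins; otherwise minRequestTimeout is scaled by a random factor
// in [1.0, 2.0), so the 1800s default yields a 30-60 minute request context.
func effectiveWatchTimeout(timeoutSeconds *int64, minRequestTimeout time.Duration) time.Duration {
    timeout := time.Duration(0)
    if timeoutSeconds != nil {
        timeout = time.Duration(*timeoutSeconds) * time.Second
    }
    if timeout == 0 && minRequestTimeout > 0 {
        timeout = time.Duration(float64(minRequestTimeout) * (rand.Float64() + 1.0))
    }
    return timeout
}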

Comment on lines +346 to +349
select {
case errChan <- fmt.Errorf("failed to start watch for resource %v in cluster %q: %v", gvr.String(), cluster, err):
case <-ctx.Done():
}
Contributor:

Can case <-ctx.Done() be reached? And is it necessary to respond to the ctx.Done event?

@xigang (Member, Author):

The main purpose here is to handle edge cases. There won’t be any issues even if there is no case <-ctx.Done().

Contributor:

If removing the case <-ctx.Done() statement does not affect the actual functionality, it is recommended to remove it to improve code readability.

@xigang (Member, Author) commented Nov 7, 2024

> @xigang The context used when establishing a watch request on the cache also has a timeout. Can we directly utilize this context to prevent the watch request from hanging?
>
> The timeout for the context is actually initialized when setting up search.Config, with a value of 1800 seconds. It might be worth considering setting this to 30 seconds instead.
>
> This context covers the execution time of multiple functions, not only the watch.

@zhzhuang-zju The timeout for client-go watch requests is around 10 minutes. The default value of minRequestTimeout is 1800 seconds, and if the client specifies a timeout, minRequestTimeout is not used. What issues might arise if this context were used for both the federated watch and all the member clusters' watches?
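
As background on the client-side knob mentioned here, a minimal client-go sketch (a hypothetical helper, not part of this PR) in which the client bounds its own watch request by setting TimeoutSeconds:

package sketch

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// watchPodsWithTimeout shows a client explicitly bounding its watch request.
// When TimeoutSeconds is set, the server uses it instead of minRequestTimeout.
func watchPodsWithTimeout(ctx context.Context, client kubernetes.Interface, namespace string) error {
    timeout := int64(600) // an explicit bound, in the spirit of the ~10 minute client-side default mentioned above
    w, err := client.CoreV1().Pods(namespace).Watch(ctx, metav1.ListOptions{
        TimeoutSeconds: &timeout,
    })
    if err != nil {
        return err
    }
    defer w.Stop()
    for event := range w.ResultChan() {
        _ = event // handle watch events here
    }
    return nil
}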

@RainbowMango (Member):

Can someone tell me how to reproduce this issue? I'd like to reproduce and verify the patch on my side.

@zhzhuang-zju (Contributor):

> @zhzhuang-zju What issues might arise if this context were used for both the federated watch and all the member clusters' watches?

Sorry, please ignore that comment. Initially I thought about setting the context timeout to 30 seconds to solve the watch-request hang, but then I realized that approach was not appropriate, so I struck it out.

@xigang (Member, Author) commented Nov 7, 2024

> Can someone tell me how to reproduce this issue? I'd like to reproduce and verify the patch on my side.

In our scenario, a member cluster is taken offline and all nodes in the cluster are shut down. At this point, the cluster name still exists in the ResourceRegistry, so the Pod cacher also remains. Additionally, while the cluster is offline, long-lived client watch connections continue to hit the federated apiserver.

As a result, the client’s Watch requests end up hanging.

Labels
kind/bug Categorizes issue or PR as related to a bug.
size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Watch Request Blocked When Member Cluster Offline
6 participants