Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

xdsclient: update watcher API as per gRFC A88 #7977

Open
wants to merge 12 commits into
base: master
Choose a base branch
from

Conversation

purnesh42H
Copy link
Contributor

@purnesh42H purnesh42H commented Jan 2, 2025

This is the first part of implementing gRFC A88 (grpc/proposal#466).

This introduces the new watcher API but does not change any of the existing behavior. This table summarizes the API changes and behavior for each case:

Case Old API New API Behavior
Resource timer fires OnResourceDoesNotExist() OnResourceChanged(NOT_FOUND) Fail data plane RPCs
LDS or CDS resource deletion OnResourceDoesNotExist() OnResourceChanged(NOT_FOUND) Drop resource and fail data plane RPCs
xDS channel reports TRANSIENT_FAILURE OnError() OnResourceChanged(status) if resource NOT already cached; OnAmbientError(status) otherwise Continue using cached resource, if any; otherwise, fail data plane RPCs
ADS stream terminates without receiving a response OnError() OnResourceChanged(status) if resource NOT already cached; OnAmbientError(status) otherwise Continue using cached resource, if any; otherwise, fail data plane RPCs
Invalid resource update (client NACK) OnError() OnResourceChanged(status) if resource NOT already cached; OnAmbientError(status) otherwise Continue using cached resource, if any; otherwise, fail data plane RPCs
Valid resource update OnUpdate(resource) OnResourceChanged(resource) use the new resource

RELEASE NOTES:

  • xds: TBD

@purnesh42H purnesh42H added Type: Feature New features or improvements in behavior Area: xDS Includes everything xDS related, including LB policies used with xDS. labels Jan 2, 2025
@purnesh42H purnesh42H added this to the 1.70 Release milestone Jan 2, 2025
@purnesh42H purnesh42H self-assigned this Jan 2, 2025
@purnesh42H purnesh42H force-pushed the a88-watcher-api branch 7 times, most recently from 8d198b7 to a9b45c0 Compare January 3, 2025 16:28
@purnesh42H purnesh42H requested a review from markdroth January 3, 2025 16:58
@purnesh42H purnesh42H assigned dfawley and markdroth and unassigned purnesh42H Jan 3, 2025
@purnesh42H purnesh42H requested a review from dfawley January 3, 2025 16:59
@purnesh42H purnesh42H force-pushed the a88-watcher-api branch 2 times, most recently from 4009e3e to 57dbf23 Compare January 6, 2025 12:03
Copy link

codecov bot commented Jan 6, 2025

Codecov Report

Attention: Patch coverage is 88.80597% with 15 lines in your changes missing coverage. Please review.

Project coverage is 82.00%. Comparing base (6819ed7) to head (479b7d3).

Files with missing lines Patch % Lines
xds/internal/server/listener_wrapper.go 47.05% 7 Missing and 2 partials ⚠️
.../balancer/clusterresolver/resource_resolver_eds.go 82.35% 2 Missing and 1 partial ⚠️
xds/internal/resolver/xds_resolver.go 75.00% 2 Missing ⚠️
xds/internal/testutils/resource_watcher.go 85.71% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7977      +/-   ##
==========================================
- Coverage   82.02%   82.00%   -0.02%     
==========================================
  Files         410      410              
  Lines       40233    40250      +17     
==========================================
+ Hits        33000    33007       +7     
- Misses       5865     5872       +7     
- Partials     1368     1371       +3     
Files with missing lines Coverage Δ
internal/xds/bootstrap/bootstrap.go 65.72% <ø> (ø)
xds/internal/balancer/cdsbalancer/cdsbalancer.go 84.59% <100.00%> (-0.04%) ⬇️
...s/internal/balancer/cdsbalancer/cluster_watcher.go 100.00% <100.00%> (ø)
...rnal/balancer/clusterresolver/resource_resolver.go 94.44% <100.00%> (ø)
xds/internal/resolver/watch_service.go 100.00% <100.00%> (+8.57%) ⬆️
xds/internal/server/rds_handler.go 86.81% <100.00%> (-0.29%) ⬇️
xds/internal/xdsclient/authority.go 80.25% <100.00%> (+0.47%) ⬆️
xds/internal/xdsclient/clientimpl_watchers.go 84.81% <100.00%> (+0.39%) ⬆️
...nal/xdsclient/xdsresource/cluster_resource_type.go 77.14% <100.00%> (ø)
...l/xdsclient/xdsresource/endpoints_resource_type.go 75.00% <100.00%> (ø)
... and 7 more

... and 19 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@purnesh42H purnesh42H force-pushed the a88-watcher-api branch 4 times, most recently from 43c9adb to 89f475a Compare January 6, 2025 12:26
@arjan-bal arjan-bal modified the milestones: 1.71 Release, 1.72 Release Feb 19, 2025
@purnesh42H
Copy link
Contributor Author

@easwars I have made the changes to match the final a88 watcher interface ResourceChanged, ResourceError and AmbientError including the conditions to call ResourceError or AmbientError based on cache.

Everything else is working fine but looks like this TestServeAndCloseDoNotRace is causing race https://github.com/grpc/grpc-go/actions/runs/14091957805/job/39470559995?pr=7977. Server is stopping in the middle of the test. I can't seem to figure how the watcher change is affecting this. Could you help take a look? Thanks.

@purnesh42H
Copy link
Contributor Author

purnesh42H commented Mar 26, 2025

@easwars I have made the changes to match the final a88 watcher interface ResourceChanged, ResourceError and AmbientError including the conditions to call ResourceError or AmbientError based on cache.

Everything else is working fine but looks like this TestServeAndCloseDoNotRace is causing race https://github.com/grpc/grpc-go/actions/runs/14091957805/job/39470559995?pr=7977. Server is stopping in the middle of the test. I can't seem to figure how the watcher change is affecting this. Could you help take a look? Thanks.

Ah, look like it was failing because of calling ResourceError when channel not found in authority.go which was calling ListenerWrapper onLDSResourceError which was switching serving mode. Seems like that's unexpected behavior? Calling AmbientError fixed it. Is it correct to consider channel not found to be AmbientError while registering the watcher?

@purnesh42H purnesh42H requested a review from easwars March 26, 2025 20:41
@purnesh42H purnesh42H assigned easwars and unassigned purnesh42H Mar 26, 2025
@purnesh42H
Copy link
Contributor Author

@easwars I have made the changes to match the final a88 watcher interface ResourceChanged, ResourceError and AmbientError including the conditions to call ResourceError or AmbientError based on cache.
Everything else is working fine but looks like this TestServeAndCloseDoNotRace is causing race https://github.com/grpc/grpc-go/actions/runs/14091957805/job/39470559995?pr=7977. Server is stopping in the middle of the test. I can't seem to figure how the watcher change is affecting this. Could you help take a look? Thanks.

Ah, look like it was failing because of calling ResourceError when channel not found in authority.go which was calling ListenerWrapper onLDSResourceError which was switching serving mode. Seems like that's unexpected behavior? Calling AmbientError fixed it. Is it correct to consider channel not found to be AmbientError while registering the watcher?

Actually, I thought over it and right thing to do here is handle this case in listener wrapper. Like previously done, we handle lds error only in case of resource-not-found. Let me know if that's okay.

@markdroth
Copy link
Member

Ah, look like it was failing because of calling ResourceError when channel not found in authority.go which was calling ListenerWrapper onLDSResourceError which was switching serving mode. Seems like that's unexpected behavior? Calling AmbientError fixed it. Is it correct to consider channel not found to be AmbientError while registering the watcher?

If we start a watch for a resource and the authority of that resource is not found in the bootstrap config, then the right behavior is to report a non-ambient error to the watcher.

In general, we should never call OnAmbientError() for a watcher before we call OnResourceChanged(). If we get a transient error before we've seen a valid resource, then the transient error should be reported to OnResourceChanged() instead of OnAmbientError().

@purnesh42H
Copy link
Contributor Author

Ah, look like it was failing because of calling ResourceError when channel not found in authority.go which was calling ListenerWrapper onLDSResourceError which was switching serving mode. Seems like that's unexpected behavior? Calling AmbientError fixed it. Is it correct to consider channel not found to be AmbientError while registering the watcher?

If we start a watch for a resource and the authority of that resource is not found in the bootstrap config, then the right behavior is to report a non-ambient error to the watcher.

In general, we should never call OnAmbientError() for a watcher before we call OnResourceChanged(). If we get a transient error before we've seen a valid resource, then the transient error should be reported to OnResourceChanged() instead of OnAmbientError().

yeah handled it in ListenerWrapper. Still calling ResourceError for channel not found

@markdroth
Copy link
Member

yeah handled it in ListenerWrapper

I haven't looked at the code, but that seems wrong to me. There should not need to be any special handling for individual resource types -- the logic for this in XdsClient should be completely resource-type-agnostic.

The only thing that should be handled in individual resource types are things that are different from other resource types, like determining what resource type to pass to the watcher.

@purnesh42H
Copy link
Contributor Author

yeah handled it in ListenerWrapper

I haven't looked at the code, but that seems wrong to me. There should not need to be any special handling for individual resource types -- the logic for this in XdsClient should be completely resource-type-agnostic.

The only thing that should be handled in individual resource types are things that are different from other resource types, like determining what resource type to pass to the watcher.

This error for channel not found while registering watch https://github.com/grpc/grpc-go/pull/7977/files#diff-b4a775482793516a8ed23bd6a58a805d316f5abb6468580691c2b0556e22606aR619 was recently added. Listener wrapper for grpc xds server had a logic to stop serving only if error is of type ResourceNotFound so i just kept that instead of stopping always https://github.com/grpc/grpc-go/pull/7977/files#diff-e4706c72ae912399b7f8ee6f04cec2374ef7a7679b12358f201ddb0b45e34146R453-R457. If we don't do that TestServeAndCloseDoNotRace is causing the race because we are creating and stopping grpc xds servers in a loop . @easwars do you think we should rewrite this test?

@markdroth
Copy link
Member

Listener wrapper for grpc xds server had a logic to stop serving only if error is of type ResourceNotFound so i just kept that instead of stopping always https://github.com/grpc/grpc-go/pull/7977/files#diff-e4706c72ae912399b7f8ee6f04cec2374ef7a7679b12358f201ddb0b45e34146R453-R457.

I don't think that check is correct -- it should be removed. If the XdsClient delivers an error, it should always be passed through to the watcher.

That's an inherent part of the API change we're making in this PR. The errors that we were previously ignoring in OnError() are now being passed to OnAmbientError() instead. If you are seeing an error reported to OnError(), that means we actually want to drop the resource we saw previously, if any.

@easwars
Copy link
Contributor

easwars commented Mar 27, 2025

Yes, I agree that we should not have checks for "resource-not-found" errors in the individual watchers. In fact, I'm thinking we might not even need those error types like xdsresource.ErrorTypeResourceNotFound anymore. If ResourceError is invoked, then the watcher must stop using the previous configuration and if AmbientError is invoked, it should continue using the previous configuration and the xDS client should guarantee that AmbientError is only invoked when a previous copy of the resource was delivered via ResourceChanged.

@@ -813,7 +813,7 @@ func (s) TestResolverError(t *testing.T) {

// Grab the wrapped connection from the listener wrapper. This will be used
// to verify the connection is closed.
ctx, cancel := context.WithTimeout(context.Background(), defaultTestTimeout)
ctx, cancel := context.WithTimeout(context.Background(), defaultTestTimeout*100000)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please undo this change.

Just FYI: In a large PR, things like this can be easily missed during the review process. Please be a little more mindful of such changes going forward.

Comment on lines +79 to +81
"server_uri": %q,
"channel_creds": [{"type": "insecure"}]
}]`, mgmtServer.Address)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like it was indented properly earlier and this change is messing up the indenting.

// When used with a go-control-plane management server that continuously
// resends resources which are NACKed by the xDS client, using a `Replace()`
// here and in OnResourceDoesNotExist() simplifies tests which will have
// here and in ResourceError() simplifies tests which will have
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: here and in AmbientError

// When used with a go-control-plane management server that continuously
// resends resources which are NACKed by the xDS client, using a `Replace()`
// here and in OnResourceDoesNotExist() simplifies tests which will have
// here and in ResourceError() simplifies tests which will have
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.

// When used with a go-control-plane management server that continuously
// resends resources which are NACKed by the xDS client, using a `Replace()`
// here and in OnResourceDoesNotExist() simplifies tests which will have
// here and in ResourceError() simplifies tests which will have
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And here.

// When used with a go-control-plane management server that continuously
// resends resources which are NACKed by the xDS client, using a `Replace()`
// here and in OnResourceDoesNotExist() simplifies tests which will have
// here and in ResourceError() simplifies tests which will have
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.

// When used with a go-control-plane management server that continuously
// resends resources which are NACKed by the xDS client, using a `Replace()`
// here and in OnResourceDoesNotExist() simplifies tests which will have
// here and in ResourceError() simplifies tests which will have
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And here.

@@ -161,7 +161,7 @@ func (s) TestHandleListenerResponseFromManagementServer(t *testing.T) {
Value: []byte{1, 2, 3, 4},
}},
},
wantErr: "Listener not found in received response",
wantErr: fmt.Sprintf("xds: resource \"%v\" of type \"ListenerResource\" does not exist", resourceName1),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would using %q instead of having to do \"%v\" make this easier to read?

Same with \"ListenerResource\". You could do a %q in the format string and include "ListenerResource" as one of the arguments to fmt.Sprintf.

Comment on lines +59 to +61
// of the update has completed. For example, if processing a resource requires
// watching new resources, those watches should be completed before done is
// called, which can happen after the ResourceWatcher method has returned.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please simplify/clarify/rephrase this sentence:

// For example, if processing a resource requires
// watching new resources, those watches should be completed before done is
// called, which can happen after the ResourceWatcher method has returned.

Specifically, what do you mean by "those watches should be completed before done is called"? I believe, just registration of the watch, right? So, that needs to be clarified.

And I don't understand what "which can happen after the ResourceWatcher method has returned" this means.

@easwars
Copy link
Contributor

easwars commented Mar 27, 2025

Everything else is working fine but looks like this TestServeAndCloseDoNotRace is causing race grpc/grpc-go/actions/runs/14091957805/job/39470559995?pr=7977. Server is stopping in the middle of the test.

Sorry, I don't have the cycles to debug this failure. But if you can dig a little deeper and get more info on the sequence of events leading to the failure, and why exactly the flake is happening, I might be able to help better. Once we have that information, we will be able to decide if we need to rewrite the test or if there is something wrong we are doing in the xDS client. Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: xDS Includes everything xDS related, including LB policies used with xDS. Type: Feature New features or improvements in behavior
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants