Fix race on watcher update check #260

azdagron · 2024-01-17T19:09:20Z

When watchers are initialized, they wait until the Workload API has streamed back the initial response before returning from NewXXXWatcher. The intended semantics are that a subsequent call to WaitForUpdate completes only once the next response has arrived.

When an update is received, the flow is to:

1. record the update
2. close the "got first response" channel (only happens once)
3. send on the "updated" channel (to signal callers of WaitForUpdate)

However, this sequence is racy. Closing the "got first response" channel unblocks the newWatcher call, which then drains the "updated" channel. If the drain happens after step (3), everything is fine. If it happens before, step (3) then sends on the "updated" channel, which is buffered, so the signal stays queued. The next call to WaitForUpdate then unblocks even though no new update was received.
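The problematic interleaving can be replayed deterministically on a single goroutine. The following is a minimal sketch, not the library's code: the function name `staleSignal` and the channel variables are hypothetical, and the only assumption taken from the description is that the "updated" channel is buffered (a capacity of 1 is assumed here).

```go
package main

import "fmt"

// staleSignal replays the racy ordering step by step so the outcome is
// deterministic. All names are hypothetical stand-ins for illustration.
func staleSignal() bool {
	firstResponse := make(chan struct{})
	updated := make(chan struct{}, 1) // buffered; capacity 1 is an assumption

	// Step 2: the update handler closes the "got first response" channel.
	close(firstResponse)

	// The constructor is now unblocked and drains "updated" before
	// step 3 has run, so there is nothing to drain yet.
	<-firstResponse
	select {
	case <-updated:
	default:
	}

	// Step 3: the handler signals the initial response, after the drain.
	select {
	case updated <- struct{}{}:
	default:
	}

	// A later wait finds the leftover signal although no new update arrived.
	select {
	case <-updated:
		return true
	default:
		return false
	}
}

func main() {
	fmt.Println("stale signal observed:", staleSignal())
}
```

In real code the two sides run on different goroutines, so whether the drain lands before or after step 3 depends on scheduling; the sketch simply pins the bad ordering.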

This change fixes the race by swapping steps (2) and (3). The "updated" channel is sent on first, and only then is the "got first response" channel closed, so that the drain always takes place afterwards.
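A sketch of the corrected ordering, under stated assumptions: the `watcher` type, its fields, and the methods `onUpdate`, `waitForFirst`, and `hasPendingUpdate` are hypothetical names chosen for illustration and are not the library's actual implementation. `hasPendingUpdate` is a non-blocking stand-in for WaitForUpdate so the example can report a result instead of blocking.

```go
package main

import (
	"fmt"
	"sync"
)

// watcher is a hypothetical stand-in for the real watcher type.
type watcher struct {
	firstResponse     chan struct{}
	firstResponseOnce sync.Once
	updated           chan struct{}
	latest            string
}

func newWatcher() *watcher {
	return &watcher{
		firstResponse: make(chan struct{}),
		updated:       make(chan struct{}, 1), // capacity 1 is an assumption
	}
}

// onUpdate applies the corrected order: record, signal, then close.
func (w *watcher) onUpdate(v string) {
	w.latest = v // 1. record the update
	select {     // 2. signal "updated" without blocking
	case w.updated <- struct{}{}:
	default:
	}
	// 3. close "got first response" (only happens once)
	w.firstResponseOnce.Do(func() { close(w.firstResponse) })
}

// waitForFirst mirrors the constructor: wait for the first response, then
// drain the signal that the initial response is now guaranteed to have sent.
func (w *watcher) waitForFirst() {
	<-w.firstResponse
	select {
	case <-w.updated:
	default:
	}
}

// hasPendingUpdate reports whether an update signal is queued.
func (w *watcher) hasPendingUpdate() bool {
	select {
	case <-w.updated:
		return true
	default:
		return false
	}
}

// demo returns whether a signal is pending right after initialization and
// after one further update.
func demo() (afterInit, afterNext bool) {
	w := newWatcher()
	go w.onUpdate("initial") // initial response arrives on another goroutine
	w.waitForFirst()
	afterInit = w.hasPendingUpdate()
	w.onUpdate("next")
	afterNext = w.hasPendingUpdate()
	return afterInit, afterNext
}

func main() {
	fmt.Println(demo()) // prints "false true"
}
```

Because the close happens after the send, any goroutine that observes the closed channel also observes the queued signal, so the drain cannot miss it.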

Signed-off-by: Andrew Harding <[email protected]>

amartinezfayo

Thank you @azdagron, LGTM!

azdagron requested review from amartinezfayo and evan2645 as code owners January 17, 2024 19:09

azdagron force-pushed the azdagron/fix-wait-race branch from c5d02a2 to 61ead82 January 17, 2024 19:11

amartinezfayo approved these changes Jan 17, 2024

azdagron merged commit 35e62f6 into main Jan 17, 2024
6 checks passed

azdagron deleted the azdagron/fix-wait-race branch January 17, 2024 19:38
