fix: detect and report cluster connection errors #559
base: master
Conversation
Codecov Report. Attention: Patch coverage is
Additional details and impacted files:

@@            Coverage Diff             @@
##           master     #559      +/-   ##
==========================================
- Coverage   54.71%   53.95%   -0.77%
==========================================
  Files          41       41
  Lines        4834     4934    +100
==========================================
+ Hits         2645     2662     +17
- Misses       1977     2058     +81
- Partials      212      214      +2

View full report in Codecov by Sentry.
Force-pushed from 6dff0cb to 223a309.
pkg/cache/cluster.go
Outdated
@@ -615,6 +630,29 @@ func (c *clusterCache) watchEvents(ctx context.Context, api kube.APIResourceInfo
	if errors.IsNotFound(err) {
		c.stopWatching(api.GroupKind, ns)
	}
	var connectionUpdated bool
	if err != nil {
Hey Chetan, me again, a few more thoughts while thinking about the ramifications of the cluster invalidation c.Invalidate() and c.EnsureSynced() calls. (Feel free to poke holes in my logic!)
- The watch function (defined here) is called for each watched resource, so if there are 100 resources, it will be called once for each, to establish the watch.
- What about the scenario where only 1 watch (or a small number) is failing, for example because the ServiceAccount does not have the correct permissions to watch the resource?
  - If 1 watch out of 100 fails, will the cluster oscillate between Successful and Failed?
  - It might look like this:
    - Watch 1: succeeds, set cluster status to Successful
    - Watch 2: succeeds, set cluster status to Successful
    - Watch 3: succeeds, set cluster status to Successful
    - (...)
    - Watch 80: fails, set cluster status to Failed
  - Perhaps the ideal behaviour would be to report success/failure only after all were established. (Easier said than done, I'm sure.)
- Another question: since it appears we are invalidating the cluster cache on any failure:
  - Will this cause cache invalidation and re-sync to occur much more often if the network connection between Argo CD and the target (watched) cluster happens to be unstable? (For example, imagine a case where 1% of connections fail. Would this magnify into invalidating and forcing a re-sync of all the resources, if only 1% of connections were failing?)
Hi Jonathan, your concerns are valid. Since watches can fail for any reason, it is difficult to determine the cluster status from the errors returned during a watch. These errors can be transient, which makes it complicated to keep track of their states, and the error message can be too generic to tell why exactly the watch failed.
To fix this issue, I have included a goroutine that periodically checks for watch errors, pings the remote cluster (to get the version), and updates the status. With this approach we don't rely completely on the watches, which prevents the status from oscillating in the case of transient watch failures. We use the watch status to limit the number of pings to the remote cluster, pinging only when the watches fail. I've also added checks to ensure we invalidate the cluster cache only when the status changes, which shouldn't happen frequently.
Let me know what you think. The periodic goroutine approach handles some of the edge cases that arise from the dynamic nature of the watches, but I'm also open to hearing other approaches to solve this problem.
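As a rough illustration of the watch-failure tracking described here, a minimal sketch follows. The watchFails name and its len() accessor appear in the diff further down; the key type, fields, and add/remove helpers are assumptions made for the example.

package cache

import "sync"

// watchFails tracks which watches are currently failing. Watch goroutines
// add/remove entries as they fail and recover, and the periodic connection
// checker reads len() to decide whether the cluster needs to be pinged.
// The len() accessor matches the diff below; the rest of the shape is an
// illustrative guess, not the exact implementation.
type watchFails struct {
	mu       sync.Mutex
	failures map[string]bool // keyed by e.g. "GroupKind/namespace"
}

func newWatchFails() *watchFails {
	return &watchFails{failures: make(map[string]bool)}
}

// add marks the watch identified by key as failing.
func (w *watchFails) add(key string) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.failures[key] = true
}

// remove clears the failure once the watch is re-established.
func (w *watchFails) remove(key string) {
	w.mu.Lock()
	defer w.mu.Unlock()
	delete(w.failures, key)
}

// len reports how many watches are currently failing.
func (w *watchFails) len() int {
	w.mu.Lock()
	defer w.mu.Unlock()
	return len(w.failures)
}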
I would love to rewrite the cluster cache logic entirely and replace the locks with Go channels (which are much easier to reason about, when done right!), but that's probably not happening any time soon. 😄
pkg/cache/cluster.go
Outdated
for {
	select {
	case <-ticker.C:
		watchErrors := c.watchFails.len()
		// Ping the cluster for connection verification if there are watch failures or
		// if the cluster has recovered back from watch failures.
		if watchErrors > 0 || (watchErrors == 0 && c.connectionStatus == ConnectionStatusFailed) {
			c.log.V(1).Info("verifying cluster connection", "watches", watchErrors)

			_, err := c.kubectl.GetServerVersion(c.config)
			if err != nil {
				if c.connectionStatus != ConnectionStatusFailed {
					c.updateConnectionStatus(ConnectionStatusFailed)
				}
			} else if c.connectionStatus != ConnectionStatusSuccessful {
				c.updateConnectionStatus(ConnectionStatusSuccessful)
			}
		}
	case <-ctx.Done():
		ticker.Stop()
		return
	}
}
Within this function:
- The cluster cache lock should be owned before reading/writing c.connectionStatus.
- Now you might be tempted to wrap the entire case in a lock/unlock, BUT we probably don't want to own the lock while calling GetServerVersion (it's nearly always bad to block a lock on I/O), so that makes things a bit more complicated.
I've refactored the function to take the lock when reading c.connectionStatus, and managed to avoid holding it around GetServerVersion. Hopefully this should fix the issue.
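For reference, here is a self-contained sketch of that locking pattern: read the status under the lock, drop the lock before the network call, and re-acquire it only if the status changes. The type and function names below (connectionChecker, pingCluster) are illustrative stand-ins, not the actual cluster.go code.

package main

import (
	"fmt"
	"sync"
)

// connectionChecker is a stand-in for clusterCache; status plays the role of
// c.connectionStatus and pingCluster the role of kubectl.GetServerVersion.
type connectionChecker struct {
	mu     sync.RWMutex
	status string // "Successful" or "Failed"
}

// pingCluster simulates the remote version call; any slow I/O call fits here.
func pingCluster() error {
	return nil
}

func (c *connectionChecker) check() {
	// Snapshot the status under the read lock, then release it immediately.
	c.mu.RLock()
	status := c.status
	c.mu.RUnlock()

	// The network call runs with no locks held.
	err := pingCluster()

	newStatus := "Successful"
	if err != nil {
		newStatus = "Failed"
	}

	// Only re-acquire the (write) lock if the status actually changed.
	if newStatus != status {
		c.mu.Lock()
		c.status = newStatus
		c.mu.Unlock()
	}
}

func main() {
	c := &connectionChecker{status: "Failed"}
	c.check()
	fmt.Println(c.status) // "Successful", since pingCluster reported no error
}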
pkg/cache/cluster.go
Outdated
}

func (c *clusterCache) clusterConnectionService(ctx context.Context) {
	clusterConnectionTimeout := 10 * time.Second
I'm wondering if 10 seconds is maybe too fast. My main concern is that updateConnectionStatus is a very heavy call, because (among other things) it will invalidate the cache and cancel all the watches. Perhaps a minute? Perhaps longer? WDYT?
I've used 10 seconds to poll the status of watch failures. We might have to ping the remote cluster every 10s if the watches are failing, but updateConnectionStatus will run only once, when the status changes; for the subsequent polls, the condition checks ensure we don't call it often.
These are the conditions we check before invalidating the cache (condensed into code in the sketch below):
1. Check whether there are watch failures, or whether the cluster has recovered (no watch errors, but the connection status is 'Failed').
2. If either of the above conditions is met, we ping the remote cluster to get the version.
3. If there is no error but the connection status is Failed, we invalidate the cache.
4. If there is an error but the connection status is Successful, we invalidate the cache.
I'm open to updating to a longer duration, but the tradeoff is that we can't recognize failures early. This should be okay I suppose, since Argo CD is not primarily intended for cluster monitoring.
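Condensed into code, the decision above could look roughly like this; connectionState, shouldPing, and shouldInvalidate are illustrative names for the sketch, while the actual logic lives inline in the ticker loop shown earlier.

package cache

// connectionState captures the inputs to the invalidation decision described
// in the list above. Names here are illustrative, not the PR's code.
type connectionState struct {
	watchErrors  int
	statusFailed bool // true when the last known connection status is Failed
}

// shouldPing reports whether the remote cluster needs to be pinged at all:
// either watches are currently failing, or they have recovered while the
// recorded status still says Failed.
func shouldPing(s connectionState) bool {
	return s.watchErrors > 0 || (s.watchErrors == 0 && s.statusFailed)
}

// shouldInvalidate reports whether the cache should be invalidated after a
// ping: only when the ping result disagrees with the recorded status.
func shouldInvalidate(s connectionState, pingErr error) bool {
	if pingErr == nil {
		return s.statusFailed // recovered: Failed -> Successful
	}
	return !s.statusFailed // degraded: Successful -> Failed
}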
I'd recommend making the interval configurable per cache, so it can be parameterized by the consumer.
Made the interval configurable 👍
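A sketch of how a per-cache interval could be wired using functional options (a style the cache package already uses for its other settings); the names checkerSettings, Option, and WithConnectionCheckInterval are illustrative, not necessarily what the PR uses.

package main

import (
	"fmt"
	"time"
)

// checkerSettings holds the configurable pieces of the connection checker;
// in the real cache this would live on clusterCache itself.
type checkerSettings struct {
	connectionCheckInterval time.Duration
}

// Option follows the functional-option style used for other cache settings.
type Option func(*checkerSettings)

// WithConnectionCheckInterval lets the consumer (e.g. Argo CD) choose how
// often the connection checker runs. Illustrative name only.
func WithConnectionCheckInterval(d time.Duration) Option {
	return func(s *checkerSettings) { s.connectionCheckInterval = d }
}

func newCheckerSettings(opts ...Option) *checkerSettings {
	s := &checkerSettings{connectionCheckInterval: 10 * time.Second} // default
	for _, opt := range opts {
		opt(s)
	}
	return s
}

func main() {
	s := newCheckerSettings(WithConnectionCheckInterval(time.Minute))
	fmt.Println(s.connectionCheckInterval) // 1m0s
}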
Force-pushed from ed68b74 to 76535b7.
pkg/cache/cluster.go
Outdated
if watchErrors > 0 || watchesRecovered {
	c.log.V(1).Info("verifying cluster connection", "watches", watchErrors)

	_, err := c.kubectl.GetServerVersion(c.config)
I feel evaluation of this call should be done with a certain degree of tolerance for intermittent failures, e.g. a (configurable) retry on certain errors such as a flaky network.
Since we call this using a ticker, it will be evaluated every X seconds (default 10). So, it will be called again after an interval in the case of intermittent failures. Do we still need to explicitly retry? In that case, can we reuse the existing listRetryFunc?
It depends - will a failing version call lead to cluster cache invalidation, and then be re-built in the next interval?
There is a rare possibility of frequent invalidations if both GetServerVersion() and the watches fail/recover between successive ticker intervals. I have wrapped the call in a retry that checks for transient network errors using isTransientNetworkErr(), so it should retry on flaky network errors without invalidating the cache.
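A rough sketch of what such a retry wrapper could look like, assuming client-go's retry.OnError helper; the error classification in isTransientNetworkErr below is an approximation for illustration, not the PR's actual helper, and getServerVersionWithRetry is a hypothetical name.

package cache

import (
	"errors"
	"net"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/util/retry"
)

// isTransientNetworkErr approximates the check described above: treat
// timeouts and low-level dial/read failures as transient and worth retrying.
// The real helper may classify errors differently.
func isTransientNetworkErr(err error) bool {
	var netErr net.Error
	if errors.As(err, &netErr) && netErr.Timeout() {
		return true
	}
	var opErr *net.OpError
	return errors.As(err, &opErr) // e.g. connection refused / reset
}

// getServerVersionWithRetry retries the supplied version call a few times,
// but only for errors the classifier considers transient, so persistent
// failures still surface quickly and can flip the connection status.
func getServerVersionWithRetry(getVersion func() error) error {
	backoff := wait.Backoff{
		Steps:    3,
		Duration: 500 * time.Millisecond,
		Factor:   2.0,
		Jitter:   0.1,
	}
	return retry.OnError(backoff, isTransientNetworkErr, getVersion)
}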
Quality Gate failed. See analysis details on SonarCloud.
Force-pushed from 8587b12 to cea6dcc.
Quality Gate failed. See analysis details on SonarCloud.
LGTM, thank you @chetan-rns!
Signed-off-by: Chetan Banavikalmutt <[email protected]>
Signed-off-by: Chetan Banavikalmutt <[email protected]>
Signed-off-by: Chetan Banavikalmutt <[email protected]>
Signed-off-by: Chetan Banavikalmutt <[email protected]>
- Refactor the clusterConnectionService to use locks before reading the status
- Rename the function that monitors the cluster connection status
- Fix typo
Signed-off-by: Chetan Banavikalmutt <[email protected]>
Signed-off-by: Chetan Banavikalmutt <[email protected]>
Signed-off-by: Chetan Banavikalmutt <[email protected]>
Force-pushed from cea6dcc to b3e1c67.
Quality Gate passed.
Currently, Argo CD doesn't report connection errors on the cluster UI page unless the cache is invalidated (manually, or periodically every 24 hours) or there is a sync error. During auto-sync, this error is only updated if there is a new commit to sync. This PR adds a new field ConnectionStatus that gets updated whenever Argo CD fails to access the cluster. The ClusterInfoUpdater will periodically fetch this cluster info and update the appstate cache.
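For context, a sketch of roughly what the new status could look like; the ConnectionStatusSuccessful and ConnectionStatusFailed values appear in the diffs above, while the string type, the Unknown default, and the ClusterConnectionInfo struct shape are assumptions made for illustration.

package cache

// ConnectionStatus reflects whether the cache can currently reach the cluster.
type ConnectionStatus string

const (
	// These two values appear in the reviewed diffs.
	ConnectionStatusSuccessful ConnectionStatus = "Successful"
	ConnectionStatusFailed     ConnectionStatus = "Failed"
	// An "Unknown" initial state is a plausible default, assumed here.
	ConnectionStatusUnknown ConnectionStatus = "Unknown"
)

// ClusterConnectionInfo sketches the subset of cluster info a consumer such
// as Argo CD's ClusterInfoUpdater would read and persist in its appstate
// cache; the real ClusterInfo struct carries more fields than shown here.
type ClusterConnectionInfo struct {
	Server           string
	ConnectionStatus ConnectionStatus
}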