fix: detect and report cluster connection errors #559

Open
chetan-rns wants to merge 7 commits into master from update-cluster-status

Conversation

chetan-rns
Member

Currently, Argo CD doesn't report connection errors on the cluster UI page unless the cache is invalidated (manually, or periodically every 24 hours) or there is a sync error. During auto-sync, the error is only updated when there is a new commit to sync. This PR adds a new field, ConnectionStatus, that is updated whenever Argo CD fails to access the cluster. The ClusterInfoUpdater periodically fetches this cluster info and updates the appstate cache.
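
For illustration, here is a minimal sketch of what the new status field could look like (the name ConnectionStatus and the Successful/Failed constants appear in the PR code further down; the exact type definition and its placement on the cluster info struct are assumptions):

// Sketch only: the concrete type and struct layout are assumptions.
type ConnectionStatus string

const (
	ConnectionStatusSuccessful ConnectionStatus = "Successful"
	ConnectionStatusFailed     ConnectionStatus = "Failed"
)

// The cluster info reported to consumers such as Argo CD's ClusterInfoUpdater
// would carry the new field alongside the existing statistics.
type ClusterInfo struct {
	Server           string
	ConnectionStatus ConnectionStatus
	// ... other existing fields (e.g. resource counts, last cache sync time)
}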

codecov bot commented Dec 5, 2023

Codecov Report

Attention: Patch coverage is 22.32143%, with 87 lines in your changes missing coverage. Please review.

Project coverage is 53.95%. Comparing base (5fd9f44) to head (cea6dcc).
Report is 4 commits behind head on master.

❗ The current head cea6dcc differs from the pull request's most recent head b3e1c67. Consider uploading reports for commit b3e1c67 to get more accurate results.

Files                   Patch %   Lines
pkg/cache/cluster.go    24.27%    76 Missing and 2 partials ⚠️
pkg/cache/settings.go   0.00%     9 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #559      +/-   ##
==========================================
- Coverage   54.71%   53.95%   -0.77%     
==========================================
  Files          41       41              
  Lines        4834     4934     +100     
==========================================
+ Hits         2645     2662      +17     
- Misses       1977     2058      +81     
- Partials      212      214       +2     

☔ View full report in Codecov by Sentry.

@chetan-rns chetan-rns force-pushed the update-cluster-status branch 2 times, most recently from 6dff0cb to 223a309 on December 5, 2023 13:01
@chetan-rns chetan-rns changed the title from "bug: detect and report cluster connection errors" to "fix: detect and report cluster connection errors" on Dec 5, 2023
pkg/cache/cluster.go (outdated review thread, resolved)
@chetan-rns chetan-rns requested a review from jgwest January 3, 2024 13:29
@@ -615,6 +630,29 @@ func (c *clusterCache) watchEvents(ctx context.Context, api kube.APIResourceInfo
	if errors.IsNotFound(err) {
		c.stopWatching(api.GroupKind, ns)
	}
	var connectionUpdated bool
	if err != nil {
Member

Hey Chetan, me again, a few more thoughts while thinking about the ramifications of the cluster invalidation c.Invalidate() and c.EnsureSynced() calls. (Feel free to poke holes in my logic!)

  • The watch function (defined here) is called for each watched resource, so if there are 100 resources, it will be called once for each, to establish the watch.

  • What about the scenario where only 1 watch (or a small number) is failing, for example because the ServiceAccount does not have the correct permissions to watch the resource?

    • If 1 watch out of 100 fails, will the cluster status oscillate between Successful and Failed?
    • It might look like this:
    • Watch 1: succeeds, set cluster status to Successful
    • Watch 2: succeeds, set cluster status to Successful
    • Watch 3: succeeds, set cluster status to Successful
    • (...)
    • Watch 80: fails, set cluster status to Failed
    • Perhaps the ideal behaviour would be to report success/failure only after all watches were established. (Easier said than done, I'm sure.)
  • Another question: since it appears we are invalidating the cluster cache on any failure:

    • Will this cause cache invalidation and re-sync to occur much more often if the network connection between Argo CD and the target (watched) cluster happens to be unstable? For example, if 1% of connections fail, would that be magnified into invalidating and forcing a re-sync of all the resources?

Member Author

Hi Jonathan, your concerns are valid. Since watches can fail for many reasons, it is difficult to determine the cluster status from the errors returned during a watch. These errors can be transient, which makes it complicated to track their state, and the error messages are often too generic to tell why exactly the watch failed.

To address this, I have added a goroutine that periodically checks for watch errors, pings the remote cluster (to get its version), and updates the status. With this approach we don't rely entirely on the watches, which prevents the status from oscillating on transient watch failures. We use the watch status to limit the number of pings to the remote clusters, pinging a cluster only when its watches fail. I've also added checks to ensure we invalidate the cluster cache only when the status changes, which shouldn't happen frequently.

Let me know what you think. The periodic goroutine handles some of the edge cases that arise from the dynamic nature of the watches, but I'm open to other approaches to this problem.
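
To make the polling approach concrete, here is a minimal sketch of the kind of concurrency-safe watch-failure tracker the goroutine could consult (the field c.watchFails and its len() method appear in the PR code below; the internal representation here is an assumption):

import "sync"

// watchFailures is a sketch of a thread-safe set of group/kinds whose watches
// are currently failing; the actual implementation in the PR may differ.
type watchFailures struct {
	mu       sync.RWMutex
	failures map[string]bool
}

func newWatchFailures() *watchFailures {
	return &watchFailures{failures: map[string]bool{}}
}

// add records a failing watch, keyed by something like "group/kind".
func (w *watchFailures) add(key string) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.failures[key] = true
}

// remove clears the entry once the watch is re-established.
func (w *watchFailures) remove(key string) {
	w.mu.Lock()
	defer w.mu.Unlock()
	delete(w.failures, key)
}

// len returns the number of currently failing watches.
func (w *watchFailures) len() int {
	w.mu.RLock()
	defer w.mu.RUnlock()
	return len(w.failures)
}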

@chetan-rns chetan-rns requested a review from jgwest January 29, 2024 14:32
@jgwest jgwest (Member) left a comment

I would love to rewrite the cluster cache logic entirely and replace the locks with Go channels (which are much easier to reason about, when done right!), but that's probably not happening any time soon. 😄

pkg/cache/cluster.go (outdated review thread, resolved)
Comment on lines 1246 to 1300
	for {
		select {
		case <-ticker.C:
			watchErrors := c.watchFails.len()
			// Ping the cluster for connection verification if there are watch failures or
			// if the cluster has recovered back from watch failures.
			if watchErrors > 0 || (watchErrors == 0 && c.connectionStatus == ConnectionStatusFailed) {
				c.log.V(1).Info("verifying cluster connection", "watches", watchErrors)

				_, err := c.kubectl.GetServerVersion(c.config)
				if err != nil {
					if c.connectionStatus != ConnectionStatusFailed {
						c.updateConnectionStatus(ConnectionStatusFailed)
					}
				} else if c.connectionStatus != ConnectionStatusSuccessful {
					c.updateConnectionStatus(ConnectionStatusSuccessful)
				}
			}
		case <-ctx.Done():
			ticker.Stop()
			return
		}
	}
}
Member

Within this function:

  • The cluster cache lock should be held before reading or writing c.connectionStatus.
    • Now, you might be tempted to wrap the entire case in a lock/unlock, BUT we probably don't want to hold the lock while calling GetServerVersion (it's nearly always bad to block on I/O while holding a lock), so that makes things a bit more complicated.

Member Author

I've refactored the function to take the lock when reading c.connectionStatus, and managed to avoid holding it around GetServerVersion. Hopefully this fixes the issue.
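
Roughly, the refactor described here could look like the following sketch (not the exact PR code; getConnectionStatus is a hypothetical helper, and it assumes the cache's existing read/write lock):

// getConnectionStatus reads the current status under the cluster cache lock.
func (c *clusterCache) getConnectionStatus() ConnectionStatus {
	c.lock.RLock()
	defer c.lock.RUnlock()
	return c.connectionStatus
}

// Inside the ticker case: read the status under the lock, but release it
// before the network call so a slow API server cannot block the cache.
status := c.getConnectionStatus()
_, err := c.kubectl.GetServerVersion(c.config)
if err != nil && status != ConnectionStatusFailed {
	c.updateConnectionStatus(ConnectionStatusFailed)
} else if err == nil && status != ConnectionStatusSuccessful {
	c.updateConnectionStatus(ConnectionStatusSuccessful)
}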

pkg/cache/cluster.go (outdated review thread, resolved)
pkg/cache/cluster.go (outdated review thread, resolved)
}

func (c *clusterCache) clusterConnectionService(ctx context.Context) {
	clusterConnectionTimeout := 10 * time.Second
Member

I'm wondering if 10 seconds is maybe too fast. My main concern is that updateConnectionStatus is a very heavy call, because (among other things) it will invalidate the cache and cancel all the watches.

Perhaps a minute? Perhaps longer? WDYT?

Member Author

I've used 10 seconds for polling the status of the watch failures. We might ping the remote cluster every 10s while the watches are failing, but updateConnectionStatus runs only once, when the status changes; for subsequent polls, the condition checks ensure we don't call it again.

These are the conditions we check before invalidating the cache:
1. Check whether there are watch failures, or whether the cluster has recovered (no watch errors, but the connection status is still 'Failed').
2. If either condition in (1) is met, we ping the remote cluster to get its version.
3. If there is no error but the connection status is Failed, we invalidate the cache.
4. If there is an error but the connection status is Successful, we invalidate the cache.

I'm open to a longer duration, but the tradeoff is that failures are recognized later. That should be okay, I suppose, since Argo CD is not primarily used for cluster monitoring.

Member

I'd recommend making the interval configurable per cache, so it can be parameterized by the consumer.

Member Author

Made the interval configurable 👍
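
For example, the interval could be exposed through the cache's functional options in pkg/cache/settings.go; the option and field names below are hypothetical:

// SetClusterConnectionInterval is a hypothetical option controlling how often
// the connection-monitoring goroutine polls for watch failures.
func SetClusterConnectionInterval(interval time.Duration) UpdateSettingsFunc {
	return func(cache *clusterCache) {
		cache.clusterConnectionInterval = interval
	}
}

// Assumed usage by a consumer such as Argo CD:
// clusterCache := cache.NewClusterCache(config, cache.SetClusterConnectionInterval(time.Minute))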

@chetan-rns chetan-rns force-pushed the update-cluster-status branch from ed68b74 to 76535b7 on February 9, 2024 05:12
if watchErrors > 0 || watchesRecovered {
	c.log.V(1).Info("verifying cluster connection", "watches", watchErrors)

	_, err := c.kubectl.GetServerVersion(c.config)
Member

I feel the evaluation of this call should be done with a certain degree of tolerance for intermittent failures, e.g. a (configurable) retry on certain errors, such as those caused by a flaky network.

Member Author

Since we call this using a ticker, it will be evaluated every X seconds (default 10), so it will be called again in the next interval in the case of intermittent failures. Do we still need an explicit retry? If so, can we reuse the existing listRetryFunc?

Member

It depends: will a failing version call lead to cluster cache invalidation, with the cache then being rebuilt in the next interval?

Member Author

There is a rare possibility of frequent invalidations if both GetServerVersion() and the watches fail/recover between successive ticker intervals. I have wrapped the call in a retry that checks for transient network errors using isTransientNetworkErr(), so it retries on flaky network errors without invalidating the cache.
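
As a sketch of what the guarded call could look like using client-go's retry helper (isTransientNetworkErr is the helper named above, while its body and the backoff values here are assumptions):

import (
	"net"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/util/retry"
)

// isTransientNetworkErr reports whether an error looks like a flaky-network
// failure worth retrying (sketch; the PR's actual checks may differ).
func isTransientNetworkErr(err error) bool {
	netErr, ok := err.(net.Error)
	return ok && netErr.Timeout()
}

// probeCluster retries GetServerVersion on transient network errors before
// the caller treats the failure as a real connection problem.
func (c *clusterCache) probeCluster() error {
	backoff := wait.Backoff{Steps: 3, Duration: 500 * time.Millisecond, Factor: 2.0, Jitter: 0.1}
	return retry.OnError(backoff, isTransientNetworkErr, func() error {
		_, err := c.kubectl.GetServerVersion(c.config)
		return err
	})
}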


Quality Gate failed

Failed conditions
C Security Rating on New Code (required ≥ A)

See analysis details on SonarCloud


@chetan-rns chetan-rns requested a review from jgwest February 19, 2024 11:17
@chetan-rns chetan-rns force-pushed the update-cluster-status branch from 8587b12 to cea6dcc Compare April 4, 2024 10:34

sonarqubecloud bot commented Apr 4, 2024

Quality Gate failed

Failed conditions
C Security Rating on New Code (required ≥ A)

See analysis details on SonarCloud


@chetan-rns chetan-rns requested a review from jannfis April 4, 2024 10:53
@jannfis jannfis (Member) left a comment

LGTM, thank you @chetan-rns !

chetan-rns added 7 commits May 9, 2024 18:37
Signed-off-by: Chetan Banavikalmutt <[email protected]>
- Refactor the clusterConnectionService to use locks before reading the status
- Rename the function that monitors the cluster connection status
- Fix typo

Signed-off-by: Chetan Banavikalmutt <[email protected]>
@chetan-rns chetan-rns force-pushed the update-cluster-status branch from cea6dcc to b3e1c67 on May 9, 2024 13:10

sonarqubecloud bot commented May 9, 2024

Quality Gate passed

Issues
1 New issue
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
1.7% Duplication on New Code

See analysis details on SonarCloud

Labels: None yet
Projects: None yet
3 participants