CASMINST-6829 - CSM health checks should detect when Weave has a peer in "sleeve" mode #573

Merged
merged 2 commits into from
Apr 2, 2024

spillerc-hpe
Contributor

Summary and Scope

If Weave has peers running in "sleeve" mode instead of fast datapath (fastdp) mode, pod-to-pod communication errors can occur, resulting in strange behaviour from services running in Kubernetes.

The upgrade procedure already checks for this because of an old issue, FN #6636, where the MTU was set incorrectly. However, the standard CSM health validation tests do not check for this condition.

This PR modifies the goss-weave-health.yaml test to check for this condition.
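
For illustration only, the idea behind the check can be sketched in Python. This is not the goss test itself. The function names and the sample text are hypothetical, and the sample lines are simplified stand-ins rather than captured `weave --local status connections` output. Like the goss pattern, the sketch treats any line mentioning "sleeve" as unhealthy.

```python
import re
from typing import List

# Hypothetical, simplified status lines (NOT real captured output).
# Addresses, peer names, and MTU values are made up for the example.
SAMPLE_STATUS = """\
-> 10.0.0.2:6783 established fastdp aa:bb:cc:00:00:01(ncn-w001) mtu=1376
<- 10.0.0.3:41234 established sleeve aa:bb:cc:00:00:02(ncn-w002) mtu=1438
"""

# Match the word "sleeve" anywhere on a line, similar in spirit to a
# negated ".*sleeve" pattern applied to the command's stdout.
SLEEVE_RE = re.compile(r"\bsleeve\b")


def sleeve_connections(status_output: str) -> List[str]:
    """Return the lines of the status text that mention sleeve mode."""
    return [
        line.strip()
        for line in status_output.splitlines()
        if SLEEVE_RE.search(line)
    ]


def weave_is_healthy(status_output: str) -> bool:
    """Healthy, for this sketch, means no connection is in sleeve mode."""
    return not sleeve_connections(status_output)


if __name__ == "__main__":
    for line in sleeve_connections(SAMPLE_STATUS):
        print("sleeve connection:", line)
    print("healthy:", weave_is_healthy(SAMPLE_STATUS))
```

In practice the status text would come from running the Weave command on a node. The sketch takes it as a string so the logic can be exercised without a cluster.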

Issues and Related PRs

List and characterize relationship to Jira/Github issues and other pull requests. Be sure to list dependencies.

Testing

Tested on:

  • surtur

Test description:

Verified that the test works as expected. With a Weave peer in sleeve mode, the Weave Health test reports a failure:

ncn-m001:/opt/cray/tests/install/ncn/tests # /opt/cray/tests/install/ncn/automated/ncn-k8s-combined-healthcheck
NCN and Kubernetes Checks
------------------------------

DEBUG: cmsdev_test_list='bos cfs conman ims tftp vcs'
Writing full output to /opt/cray/tests/install/logs/print_goss_json_results/20240328_102745.753883-324520-pzJjmDlk/out

Running tests

Checking test results
Only errors will be printed to the screen

Result: FAIL
Source: http://ncn-m001.hmn:9001/ncn-kubernetes-tests-master
Test Name: Weave Health
Description: Check that Weave is healthy. Run "weave --local status connections" on any unhealthy node and check the MTU.
Test Summary: Command: k8s_check_weave_status: stdout: patterns not found: [!/.*sleeve/]
Execution Time: 0.000115756 seconds
Node: ncn-m001

Risks and Mitigations

Pull Request Checklist

  • Version number(s) incremented, if applicable
  • Copyrights updated
  • License file intact
  • Target branch correct
  • CHANGELOG.md updated
  • Testing is appropriate and complete, if applicable
  • HPC Product Announcement prepared, if applicable

@spillerc-hpe spillerc-hpe requested a review from a team as a code owner March 28, 2024 11:51
@spillerc-hpe
Contributor Author

/backport release/1.5

Contributor

Backporting into branch release/1.5 was successful. New PR: #574

@dborman-hpe dborman-hpe left a comment

LGTM

@spillerc-hpe spillerc-hpe merged commit 138752b into release/1.6 Apr 2, 2024
3 checks passed
@spillerc-hpe spillerc-hpe deleted the CASMINST-6829 branch April 2, 2024 14:06
2 participants