E2E: slim the graceful restart test #210
Conversation
Force-pushed from 1e6eb25 to 9c37383 (compare)
As part of the testing I have run the test 10 times (https://github.com/metallb/frr-k8s/actions/runs/11381447010?pr=210) and observed no failures.
Force-pushed from 9c37383 to 713781a (compare)
@fedepaol the failing test seems unrelated. Ready for feedback, thanks.
Can you explain in the commit message the rationale for why this improves the stability of the test? Here we are declaring only what we are doing, not why.
@@ -136,3 +136,14 @@ func ContainersForVRF(frrs []*frrcontainer.FRR, vrf string) []*frrcontainer.FRR
 	}
 	return res
 }
+
+func ContainersForName(frrs []*frrcontainer.FRR, name string) []*frrcontainer.FRR {
if here we are selecting one single container, this should be ContainerByName and return a single container (and error or panic if it doesn't find it)
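A minimal sketch of the suggested helper, assuming the existing `frrcontainer.FRR` slice exposes a `Name` field (returning an error rather than panicking is just one option):

```go
// Sketch only: return the single container matching name, or an error
// if it is not found. Assumes frrcontainer.FRR has a Name field and
// that fmt is imported.
func ContainerByName(frrs []*frrcontainer.FRR, name string) (*frrcontainer.FRR, error) {
	for _, frr := range frrs {
		if frr.Name == name {
			return frr, nil
		}
	}
	return nil, fmt.Errorf("container %s not found", name)
}
```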
done
e2etests/pkg/k8s/pods.go (outdated)
@@ -42,13 +40,16 @@ func FRRK8sPods(cs clientset.Interface) ([]*corev1.Pod, error) {
 	return res, nil
 }
 
-func RestartFRRK8sPods(cs clientset.Interface) error {
+func RestartFRRK8sPods(cs clientset.Interface, node string) error {
I'd rework this to accept one pod, and have a wrapper that restarts all the pods.
Can you elaborate a bit more? I do not understand the suggestion, thanks!
Ignore the wrapper part. I thought we were using this beyond the gr tests.
I'd like this to do less. So let's have one function for RestartFRRK8sPod(cs, pod)
and a way to retrieve the pod running on a given node (if we don't have it already).
From outside it's going to be
pod := podForNode(node)
restartPod(pod)
which I think is more straightforward. Also, given that we are talking about a single pod here, the waiting logic below can be simplified.
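A rough sketch of the retrieval half of that split; `podForNode` is a hypothetical name, and the sketch assumes the existing `FRRK8sPods` helper shown in the diff above:

```go
// Sketch only: find the frr-k8s pod scheduled on the given node.
// Assumes the corev1, clientset and fmt imports already used in this file.
func podForNode(cs clientset.Interface, nodeName string) (*corev1.Pod, error) {
	pods, err := FRRK8sPods(cs)
	if err != nil {
		return nil, err
	}
	for _, p := range pods {
		if p.Spec.NodeName == nodeName {
			return p, nil
		}
	}
	return nil, fmt.Errorf("no frr-k8s pod found on node %s", nodeName)
}
```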
I have removed the unneeded loop logic. Nevertheless, everything stays in the same function because I need the node name to get the pod (oldPod), then delete the pod, then use the node name again to fetch the new pod (newPod), and finally verify that old and new are different and check the status.
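For reference, the flow described above could look roughly like this hedged sketch (not the PR's actual code; `podForNode` is the hypothetical helper from the previous comment, and the context, metav1, wait and time imports are assumed):

```go
// Sketch only: restart the frr-k8s pod on one node and wait for its
// replacement (a pod with a different UID) to be running again.
func RestartFRRK8sPodForNode(cs clientset.Interface, nodeName string) error {
	oldPod, err := podForNode(cs, nodeName)
	if err != nil {
		return err
	}
	err = cs.CoreV1().Pods(oldPod.Namespace).Delete(context.TODO(), oldPod.Name, metav1.DeleteOptions{})
	if err != nil {
		return err
	}
	return wait.PollUntilContextTimeout(context.TODO(), time.Second, 2*time.Minute, false,
		func(ctx context.Context) (bool, error) {
			newPod, err := podForNode(cs, nodeName)
			if err != nil {
				return false, nil // the replacement may not exist yet
			}
			// Must be a different pod instance, and running.
			return newPod.UID != oldPod.UID && newPod.Status.Phase == corev1.PodRunning, nil
		})
}
```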
e2etests/tests/graceful_restart.go (outdated)
 	})
 
 	ginkgo.Context("When restarting the frrk8s deamon pods", func() {
 
 		ginkgo.DescribeTable("external BGP peer maintains routes", func(ipFam ipfamily.Family, prefix string) {
-			frrs := config.ContainersForVRF(infra.FRRContainers, "")
+			frrs := config.ContainersForName(infra.FRRContainers, "ebgp-single-hop")
if it's a single container, no need for the for loop below
Force-pushed from 38f0f3c to 52ffda7 (compare)
e2etests/pkg/k8s/pods.go (outdated)
-func RestartFRRK8sPods(cs clientset.Interface) error {
-	pods, err := FRRK8sPods(cs)
+func RestartFRRK8sPod(cs clientset.Interface, nodeName string) error {
+	podForNode := func() (*corev1.Pod, error) {
let's have this defined outside (below)
done
e2etests/pkg/k8s/pods.go (outdated)
@@ -42,38 +40,43 @@ func FRRK8sPods(cs clientset.Interface) ([]*corev1.Pod, error) {
 	return res, nil
 }
 
-func RestartFRRK8sPods(cs clientset.Interface) error {
-	pods, err := FRRK8sPods(cs)
+func RestartFRRK8sPod(cs clientset.Interface, nodeName string) error {
Let's rename this "RestartFRRK8sPodForNode".
done
Force-pushed from 514841b to 73e5f1f (compare)
This e2e test flakes often, especially when the GitHub actions are slow. Therefore we reduce the overall peering to make the test require fewer resources. What this commit does:
1. The GR test will only use peering from a single node to ebgp-single-hop (so we will run 4x fewer `docker exec` calls to grab the routes).
2. The GR test will restart a single frr-k8s pod, monitor that single route, and wait only for that pod to be healthy.
3. Given we target the GR between a specific host and a specific peer, we know which logs to look at in case of failure.
Signed-off-by: karampok <[email protected]>
Force-pushed from 73e5f1f to 38a00b7 (compare)
Is the failing test a known issue?
Is this really a time-consuming issue?
This shouldn't be a blocker if we have the proper failure information. All in all, I think we should cover all 4 combinations (iBGP/eBGP, single/multi hop). The easiest is to iterate over the containers and run this test (in this PR's version) only once. Optimizations would be: trying to understand whether that docker exec is really an offender, and if so, seeing if we can spin up goroutines to make it faster. If I had to bet, restarting a pod takes a lot of time but collecting routes should not. WDYT?
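If docker exec turns out to be the offender, the goroutine idea could look like this sketch (`Routes` is a hypothetical per-container getter wrapping docker exec, and the errgroup import from golang.org/x/sync is assumed):

```go
// Sketch only: query all containers concurrently so that slow
// docker exec calls overlap instead of adding up.
func routesForAll(frrs []*frrcontainer.FRR) error {
	var g errgroup.Group
	for _, frr := range frrs {
		frr := frr // capture the loop variable (pre-Go-1.22 semantics)
		g.Go(func() error {
			_, err := frr.Routes() // hypothetical getter wrapping docker exec
			return err
		})
	}
	return g.Wait()
}
```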
Could be related to the change of the metrics port in #206.
Update: correcting myself, it is not time consuming; I was thinking of the CPU resources. Did I say it was time consuming? If we run the tests one after another (for the ibgp-single peer: restart the pod, routes OK, pod OK; then start again for ebgp), the spike in system resources will be lower (in a period of 2 minutes, I do 120 docker exec calls times the number of containers).
I would say it is 70% the pod taking a long time, 30% docker exec failing. The difference here is that the route check (docker exec returning routes) must always succeed; a single failure fails the test. No other test requires that granularity (eventually it just retries and temporary failures are hidden). We need the route fetching to happen as fast as possible, with the pod restart running in the background, to observe route removal (bgp graceful-restart preserve-fw-state). Also, to be on the safe side, the 2m or 1m is hardcoded, because I cannot stop the test earlier once the pod is ready. One suggestion is that in this PR I only reduce the pod restart to one node, and we observe how flaky the test is before we refactor.
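The invariant described here, with the route check never allowed to fail while the pod restarts in the background, could be expressed with something like the following sketch (`checkRoutes` is a hypothetical helper, not the PR's code):

```go
// Sketch only: restart the pod in the background and assert that the
// route check never fails during the whole graceful-restart window.
go func() {
	defer ginkgo.GinkgoRecover()
	err := k8s.RestartFRRK8sPodForNode(cs, nodeName)
	gomega.Expect(err).NotTo(gomega.HaveOccurred())
}()

gomega.Consistently(func() error {
	return checkRoutes(frr, prefix) // hypothetical: docker exec into the peer, verify the route
}, 2*time.Minute, time.Second).ShouldNot(gomega.HaveOccurred())
```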
Another idea would be to change the checking logic and, instead of checking that the routes exist, check
Not sure if we can leverage this in another test.
#218 should improve the test so that it does not fail.
/kind flake

What this PR does / why we need it:

This e2e test flakes often, especially when the GitHub actions are slow. Therefore we reduce the overall peering to make the test require fewer resources. What this commit does:
1. The GR test will only use peering from a single node to ebgp-single-hop (so we will run 4x fewer `docker exec` calls to grab the routes).
2. The GR test will restart a single frr-k8s pod, monitor that single route, and wait only for that pod to be healthy.
3. Given we target the GR between a specific host and a specific peer, we know which logs to look at in case of failure.

Special notes for your reviewer:

Release note: