
E2E: slim the graceful restart test #210

Closed

Conversation

karampok
Contributor

@karampok karampok commented Oct 17, 2024

This e2e test flakes often, especially when the GitHub Actions runners
are slow. Therefore we reduce the overall peering to make the test
require fewer resources. What this commit does:

  1. The GR test will only use peering from a single node to ebgp-single-hop
    (so we will run 4x fewer `docker exec` calls to grab the routes).
  2. The GR test will restart a single frr-k8s pod and will monitor that single
    route, waiting only for that pod to be healthy.
  3. Given we target GR between a specific host and a specific peer, we
    know which logs to look at in case of failure.

/kind flake

What this PR does / why we need it:

Special notes for your reviewer:

Release note:

NONE

@karampok karampok force-pushed the slim-gr-test branch 2 times, most recently from 1e6eb25 to 9c37383 on October 17, 2024 09:07
@karampok
Contributor Author

As part of the testing I have run the test 10 times (https://github.com/metallb/frr-k8s/actions/runs/11381447010?pr=210) and observed no failures.

@karampok karampok marked this pull request as ready for review October 18, 2024 12:41
@karampok
Contributor Author

@fedepaol the failing test seems unrelated. Ready for feedback, thanks

@fedepaol
Member

Can you explain in the commit message the rationale for why this improves the stability of the test? Here we are only declaring what we are doing, not why.

@@ -136,3 +136,14 @@ func ContainersForVRF(frrs []*frrcontainer.FRR, vrf string) []*frrcontainer.FRR
}
return res
}

func ContainersForName(frrs []*frrcontainer.FRR, name string) []*frrcontainer.FRR {
Member

If here we are selecting one single container, this should be ContainerByName and return a single container (and error or panic if it doesn't find it).
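A minimal sketch of that shape, assuming the frrcontainer.FRR type exposes a Name field (illustrative only, not the final code):

```go
// ContainerByName returns the single container matching name, or an error if
// it is not present. Sketch only; the error could also be a panic, as suggested above.
func ContainerByName(frrs []*frrcontainer.FRR, name string) (*frrcontainer.FRR, error) {
	for _, f := range frrs {
		if f.Name == name {
			return f, nil
		}
	}
	return nil, fmt.Errorf("frr container %q not found", name)
}
```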

Contributor Author

done

@@ -42,13 +40,16 @@ func FRRK8sPods(cs clientset.Interface) ([]*corev1.Pod, error) {
return res, nil
}

func RestartFRRK8sPods(cs clientset.Interface) error {
func RestartFRRK8sPods(cs clientset.Interface, node string) error {
Member

I'd rework this to accept one pod, and have a wrapper that restarts all the pods.

Contributor Author

Can you elaborate a bit more? I do not understand the suggestion, thanks!

Member

Ignore the wrapper part. I thought we were using this beyond the gr tests.

I'd like this to do less. So let's have one function for RestartFRRK8sPod(cs, pod) and a way to retrieve the pod running on a given node (if we don't have it already).

From outside it's going to be

pod := podForNode(node)
restartPod(pod)

which I think is more straightforward. Also, given that we are talking about one single pod here, the waiting logic below can be simplified.
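A rough sketch of the pod-for-node lookup suggested here, reusing the existing FRRK8sPods helper (the function name is just illustrative):

```go
// FRRK8sPodOnNode returns the frr-k8s pod scheduled on the given node.
// Sketch only: it filters the result of the existing FRRK8sPods helper by Spec.NodeName.
func FRRK8sPodOnNode(cs clientset.Interface, nodeName string) (*corev1.Pod, error) {
	pods, err := FRRK8sPods(cs)
	if err != nil {
		return nil, err
	}
	for _, p := range pods {
		if p.Spec.NodeName == nodeName {
			return p, nil
		}
	}
	return nil, fmt.Errorf("no frr-k8s pod found on node %s", nodeName)
}
```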

Contributor Author

I have removed the unneeded loop logic. Nevertheless, everything stays in the same function because I need the node name to get the pod (oldPod), then delete the pod, then need the node name again to fetch the new pod (newPod), and finally check that old and new are different and that the new pod is healthy.
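A sketch of that flow, with the waiting collapsed into a single poll (it assumes the file's existing imports plus k8s.io/apimachinery/pkg/util/wait; FRRK8sPodOnNode and podIsReady are assumed helper names for illustration):

```go
// RestartFRRK8sPod deletes the frr-k8s pod on nodeName and waits until a new,
// ready pod (different UID) is running on the same node. Sketch only.
func RestartFRRK8sPod(cs clientset.Interface, nodeName string) error {
	oldPod, err := FRRK8sPodOnNode(cs, nodeName)
	if err != nil {
		return err
	}
	if err := cs.CoreV1().Pods(oldPod.Namespace).Delete(context.TODO(), oldPod.Name, metav1.DeleteOptions{}); err != nil {
		return err
	}
	return wait.PollUntilContextTimeout(context.TODO(), time.Second, 2*time.Minute, false,
		func(ctx context.Context) (bool, error) {
			newPod, err := FRRK8sPodOnNode(cs, nodeName)
			if err != nil {
				return false, nil // not rescheduled yet, keep polling
			}
			if newPod.UID == oldPod.UID {
				return false, nil // still the old pod
			}
			return podIsReady(newPod), nil // assumed readiness helper
		})
}
```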

})

ginkgo.Context("When restarting the frrk8s deamon pods", func() {

ginkgo.DescribeTable("external BGP peer maintains routes", func(ipFam ipfamily.Family, prefix string) {
frrs := config.ContainersForVRF(infra.FRRContainers, "")
frrs := config.ContainersForName(infra.FRRContainers, "ebgp-single-hop")
Member

If it's a single container, there's no need for the for loop below.

@karampok karampok force-pushed the slim-gr-test branch 2 times, most recently from 38f0f3c to 52ffda7 on October 21, 2024 08:19
func RestartFRRK8sPods(cs clientset.Interface) error {
pods, err := FRRK8sPods(cs)
func RestartFRRK8sPod(cs clientset.Interface, nodeName string) error {
podForNode := func() (*corev1.Pod, error) {
Member

let's have this defined outside (below)

Contributor Author

done

@@ -42,38 +40,43 @@ func FRRK8sPods(cs clientset.Interface) ([]*corev1.Pod, error) {
return res, nil
}

func RestartFRRK8sPods(cs clientset.Interface) error {
pods, err := FRRK8sPods(cs)
func RestartFRRK8sPod(cs clientset.Interface, nodeName string) error {
Member

Let's rename this "RestartFRRK8sPodForNode".

Contributor Author

done

@karampok karampok force-pushed the slim-gr-test branch 2 times, most recently from 514841b to 73e5f1f on October 21, 2024 08:30
This e2e test flakes often, especially when the GitHub Actions runners
are slow. Therefore we reduce the overall peering to make the test
require fewer resources. What this commit does:
1. The GR test will only use peering from a single node to ebgp-single-hop
(so we will run 4x fewer `docker exec` calls to grab the routes).
2. The GR test will restart a single frr-k8s pod and will monitor that single
route, waiting only for that pod to be healthy.
3. Given we target GR between a specific host and a specific peer, we
know which logs to look at in case of failure.

Signed-off-by: karampok <[email protected]>
@karampok
Contributor Author

The metallb_e2e job fails again with the same error:

[FAILED] Timed out after 120.000s.
  Unexpected error:
      <*errors.errorString | 0xc00045d680>: 
      metric frrk8s_bgp_session_up not found
      {
          s: "metric frrk8s_bgp_session_up not found",
      }
  occurred

Is it a known issue?

@fedepaol
Member

This e2e test flakes often, especially when the GitHub Actions runners are slow. Therefore we reduce the overall peering to make the test require fewer resources. What this commit does:

1. The GR test will only use peering from a single node to ebgp-single-hop
   (so we will run 4x fewer `docker exec` calls to grab the routes).

Is this a real time-consuming issue?

2. The GR test will restart a single frr-k8s pod and will monitor that single
   route, waiting only for that pod to be healthy.

3. Given we target GR between a specific host and a specific peer, we
   know which logs to look at in case of failure.

This shouldn't be a blocker, if we have the proper failure information.

All in all, I think we should cover all 4 combinations (i/e bgp, single/multi hop). The easiest is to iterate over the containers and run this test (in this PR's version) only once. Possible optimizations are to understand whether that docker exec is really the offender and, if so, to see if we can spin up goroutines to make it faster (see the sketch below).

If I had to bet, restarting a pod takes a lot of time but collecting routes should not. WDYT?
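For reference, a minimal sketch of the goroutine idea (illustrative only; RoutesForContainer is a hypothetical stand-in for whatever helper runs the docker exec today, not an existing function in the repo):

```go
// routesForAll collects the route dump from every container in parallel rather
// than sequentially, so one slow docker exec does not delay the others.
func routesForAll(frrs []*frrcontainer.FRR) (map[string]string, error) {
	var (
		mu  sync.Mutex
		res = make(map[string]string, len(frrs))
	)
	g := new(errgroup.Group)
	for _, c := range frrs {
		c := c // capture loop variable
		g.Go(func() error {
			out, err := RoutesForContainer(c) // hypothetical docker exec wrapper
			if err != nil {
				return err
			}
			mu.Lock()
			res[c.Name] = out
			mu.Unlock()
			return nil
		})
	}
	return res, g.Wait()
}
```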

@fedepaol
Member

The metallb_e2e job fails again with the same error:

[FAILED] Timed out after 120.000s.
  Unexpected error:
      <*errors.errorString | 0xc00045d680>: 
      metric frrk8s_bgp_session_up not found
      {
          s: "metric frrk8s_bgp_session_up not found",
      }
  occurred

Is it a known issue?

Could be related to the change of the metrics port in #206.
I don't remember if I overrode the test or if it passed in that PR. Will check, but it's probably independent from this. I'll file a dummy PR to verify it.

@karampok
Contributor Author

karampok commented Oct 22, 2024

Is this a real time-consuming issue?

Update: correcting myself, it is not time consuming; I was thinking of the CPU resources. Did I say it is time consuming?

If we run the tests one after another (e.g. for the ibgp-single peer: restart the pod, check routes OK, check pod OK; then start again for ebgp, and so on), the spike in system resources will be lower (over a period of 2 minutes, I run 120 docker exec calls times the number of containers).

If I had to bet, restarting a pod takes a lot of time but collecting routes should not. WDYT?

I would say it is 70% that the pod takes a long time, 30% that a docker exec fails. The difference here is that the route check (the docker exec returning routes) must always succeed; a single failure fails the test. No other test requires that granularity (Eventually just retries, so temporary failures are hidden).

We do need the route collection to happen as fast as possible, with the pod restart running in the background, in order to observe any route removal (bgp graceful-restart preserve-fw-state).
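A sketch of that pattern as it could look inside the test body, assuming gomega (as used by the rest of the suite) and the RestartFRRK8sPodForNode helper from this PR; peerHasRoute is a placeholder for the docker exec route check:

```go
// Restart the pod in the background and keep asserting that the external peer
// still has the route for the whole window; any single miss fails the test.
restartDone := make(chan error, 1)
go func() {
	restartDone <- k8s.RestartFRRK8sPodForNode(cs, nodeName)
}()

Consistently(func() error {
	return peerHasRoute(peerContainer, prefix) // placeholder route check via docker exec
}, 2*time.Minute, time.Second).ShouldNot(HaveOccurred())

Expect(<-restartDone).NotTo(HaveOccurred())
```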

Also, to be on the safe side, the 2m or 1m window is hardcoded, because I cannot stop the test earlier once the pod is ready.

One suggestion is that in this PR I only reduce the pod restart to one node, and we observe how flaky the test is before we refactor further.

@karampok
Contributor Author

Another idea would be to change the checking logic: instead of checking that the routes exist, I would check `show ip route`, which includes the age of the route, and fail if the age has been reset.

frr-k8s-control-plane# show ip route
Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, F - PBR,
       f - OpenFabric,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure

K>* 0.0.0.0/0 [0/0] via 172.18.0.1, eth0, 16:23:46
C * 10.244.0.1/32 is directly connected, veth6f5e688f, 16:23:46
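
A rough sketch of how that age check could look in Go, parsing the sample output above (the regexp and the HH:MM:SS format handling are my assumptions, not a tested parser):

```go
// routeAge extracts the uptime of a selected ("K>*", "B>*", ...) route for the
// given prefix from "show ip route" output; the caller can fail the test if the
// age resets during the graceful restart window.
var selectedRoute = regexp.MustCompile(`^[A-Z]>\*\s+(\S+)\s+.*,\s+(\d+):(\d{2}):(\d{2})$`)

func routeAge(output, prefix string) (time.Duration, error) {
	for _, line := range strings.Split(output, "\n") {
		m := selectedRoute.FindStringSubmatch(strings.TrimSpace(line))
		if m == nil || m[1] != prefix {
			continue
		}
		h, _ := strconv.Atoi(m[2])
		min, _ := strconv.Atoi(m[3])
		s, _ := strconv.Atoi(m[4])
		return time.Duration(h)*time.Hour + time.Duration(min)*time.Minute + time.Duration(s)*time.Second, nil
	}
	return 0, fmt.Errorf("prefix %s not found in output", prefix)
}
```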

Not sure if we could leverage this in another test.

@karampok karampok closed this Oct 28, 2024
@karampok
Contributor Author

#218 should improve the test so that it does not fail.
