
E2E: slim the graceful restart test #210

Closed

Conversation

karampok
Contributor

@karampok karampok commented Oct 17, 2024

This e2e test flakes often, especially when the GitHub Actions runners
are slow. Therefore we reduce the overall peering to make the test
require fewer resources. What this commit does:

  1. The GR test will only use peering from a single node to ebgp-single-hop
    (so we will run 4x fewer `docker exec` calls to grab the routes).
  2. The GR test will restart a single frr-k8s pod and will monitor that single
    route, waiting only for that pod to be healthy.
  3. Given we target GR between a specific host and a specific peer, we
    know which logs to look at in case of failure.

/kind flake

What this PR does / why we need it:

Special notes for your reviewer:

Release note:

NONE

@karampok karampok force-pushed the slim-gr-test branch 2 times, most recently from 1e6eb25 to 9c37383 on October 17, 2024 09:07
@karampok
Contributor Author

As part of the testing I have run the test 10 times (https://github.com/metallb/frr-k8s/actions/runs/11381447010?pr=210) and observed no failures.

@karampok karampok marked this pull request as ready for review October 18, 2024 12:41
@karampok
Contributor Author

@fedepaol the failing test seems unrelated. Ready for feedback, thanks

@fedepaol
Member

Can you explain in the commit message the rationale for why this improves the stability of the test? Here we are only declaring what we are doing, not why.

@@ -136,3 +136,14 @@ func ContainersForVRF(frrs []*frrcontainer.FRR, vrf string) []*frrcontainer.FRR
}
return res
}

func ContainersForName(frrs []*frrcontainer.FRR, name string) []*frrcontainer.FRR {
Member

If here we are selecting one single container, this should be ContainerByName and return a single container (and error or panic if it doesn't find it).
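A minimal sketch of that shape, assuming the frrcontainer.FRR type exposes a Name field (illustrative only, not the final code):

```go
// ContainerByName returns the single container matching name, or an error if
// it is not present. Sketch only; the error could also be a panic, as suggested above.
func ContainerByName(frrs []*frrcontainer.FRR, name string) (*frrcontainer.FRR, error) {
	for _, f := range frrs {
		if f.Name == name {
			return f, nil
		}
	}
	return nil, fmt.Errorf("frr container %q not found", name)
}
```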

Contributor Author

done

@@ -42,13 +40,16 @@ func FRRK8sPods(cs clientset.Interface) ([]*corev1.Pod, error) {
return res, nil
}

func RestartFRRK8sPods(cs clientset.Interface) error {
func RestartFRRK8sPods(cs clientset.Interface, node string) error {
Member

I'd rework this to accept one pod, and have a wrapper that restarts all the pods.

Contributor Author

Can you elaborate a bit more? I do not understand the suggestion, thanks!

Member

Ignore the wrapper part. I thought we were using this beyond the gr tests.

I'd like this to do less. So let's have one function for RestartFRRK8sPod(cs, pod) and a way to retrieve the pod running on a given node (if we don't have it already).

From outside it's going to be

pod := podForNode(node)
restartPod(pod)

which I think is more straightforward. Also, given that we are talking about one single pod here, the waiting logic below can be simplified.
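A rough sketch of the pod-for-node lookup suggested here, reusing the existing FRRK8sPods helper (the function name is just illustrative):

```go
// FRRK8sPodOnNode returns the frr-k8s pod scheduled on the given node.
// Sketch only: it filters the result of the existing FRRK8sPods helper by Spec.NodeName.
func FRRK8sPodOnNode(cs clientset.Interface, nodeName string) (*corev1.Pod, error) {
	pods, err := FRRK8sPods(cs)
	if err != nil {
		return nil, err
	}
	for _, p := range pods {
		if p.Spec.NodeName == nodeName {
			return p, nil
		}
	}
	return nil, fmt.Errorf("no frr-k8s pod found on node %s", nodeName)
}
```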

Contributor Author

I have removed the unneeded loop logic. Nevertheless, everything stays in the same function because I need the node name to get the pod (oldPod), then delete the pod, then need the node name again to fetch the new pod (newPod), and finally check that old and new are different and that the new pod is healthy.
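A sketch of that flow, with the waiting collapsed into a single poll (it assumes the file's existing imports plus k8s.io/apimachinery/pkg/util/wait; FRRK8sPodOnNode and podIsReady are assumed helper names for illustration):

```go
// RestartFRRK8sPod deletes the frr-k8s pod on nodeName and waits until a new,
// ready pod (different UID) is running on the same node. Sketch only.
func RestartFRRK8sPod(cs clientset.Interface, nodeName string) error {
	oldPod, err := FRRK8sPodOnNode(cs, nodeName)
	if err != nil {
		return err
	}
	if err := cs.CoreV1().Pods(oldPod.Namespace).Delete(context.TODO(), oldPod.Name, metav1.DeleteOptions{}); err != nil {
		return err
	}
	return wait.PollUntilContextTimeout(context.TODO(), time.Second, 2*time.Minute, false,
		func(ctx context.Context) (bool, error) {
			newPod, err := FRRK8sPodOnNode(cs, nodeName)
			if err != nil {
				return false, nil // not rescheduled yet, keep polling
			}
			if newPod.UID == oldPod.UID {
				return false, nil // still the old pod
			}
			return podIsReady(newPod), nil // assumed readiness helper
		})
}
```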

})

ginkgo.Context("When restarting the frrk8s deamon pods", func() {

ginkgo.DescribeTable("external BGP peer maintains routes", func(ipFam ipfamily.Family, prefix string) {
frrs := config.ContainersForVRF(infra.FRRContainers, "")
frrs := config.ContainersForName(infra.FRRContainers, "ebgp-single-hop")
Member

If it's a single container, there's no need for the for loop below.

@karampok karampok force-pushed the slim-gr-test branch 2 times, most recently from 38f0f3c to 52ffda7 on October 21, 2024 08:19
func RestartFRRK8sPods(cs clientset.Interface) error {
pods, err := FRRK8sPods(cs)
func RestartFRRK8sPod(cs clientset.Interface, nodeName string) error {
podForNode := func() (*corev1.Pod, error) {
Member

let's have this defined outside (below)

Contributor Author

done

@@ -42,38 +40,43 @@ func FRRK8sPods(cs clientset.Interface) ([]*corev1.Pod, error) {
return res, nil
}

func RestartFRRK8sPods(cs clientset.Interface) error {
pods, err := FRRK8sPods(cs)
func RestartFRRK8sPod(cs clientset.Interface, nodeName string) error {
Member

Let's rename this "RestartFRRK8sPodForNode".

Contributor Author

done

@karampok karampok force-pushed the slim-gr-test branch 2 times, most recently from 514841b to 73e5f1f on October 21, 2024 08:30
This e2e test flakes often, especially when the GitHub Actions runners
are slow. Therefore we reduce the overall peering to make the test
require fewer resources. What this commit does:
1. The GR test will only use peering from a single node to ebgp-single-hop
(so we will run 4x fewer `docker exec` calls to grab the routes).
2. The GR test will restart a single frr-k8s pod and will monitor that single
route, waiting only for that pod to be healthy.
3. Given we target GR between a specific host and a specific peer, we
know which logs to look at in case of failure.

Signed-off-by: karampok <[email protected]>
@karampok
Contributor Author

The metallb_e2e job fails again with the same error:

[FAILED] Timed out after 120.000s.
  Unexpected error:
      <*errors.errorString | 0xc00045d680>: 
      metric frrk8s_bgp_session_up not found
      {
          s: "metric frrk8s_bgp_session_up not found",
      }
  occurred

Is it a known issue?

@fedepaol
Member

This e2e test flakes often, especially when the GitHub Actions runners are slow. Therefore we reduce the overall peering to make the test require fewer resources. What this commit does:

1. The GR test will only use peering from a single node to ebgp-single-hop
   (so we will run 4x fewer `docker exec` calls to grab the routes).

Is this a real time-consuming issue?

2. The GR test will restart a single frr-k8s pod and will monitor that single
   route, waiting only for that pod to be healthy.

3. Given we target GR between a specific host and a specific peer, we
   know which logs to look at in case of failure.

This shouldn't be a blocker, if we have the proper failure information.

All in all, I think we should cover all 4 combinations (i/e bgp, single/multi hop). The easiest is to iterate over the containers and run this test (in this PR's version) only once. Possible optimizations are to understand whether that docker exec is really the offender and, if so, to see if we can spin up goroutines to make it faster (see the sketch below).

If I had to bet, restarting a pod takes a lot of time but collecting routes should not. WDYT?
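For reference, a minimal sketch of the goroutine idea (illustrative only; RoutesForContainer is a hypothetical stand-in for whatever helper runs the docker exec today, not an existing function in the repo):

```go
// routesForAll collects the route dump from every container in parallel rather
// than sequentially, so one slow docker exec does not delay the others.
func routesForAll(frrs []*frrcontainer.FRR) (map[string]string, error) {
	var (
		mu  sync.Mutex
		res = make(map[string]string, len(frrs))
	)
	g := new(errgroup.Group)
	for _, c := range frrs {
		c := c // capture loop variable
		g.Go(func() error {
			out, err := RoutesForContainer(c) // hypothetical docker exec wrapper
			if err != nil {
				return err
			}
			mu.Lock()
			res[c.Name] = out
			mu.Unlock()
			return nil
		})
	}
	return res, g.Wait()
}
```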

@fedepaol
Member

The metallb_e2e job fails again with the same error:

[FAILED] Timed out after 120.000s.
  Unexpected error:
      <*errors.errorString | 0xc00045d680>: 
      metric frrk8s_bgp_session_up not found
      {
          s: "metric frrk8s_bgp_session_up not found",
      }
  occurred

Is it a known issue?

Could be related to the change of the metrics port in #206.
I don't remember if I overrode the test or if it passed in that PR. Will check, but it's probably independent from this. I'll file a dummy PR to verify it.

@karampok
Contributor Author

karampok commented Oct 22, 2024

Is this a real time-consuming issue?

Update: correcting myself, it is not time consuming; I was thinking of the CPU resources. Did I say it is time consuming?

If we run the tests one after another (e.g. for the ibgp-single peer: restart the pod, check routes OK, check pod OK; then start again for ebgp, and so on), the spike in system resources will be lower (over a period of 2 minutes, I run 120 docker exec calls times the number of containers).

If I had to bet, restarting a pod takes a lot of time but collecting routes should not. WDYT?

I would say it is 70% that the pod takes a long time, 30% that a docker exec fails. The difference here is that the route check (the docker exec returning routes) must always succeed; a single failure fails the test. No other test requires that granularity (Eventually just retries, so temporary failures are hidden).

We do need the route collection to happen as fast as possible, with the pod restart running in the background, in order to observe any route removal (bgp graceful-restart preserve-fw-state).
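A sketch of that pattern as it could look inside the test body, assuming gomega (as used by the rest of the suite) and the RestartFRRK8sPodForNode helper from this PR; peerHasRoute is a placeholder for the docker exec route check:

```go
// Restart the pod in the background and keep asserting that the external peer
// still has the route for the whole window; any single miss fails the test.
restartDone := make(chan error, 1)
go func() {
	restartDone <- k8s.RestartFRRK8sPodForNode(cs, nodeName)
}()

Consistently(func() error {
	return peerHasRoute(peerContainer, prefix) // placeholder route check via docker exec
}, 2*time.Minute, time.Second).ShouldNot(HaveOccurred())

Expect(<-restartDone).NotTo(HaveOccurred())
```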

Also, to be on the safe side, the 2m or 1m window is hardcoded, because I cannot stop the test earlier once the pod is ready.

One suggestion is that in this PR I only reduce the pod restart to one node, and we observe how flaky the test is before we refactor further.

@karampok
Contributor Author

Another idea would be to change the checking logic: instead of checking that the routes exist, I would check `show ip route`, which includes the age of the route, and fail if the age has been reset.

frr-k8s-control-plane# show ip route
Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, F - PBR,
       f - OpenFabric,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure

K>* 0.0.0.0/0 [0/0] via 172.18.0.1, eth0, 16:23:46
C * 10.244.0.1/32 is directly connected, veth6f5e688f, 16:23:46
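
A rough sketch of how that age check could look in Go, parsing the sample output above (the regexp and the HH:MM:SS format handling are my assumptions, not a tested parser):

```go
// routeAge extracts the uptime of a selected ("K>*", "B>*", ...) route for the
// given prefix from "show ip route" output; the caller can fail the test if the
// age resets during the graceful restart window.
var selectedRoute = regexp.MustCompile(`^[A-Z]>\*\s+(\S+)\s+.*,\s+(\d+):(\d{2}):(\d{2})$`)

func routeAge(output, prefix string) (time.Duration, error) {
	for _, line := range strings.Split(output, "\n") {
		m := selectedRoute.FindStringSubmatch(strings.TrimSpace(line))
		if m == nil || m[1] != prefix {
			continue
		}
		h, _ := strconv.Atoi(m[2])
		min, _ := strconv.Atoi(m[3])
		s, _ := strconv.Atoi(m[4])
		return time.Duration(h)*time.Hour + time.Duration(min)*time.Minute + time.Duration(s)*time.Second, nil
	}
	return 0, fmt.Errorf("prefix %s not found in output", prefix)
}
```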

Not sure if we could leverage this in another test.

@karampok karampok closed this Oct 28, 2024
@karampok
Contributor Author

#218 should improve the test so that it does not fail.
