
e2e:daemonset: pod phase and events #1289

Open · wants to merge 4 commits into master from node_inspector_events

Conversation

@Tal-or (Contributor) commented Jan 22, 2025

In case the DaemonSet's pods are not running for some reason, we want to log each pod's phase and events for better observability.

This could help us find the root cause of why a pod didn't start.

Signed-off-by: Talor Itzhak [email protected]

@openshift-ci openshift-ci bot requested review from rbaturov and yanirq January 22, 2025 13:50
Before:

    testlog.Infof("daemonset %q %q desired %d scheduled %d ready %d", namespace, name, ds.Status.DesiredNumberScheduled, ds.Status.CurrentNumberScheduled, ds.Status.NumberReady)
    return ds.Status.DesiredNumberScheduled > 0 && ds.Status.DesiredNumberScheduled == ds.Status.NumberReady, nil

After:

    isRunning := ds.Status.DesiredNumberScheduled > 0 && ds.Status.DesiredNumberScheduled == ds.Status.NumberReady
    if !isRunning {
Contributor:

style: everything else being equal, please align the happy path to the left: https://medium.com/@matryer/line-of-sight-in-code-186dd7cdea88
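The convention from the linked article can be illustrated with a minimal, self-contained sketch (the `fetchConfig`/`nested`/`aligned` names are hypothetical, not from this PR): handle the error first and return early, so the success path stays flush with the left margin.

```go
package main

import (
	"errors"
	"fmt"
)

// fetchConfig is a stand-in for any fallible lookup.
func fetchConfig(name string) (string, error) {
	if name == "" {
		return "", errors.New("empty name")
	}
	return "config-for-" + name, nil
}

// nested keeps the happy path indented inside a condition.
func nested(name string) string {
	cfg, err := fetchConfig(name)
	if err == nil {
		return cfg // happy path, one level deep
	}
	return "default"
}

// aligned handles the error first and returns early, so the
// happy path reads straight down the left margin.
func aligned(name string) string {
	cfg, err := fetchConfig(name)
	if err != nil {
		return "default"
	}
	return cfg // happy path, flush left
}

func main() {
	fmt.Println(nested("ds"), aligned("ds")) // config-for-ds config-for-ds
	fmt.Println(nested(""), aligned(""))     // default default
}
```

Both functions behave identically; the point of the style is purely readability of the common case.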

@Tal-or (Contributor, Author) — Jan 22, 2025
The happy path is isRunning == true, which is the opposite of !isRunning.

@Tal-or (Contributor, Author):

I moved the logic into a separate function to make it more readable. I hope it's ok now

@Tal-or force-pushed the node_inspector_events branch from 3e0bea5 to 963cc89 on January 22, 2025 14:10
@ffromani (Contributor) commented Jan 22, 2025

thanks, we can go a step further (ofc not blocking)

func IsRunning(ctx context.Context, cli client.Client, namespace, name string) (bool, error) {
	ds, err := GetByName(ctx, cli, namespace, name)
	if err != nil {
		if k8serrors.IsNotFound(err) {
			testlog.Warningf("daemonset %q %q not found - retrying", namespace, name)
			return false, nil
		}
		return false, err
	}
	if isRunning := ds.Status.DesiredNumberScheduled > 0 && ds.Status.DesiredNumberScheduled == ds.Status.NumberReady; !isRunning {
		return false, logEventsAndPhase(ctx, cli, ds)
	}
	return true, nil
}

func logEventsAndPhase(ctx context.Context, cli client.Client, ds *appsv1.DaemonSet) error {
	podList := &corev1.PodList{}
	err := cli.List(ctx, podList, &client.ListOptions{
		Namespace:     ds.Namespace,
		LabelSelector: labels.SelectorFromSet(ds.Spec.Selector.MatchLabels),
	})
	if err != nil {
		return fmt.Errorf("failed to list pods for DaemonSet %s; %w", client.ObjectKeyFromObject(ds).String(), err)
	}
	for _, pod := range podList.Items {
		if pod.Status.Phase != corev1.PodRunning {
			podKey := client.ObjectKeyFromObject(&pod).String()
			testlog.Warningf("daemonset %s pod %s is not running, expected status %s got %s", ds.Name, podKey, corev1.PodRunning, pod.Status.Phase)
			podEvents, err := events.GetEventsForObject(cli, pod.Namespace, pod.Name, string(pod.UID))
			if err != nil {
				return fmt.Errorf("failed to list events for Pod %s; %w", podKey, err)
			}
			for _, event := range podEvents.Items {
				testlog.Warningf("-> %s %s %s", event.Action, event.Reason, event.Message)
			}
		}
	}
	return nil
}

@Tal-or (Contributor, Author) commented Jan 22, 2025

> thanks, we can go a step further (ofc not blocking)

Could you please highlight the diff from the existing implementation?

Add context from the caller instead of local
`context.TODO()`.

Signed-off-by: Talor Itzhak <[email protected]>
Fewer iterations mean less spamming, especially within the nodeinspector, which logs some data on every iteration.

Signed-off-by: Talor Itzhak <[email protected]>
In case the DaemonSet's pods are not running for some reason, we want to log each pod's phase and events for better observability.

This could help us find the root cause of why a pod didn't start.

Signed-off-by: Talor Itzhak <[email protected]>
@Tal-or force-pushed the node_inspector_events branch from 963cc89 to 7de0434 on January 22, 2025 14:51
@ffromani (Contributor)

> Could you please highlight the diff from the existing implementation?

basically same logic, but keep pushing the happy path leftwards

@Tal-or (Contributor, Author) commented Jan 22, 2025

> basically same logic, but keep pushing the happy path leftwards

It does, unless I'm missing something.

@Tal-or

Tal-or commented Jan 23, 2025

/retest
infra issues

openshift-ci bot commented Jan 23, 2025

@Tal-or: all tests passed!



ffromani commented Feb 3, 2025

/approve

openshift-ci bot commented Feb 3, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ffromani, Tal-or


@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 3, 2025
@ffromani
Copy link
Contributor

ffromani commented Feb 3, 2025

/label acknowledge-critical-fixes-only

test code can't break the product, and it actually helps improve product quality

@openshift-ci openshift-ci bot added the acknowledge-critical-fixes-only Indicates if the issuer of the label is OK with the policy. label Feb 3, 2025
Labels
acknowledge-critical-fixes-only — Indicates if the issuer of the label is OK with the policy.
approved — Indicates a PR has been approved by an approver from all required OWNERS files.
3 participants