
Add guest agent ping wait #63

Merged
merged 6 commits into kiagnose:main from add_guest_agent_ping_wait on Apr 4, 2024

Conversation

RamLavi
Collaborator

@RamLavi RamLavi commented Apr 1, 2024

This PR introduces a new wait mechanism to make sure the checkup starts only after the VMI is properly configured for realtime.
Instead of waiting for the VMI guest-agent condition to be true, it introduces a guest-agent-ping probe that polls for the marker file signaling that the VMI is ready.
The boot script is moved to the cloud-init context for easier maintenance.
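
For reference, below is a minimal sketch of how such a readiness probe could be built with the KubeVirt Go API, assuming an exec command run through the qemu-guest-agent that checks for the marker file. The helper name and marker path are illustrative; the timing values match the ones quoted in the review below.

package checkup // illustrative package placement

import (
	k8scorev1 "k8s.io/api/core/v1"
	kvcorev1 "kubevirt.io/api/core/v1"
)

// newBootScriptReadinessProbe is a sketch of a readiness probe that polls,
// through the qemu-guest-agent, for the marker file written by the boot
// script. The VMI only reaches the Ready condition once the file exists.
func newBootScriptReadinessProbe(markerFilePath string) *kvcorev1.Probe {
	return &kvcorev1.Probe{
		Handler: kvcorev1.Handler{
			// Exec probes on a VMI run inside the guest via the guest agent,
			// which is why guest-agent exec is enabled on the guest image in
			// a later commit.
			Exec: &k8scorev1.ExecAction{
				Command: []string{"cat", markerFilePath},
			},
		},
		FailureThreshold:    30,
		InitialDelaySeconds: 90,
		PeriodSeconds:       10,
		TimeoutSeconds:      10,
	}
}

Such a probe would be set as spec.readinessProbe on the vmi-under-test, so the existing "wait for Ready" logic also covers the boot-script configuration.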

RamLavi added 2 commits March 28, 2024 15:25
Currently the VM guests are configured by a service added in the
customize-vm script created during the image build process.
There is no real need for the script to run in that context, and it is
easier to manage if it runs as part of the cloud-init context.
This commit:
- copies the script to a checkup function.
- adds the boot script to the vmi-under-test VM.
- mounts the script with the correct permissions.

Signed-off-by: Ram Lavi <[email protected]>
Currently the VM guests are configured by a service added in the
customize-vm script created during the image build process.
There is no real need for the script to run in that context, and it is
easier to manage if it runs as part of the cloud-init context.
This commit:
- runs the boot script in the cloud-init context.
- removes it from the images' customize-vm scripts.

Signed-off-by: Ram Lavi <[email protected]>
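
For context, here is a rough Go sketch of how a boot script can be embedded in cloud-init user-data; the script path, permissions, and layout are illustrative and not the exact code from this PR.

package checkup // illustrative package placement

import "strings"

// buildCloudInitUserData is a sketch: write_files drops the boot script on
// disk with the desired permissions, and runcmd executes it on first boot.
// The target path is a placeholder.
func buildCloudInitUserData(bootScript string) string {
	var sb strings.Builder
	sb.WriteString("#cloud-config\n")
	sb.WriteString("write_files:\n")
	sb.WriteString("  - path: /usr/local/bin/realtime-boot-script.sh\n")
	sb.WriteString("    permissions: \"0744\"\n")
	sb.WriteString("    content: |\n")
	for _, line := range strings.Split(bootScript, "\n") {
		sb.WriteString("      " + line + "\n")
	}
	sb.WriteString("runcmd:\n")
	sb.WriteString("  - [\"/usr/local/bin/realtime-boot-script.sh\"]\n")
	return sb.String()
}

The resulting string would typically be attached to the vmi-under-test as a cloudInitNoCloud volume (with a matching disk), so the script runs inside the guest without touching the container disk image.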
@RamLavi
Collaborator Author

RamLavi commented Apr 1, 2024

Passes CI on a CNV 4.16 cluster:

make e2e-test
mkdir -p /home/ralavi/go/src/github.com/kiagnose/kubevirt-realtime-checkup/_go-cache
podman run --rm \
	-v /home/ralavi/go/src/github.com/kiagnose/kubevirt-realtime-checkup:/go/src/github.com/kiagnose/kubevirt-realtime-checkup:Z \
	-v /home/ralavi/go/src/github.com/kiagnose/kubevirt-realtime-checkup/_go-cache:/root/.cache/go-build:Z \
	-v /home/ralavi/.kube/sno03-cnvqe2-rdu2:/root/.kube:Z,ro \
	--workdir /go/src/github.com/kiagnose/kubevirt-realtime-checkup \
	-e KUBECONFIG=/root/.kube/kubeconfig \
	-e TEST_NAMESPACE=realtime-checkup-1 \
	-e TEST_CHECKUP_IMAGE=quay.io/ramlavi/kubevirt-realtime-checkup:devel \
	-e VM_UNDER_TEST_CONTAINER_DISK_IMAGE=quay.io/ramlavi/kubevirt-realtime-checkup-vm:latest \
	docker.io/library/golang:1.20.12 \
	go test -v ./tests/... -test.v -test.timeout=1h -ginkgo.v -ginkgo.timeout=1h 
=== RUN   TestTests
Running Suite: Tests Suite - /go/src/github.com/kiagnose/kubevirt-realtime-checkup/tests
========================================================================================
Random Seed: 1711973458

Will run 1 of 1 specs
------------------------------
[BeforeSuite] 
/go/src/github.com/kiagnose/kubevirt-realtime-checkup/tests/tests_suite_test.go:39
[BeforeSuite] PASSED [0.005 seconds]
------------------------------
Checkup execution should complete successfully
/go/src/github.com/kiagnose/kubevirt-realtime-checkup/tests/checkup_test.go:80
• [747.988 seconds]
------------------------------

Ran 1 of 1 Specs in 747.994 seconds
SUCCESS! -- 1 Passed | 0 Failed | 0 Pending | 0 Skipped
--- PASS: TestTests (748.00s)
PASS
ok  	github.com/kiagnose/kubevirt-realtime-checkup/tests	748.012s

logs:

2024/04/01 12:11:01 kubevirt-realtime-checkup starting...
2024/04/01 12:11:01 Using the following config:
2024/04/01 12:11:01 	"vmUnderTestTargetNodeName": ""
2024/04/01 12:11:01 	"vmUnderTestContainerDiskImage": "quay.io/ramlavi/kubevirt-realtime-checkup-vm:latest"
2024/04/01 12:11:01 	"oslatDuration": "10m0s"
2024/04/01 12:11:01 	"oslatLatencyThresholdMicroSeconds": "45µs"
2024/04/01 12:11:01 Creating ConfigMap "realtime-checkup-1/realtime-vm-config"...
W0401 12:11:02.034552       1 warnings.go:70] metadata.finalizers: "foregroundDeleteVirtualMachine": prefer a domain-qualified finalizer name to avoid accidental conflicts with other finalizer writers
2024/04/01 12:11:02 Waiting for VMI "realtime-checkup-1/realtime-vmi-under-test-cmv5k" to be ready...
2024/04/01 12:13:07 VMI "realtime-checkup-1/realtime-vmi-under-test-cmv5k" has successfully reached ready condition
2024/04/01 12:13:07 Login to VMI under test...
2024/04/01 12:13:12 VMI under test guest kernel Args: cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt2)/vmlinuz-4.18.0-348.rt7.130.el8.x86_64 root=UUID=a9332d7d-1762-41cd-a702-6b2cc556c248 ro console=tty0 rd_NO_PLYMOUTH crashkernel=auto resume=UUID=d4ac9572-e828-4c87-9b76-e59c9fa6e426 console=ttyS0,115200 skew_tick=1 tsc=reliable rcupdate.rcu_normal_after_boot=1 isolcpus=managed_irq,domain,2-3 intel_pstate=disable nosoftlockup nohz=on nohz_full=2-3 rcu_nocbs=2-3 irqaffinity=0,1
[root@realtime-vmi-under-test-cmv5k ~]# 
2024/04/01 12:13:12 Running Oslat test on VMI under test for 10m0s...
2024/04/01 12:23:14 Oslat test completed:
taskset -c 2-3 oslat --cpu-list 2-3 --rtprio 1 --duration 10m0s --workload memmove --workload-mem 4K 
oslat V 2.60
Total runtime: 		600 seconds
Thread priority: 	SCHED_FIFO:1
CPU list: 		2-3
CPU for main thread: 	0
Workload: 		memmove
Workload mem: 		4 (KiB)
Preheat cores: 		2

Pre-heat for 1 seconds...
Test starts...
Test completed.

        Core:	 2 3
Counter Freq:	 2096 2096 (MHz)
    001 (us):	 5095739427 5095738846
    002 (us):	 0 0
    003 (us):	 254 253
    004 (us):	 85 84
    005 (us):	 15 19
    006 (us):	 32 36
    007 (us):	 23 20
    008 (us):	 1 1
    009 (us):	 2 1
    010 (us):	 1 1
    011 (us):	 0 0
    012 (us):	 0 0
    013 (us):	 0 0
    014 (us):	 0 0
    015 (us):	 0 0
    016 (us):	 0 0
    017 (us):	 0 0
    018 (us):	 0 0
    019 (us):	 0 0
    020 (us):	 0 0
    021 (us):	 0 0
    022 (us):	 0 0
    023 (us):	 0 0
    024 (us):	 0 2
    025 (us):	 0 0
    026 (us):	 0 0
    027 (us):	 0 0
    028 (us):	 0 0
    029 (us):	 0 0
    030 (us):	 0 0
    031 (us):	 0 0
    032 (us):	 0 0 (including overflows)
     Minimum:	 0 0 (us)
     Average:	 1.000 1.000 (us)
     Maximum:	 9 23 (us)
     Max-Min:	 9 23 (us)
    Duration:	 599.733 599.733 (sec)

[root@realtime-vmi-under-test-cmv5k ~]# 
2024/04/01 12:23:14 Max Oslat Latency measured: 23µs
2024/04/01 12:23:14 Trying to delete VMI: "realtime-checkup-1/realtime-vmi-under-test-cmv5k"
2024/04/01 12:23:14 Deleting ConfigMap "realtime-checkup-1/realtime-vm-config"...
2024/04/01 12:23:14 Waiting for VMI "realtime-checkup-1/realtime-vmi-under-test-cmv5k" to be deleted...
2024/04/01 12:23:19 VMI "realtime-checkup-1/realtime-vmi-under-test-cmv5k" was deleted successfully

@RamLavi RamLavi marked this pull request as ready for review April 1, 2024 17:47
@RamLavi RamLavi changed the title [WIP] Add guest agent ping wait Add guest agent ping wait Apr 2, 2024
@RamLavi RamLavi requested a review from orelmisan April 2, 2024 09:07
Member

@orelmisan orelmisan left a comment


Overall looks good

@@ -288,6 +288,7 @@ func generateBootScript() string {
sb.WriteString(" tuned_conf=\"/etc/tuned/realtime-virtual-guest-variables.conf\"\n")
sb.WriteString(" echo \"isolated_cores=" + isolatedCores + "\" > \"$tuned_conf\"\n")
sb.WriteString(" echo \"isolate_managed_irq=Y\" >> \"$tuned_conf\"\n")
sb.WriteString(" systemctl restart tuned.service\n")
Member

What is the difference between the real-time and DPDK checkup with regard to this change?

Member

service agter setting the

Typo?

Collaborator Author

What is the difference between the real-time and DPDK checkup with regard to this change?

This change should probably be made in the DPDK checkup as well. We didn't see the problem there, but that may only be due to timing considerations.

Typo?

Fixed typo.

Member

If this command is async, we should wait for it to finish.

Comment on lines +269 to +272
FailureThreshold: 30,
InitialDelaySeconds: 90,
PeriodSeconds: 10,
TimeoutSeconds: 10,
Member

Are these values the same as in the DPDK checkup?

Collaborator Author

Yep, saw no reason to change them.

RamLavi added 4 commits April 4, 2024 11:42
Sometimes the tuned service is not up by the time tuned-adm is run. In order
to make sure it is running with the correct conf file, restart the
service after setting the conf and before running tuned-adm.

Signed-off-by: Ram Lavi <[email protected]>
This commit enables the guest-agent-exec option on the guest, in order
to use the probe polling option that will be introduced in later
commits.

Signed-off-by: Ram Lavi <[email protected]>
When the boot script is done, after tuned-adm is set and the reboot, the
BootScriptReadinessMarkerFileFullPath marker file is created.
In order for it to be polled [0] by the guest-agent ping in later commits,
add the proper SELinux context to this file.

[0]
https://kubevirt.io/user-guide/virtual_machines/liveness_and_readiness_probes/#defining-guest-agent-ping-probes

Signed-off-by: Ram Lavi <[email protected]>
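
As a rough illustration of this change, the tail of the generated boot script might gain lines like the ones below. Both arguments are placeholders; the concrete marker path and SELinux type used by the checkup are not shown here.

package checkup // illustrative package placement

import "strings"

// appendReadinessMarker sketches what could be appended to the generated
// boot script: once the tuned configuration has been applied, create the
// marker file and relabel it so the qemu-guest-agent is allowed to read it
// when the readiness probe polls for it.
func appendReadinessMarker(sb *strings.Builder, markerPath, markerSELinuxType string) {
	sb.WriteString(" touch \"" + markerPath + "\"\n")
	sb.WriteString(" chcon -t " + markerSELinuxType + " \"" + markerPath + "\"\n")
}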
Currently the checkup setup only waits for the VMI to finish "booting",
i.e. for the guest-agent service to be ready.
However, this is not enough to ensure that the VMI has been configured,
a procedure currently done by the cloud-init service.
When the configuration is complete, the script configuring the guest in
the cloud-init service adds a marker file.
Waiting for the "ready" condition is stronger than waiting for
"guest-agent", since the guest-agent ping needs the guest agent in order
to set the VMI to ready.

This commit:
- introduces a new waiting mechanism, using a guest-agent-ping probe [0]
to poll the guest until the file is present, and only then set the
VMI ready condition to true.
- adds a wait for the VMI ready condition to be true.
- removes the now unnecessary wait for the guest-agent-ready condition.

[0]
https://kubevirt.io/user-guide/virtual_machines/liveness_and_readiness_probes/#defining-guest-agent-ping-probes

Signed-off-by: Ram Lavi <[email protected]>
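
A minimal sketch of the Ready-condition check this describes, using the KubeVirt Go API (the helper and its placement are illustrative; the actual checkup code may differ):

package checkup // illustrative package placement

import (
	k8scorev1 "k8s.io/api/core/v1"
	kvcorev1 "kubevirt.io/api/core/v1"
)

// vmiIsReady reports whether the VMI has its Ready condition set to True.
// With the guest-agent based readiness probe in place, this only happens
// after the boot script has written its marker file.
func vmiIsReady(vmi *kvcorev1.VirtualMachineInstance) bool {
	for _, condition := range vmi.Status.Conditions {
		if condition.Type == kvcorev1.VirtualMachineInstanceReady &&
			condition.Status == k8scorev1.ConditionTrue {
			return true
		}
	}
	return false
}

In practice a check like this would be wrapped in a poll with a timeout, replacing the wait on the guest-agent condition described above.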
@RamLavi RamLavi force-pushed the add_guest_agent_ping_wait branch from 8bf7ae1 to e84137f Compare April 4, 2024 08:45
@RamLavi
Collaborator Author

RamLavi commented Apr 4, 2024

Change: Typo in commit message

Member

@orelmisan orelmisan left a comment


Thank you @RamLavi

@@ -288,6 +288,7 @@ func generateBootScript() string {
sb.WriteString(" tuned_conf=\"/etc/tuned/realtime-virtual-guest-variables.conf\"\n")
sb.WriteString(" echo \"isolated_cores=" + isolatedCores + "\" > \"$tuned_conf\"\n")
sb.WriteString(" echo \"isolate_managed_irq=Y\" >> \"$tuned_conf\"\n")
sb.WriteString(" systemctl restart tuned.service\n")
Member

If this command is async, we should wait for it to finish.

@RamLavi
Collaborator Author

RamLavi commented Apr 4, 2024

@orelmisan I believe that systemctl is blocking, otherwise it wouldn't have this option:

$ sudo systemctl --help | grep no-block
     --no-block          Do not wait until operation finished

@RamLavi RamLavi merged commit 8cc5be0 into kiagnose:main Apr 4, 2024
6 checks passed