[CI] Use Kubernetes GC to clean kubevirt VMs (packet-* jobs) #11530

VannTen · 2024-09-13T10:58:37Z

What type of PR is this?
/kind feature

What this PR does / why we need it:
We regularly have CI flakes where the job fails to delete the k8s namespace in the CI cluster.
It's not much, but it's a little hiccup in the PR process which I'd like to eliminate.

I'm not sure what the exact reason is; probably a race between jobs, in the window between fetching the list of namespaces and deleting them.
Regardless, a simpler way to delete the VMs is to make them dependents (in the Kubernetes sense) of the job pod. This way, once the job pod is deleted, Kubernetes garbage collection in the CI cluster takes care of removing the associated VMs.
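
For illustration, a minimal sketch of what "dependent of the job pod" means at the manifest level. This is not the code from this PR: the helper name, the pod name, and the UID below are hypothetical placeholders. It only shows the `metadata.ownerReferences` entry that Kubernetes garbage collection looks at; a namespaced dependent has to live in the same namespace as its owner.

```python
import copy


def add_owner_reference(manifest, pod_name, pod_uid):
    """Return a copy of `manifest` that declares the given pod as its owner.

    Hypothetical helper for illustration only. The four keys below are the
    required fields of a Kubernetes ownerReference.
    """
    owned = copy.deepcopy(manifest)
    metadata = owned.setdefault("metadata", {})
    metadata.setdefault("ownerReferences", []).append(
        {
            "apiVersion": "v1",
            "kind": "Pod",
            "name": pod_name,
            "uid": pod_uid,
        }
    )
    return owned


# Placeholder VMI manifest; name and namespace are illustrative values.
vm = {
    "apiVersion": "kubevirt.io/v1",
    "kind": "VirtualMachineInstance",
    "metadata": {"name": "test-vm-example", "namespace": "gitlab-runner"},
}

# Placeholder pod name and UID; in a real job they would come from the
# running job pod, for example through the downward API.
owned_vm = add_owner_reference(
    vm, "runner-job-pod", "00000000-0000-0000-0000-000000000000"
)
print(owned_vm["metadata"]["ownerReferences"])
```

Once the owner pod is deleted, objects carrying such a reference become eligible for garbage collection, so no explicit namespace or VM deletion step is needed in the job.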

Special notes for your reviewer:
PR on the ci infra kubespray/kspray-infra#1 (private repo, maintainers have access)

Does this PR introduce a user-facing change?:

NONE

/label tide/merge-method-merge

k8s-ci-robot · 2024-09-13T10:58:45Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: VannTen

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [VannTen]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

VannTen · 2024-09-13T11:02:25Z

/ok-to-test

VannTen · 2024-09-13T11:03:12Z

/cc @ant31

VannTen · 2024-09-13T12:44:52Z

/retest
(now that I fixed the gitlab-runner config)

VannTen · 2024-09-13T19:27:40Z

/retest

VannTen · 2024-09-20T13:46:10Z

/label ci-full

(To test it works correctly for everything)

k8s-ci-robot · 2024-09-20T13:46:13Z

@VannTen: The label(s) /label ci-full cannot be applied. These labels are supported: api-review, tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash, team/katacoda, refactor. Is this label configured under labels -> additional_labels or labels -> restricted_labels in plugin.yaml?

In response to this:

/label ci-full

(To test it works correctly for everything)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

VannTen · 2024-09-20T13:53:35Z

@ant31 What's the process to add the ci-{extended,full} labels? I can't find them with a quick grep.

ant31 · 2024-09-23T09:05:55Z

I think the PR to add them via /label is still in review

For now you can add them manually

VannTen · 2024-09-23T09:11:11Z

I'll do that once the initial set of tests pass then 👍

VannTen · 2024-10-19T15:23:12Z

The SSH DNS errors are weird: it should use the IP of the VMI, not its name. I think this comes from the kubevirt dynamic inventory.

VannTen · 2024-10-19T15:32:40Z

/retest-failed

This is to test whether it's something racy or systematic on those tests.

/hold

tico88612 · 2024-10-21T15:30:27Z

/retest-failed

tico88612 · 2024-10-21T15:40:51Z

@ant31 Sometimes, /retest-failed cannot retest the partially failed jobs.
Is there any other way to solve this than /retest? (/retest will refresh the status, but it wastes time.)

tico88612 · 2024-10-21T17:21:16Z

The version of Cilium used by packet_debian11-custom-cni is outdated. (1.13.0)
Maybe fixed by #11654

ant31 · 2024-10-25T14:13:40Z

/retest-failed

ant31 · 2024-10-30T08:55:06Z

I guess ci-full is broken on all PRs. Let's fix it in a different PR.

ant31 · 2024-10-30T08:55:29Z

/retest

VannTen · 2024-11-04T09:58:53Z

Retrying a test with the error
test-vm-5mz4g | UNREACHABLE! => {
    "changed": false,
    "msg": "Failed to connect to the host via ssh: ssh: Could not resolve hostname test-vm-5mz4g: Name or service not known",
    "unreachable": true
}
got rid of the error.

Which isn't good, because it means the failure seems to be random/racy.

Same with:
TASK [Wait until SSH is available] *********************************************
fatal: [test-vm-7ll6t -> localhost]: FAILED! => {"changed": false, "elapsed": 240, "msg": "Timeout when waiting for localhost:22"}
fatal: [test-vm-pcmvv -> localhost]: FAILED! => {"changed": false, "elapsed": 240, "msg": "Timeout when waiting for localhost:22"}

In that case the VMs have IPs 🤔

This allows a single source of truth for the virtual machines in a kubevirt ci-run. `etcd_member_name` should be correctly handled in kubespray-defaults for testing the recover cases.

We should not roll back our test setup during upgrade tests. The only reason to do that would be incompatible changes in the test inventory, and we already check out master for those (${CI_JOB_NAME}.yml). Also do some cleanup by removing unnecessary intermediary variables.

The new CI does not define the k8s_cluster group, so it relies on kubernetes-sigs#11559. This does not work for upgrade testing (which uses the previous release). We can revert this commit after 2.27.0.

https://docs.ansible.com/ansible/latest/reference_appendices/config.html#envvar-ANSIBLE_VERBOSITY

increase ansible verbosity for debugging kubevirt dynamic inventory

VannTen · 2024-11-04T13:50:41Z

It looks like the IPs of the VirtualMachineInstances are disappearing, and I think the kubevirt dynamic inventory reacts to that by using the name for ansible_host.

This might be related to kubevirt/kubevirt#12698: the guest OSes on which this happens match the ones in the issue description (alma / rocky), plus opensuse, which is not in that issue.

VannTen · 2024-11-04T14:04:48Z

By running kubectl get vmis -A --watch during a CI run, I see stuff like this:

gitlab-runner   test-vm-8g65n                                       25s   Running      10.11.195.176   k3         True
gitlab-runner   test-vm-8g65n                                       25s   Running                      k3         True

So this really looks like the IP momentarily disappears.
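
A rough sketch of how such drops could be spotted automatically in captured watch output. It assumes the column layout shown above (namespace, name, age, phase, IP, node, ready), no header line, and whitespace-separated fields, so a row missing its IP has six fields instead of seven. The function name is made up for this example, and a row that has an IP but no node yet would be misread, so this is a debugging aid rather than a robust parser.

```python
def find_ip_drops(watch_lines):
    """Return (namespace, name) pairs that had an IP and then lost it.

    Hypothetical helper: relies on the 7-column layout of the watch output
    quoted above, where a row without an IP collapses to 6 fields.
    """
    last_ip = {}
    dropped = []
    for line in watch_lines:
        fields = line.split()
        if len(fields) == 7:
            namespace, name, _age, _phase, ip, _node, _ready = fields
        elif len(fields) == 6:
            namespace, name, _age, _phase, _node, _ready = fields
            ip = None
        else:
            # Unknown shape (for example a VMI not scheduled yet): skip it.
            continue
        key = (namespace, name)
        if ip is None and last_ip.get(key) is not None and key not in dropped:
            dropped.append(key)
        last_ip[key] = ip
    return dropped


lines = [
    "gitlab-runner   test-vm-8g65n   25s   Running      10.11.195.176   k3         True",
    "gitlab-runner   test-vm-8g65n   25s   Running                      k3         True",
]
print(find_ip_drops(lines))
```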

ant31 · 2024-11-05T08:35:43Z

we probably want to add some delay here and there and have some kind of retry at this step

VannTen · 2024-11-05T09:04:48Z

That's what the last commit does. But if the retry works and then the IP disappears, we're back where we started ^. Can't think of an effective workaround for now 🤔
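
To make the limitation concrete, here is a sketch of a stricter kind of retry: only accept an IP once it has been seen on several consecutive polls. This is not what the commit does; `get_ip`, `required`, and the other parameters are hypothetical names for this example. Even this only narrows the window, since the IP can still vanish after the check has passed.

```python
import time


def wait_for_stable_ip(get_ip, required=3, interval=1.0, timeout=60.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll `get_ip()` until the same non-empty IP is seen `required` times in a row.

    Returns the IP, or None if the timeout expires first. `get_ip` is any
    callable returning the current IP string or None (hypothetical interface).
    """
    deadline = clock() + timeout
    last = None
    streak = 0
    while clock() < deadline:
        ip = get_ip()
        if ip and ip == last:
            streak += 1
        elif ip:
            streak = 1  # first sighting, or the IP changed
        else:
            streak = 0  # IP missing: start over
        last = ip
        if streak >= required:
            return ip
        sleep(interval)
    return None


# Simulated readings: the IP shows up, disappears once, then stays.
readings = iter(["10.0.0.1", None, "10.0.0.1", "10.0.0.1", "10.0.0.1"])
result = wait_for_stable_ip(lambda: next(readings), required=3,
                            interval=0, sleep=lambda _seconds: None)
print(result)
```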

ant31 · 2024-11-05T12:56:32Z

maybe upgrading kubevirt ?

VannTen · 2024-11-05T12:58:42Z

Maybe. But the linked bug still being open does not make me very hopeful... (there is always a chance we're not hitting that one specifically, but I'm not counting on it).

k8s-ci-robot requested review from tico88612 and yankay September 13, 2024 10:58

k8s-ci-robot added the `approved` (Indicates a PR has been approved by an approver from all required OWNERS files.) and `size/L` (Denotes a PR that changes 100-499 lines, ignoring generated files.) labels Sep 13, 2024

k8s-ci-robot added the `ok-to-test` (Indicates a non-member PR verified by an org member that is safe to test.) label Sep 13, 2024

k8s-ci-robot requested a review from ant31 September 13, 2024 11:03

VannTen force-pushed the ci/cleanup_with_k8s_gc branch 2 times, most recently from fd43216 to d70f8e2 Compare September 20, 2024 13:45

VannTen force-pushed the ci/cleanup_with_k8s_gc branch from d70f8e2 to bbfd93a Compare September 20, 2024 13:50

VannTen force-pushed the ci/cleanup_with_k8s_gc branch 2 times, most recently from b67b0b2 to 25bb5d0 Compare September 20, 2024 14:56

VannTen mentioned this pull request Sep 21, 2024

Only require minimum structure in inventory, compute the rest #11559

Merged

VannTen force-pushed the ci/cleanup_with_k8s_gc branch from 25bb5d0 to 32a0dfc Compare September 23, 2024 10:17

VannTen force-pushed the ci/cleanup_with_k8s_gc branch 3 times, most recently from 81ac78a to d1ca52f Compare October 4, 2024 07:18

tico88612 mentioned this pull request Oct 19, 2024

Discussion: Upgrade dependencies version policy #11644

Open

k8s-ci-robot added the `do-not-merge/hold` (Indicates that a PR should not merge because someone has issued a /hold command.) label Oct 19, 2024

tico88612 mentioned this pull request Oct 21, 2024

Fix debian11-custom-cni failing test & upgrade debian12-custom-cni-helm chart version #11654

Merged

VannTen mentioned this pull request Oct 25, 2024

Remove shell module usage from CI testcases #11667

Merged

ant31 removed the `ci-full` (Run every available tests) label Oct 30, 2024

VannTen force-pushed the ci/cleanup_with_k8s_gc branch from e732e60 to 1c02b25 Compare November 4, 2024 12:20

VannTen added 5 commits November 4, 2024 14:31

CI: use kubevirt.core dynamic inventory

9ef618f

This allows a single source of truth for the virtual machines in a kubevirt ci-run. `etcd_member_name` should be correctly handled in kubespray-defaults for testing the recover cases.

CI: workaround for upgrade test backward compatibility

b449f31

The new CI does not define the k8s_cluster group, so it relies on kubernetes-sigs#11559. This does not work for upgrade testing (which uses the previous release). We can revert this commit after 2.27.0.

CI: directly use ANSIBLE_VERBOSITY instead of tweaking command line

2ea7cf1

https://docs.ansible.com/ansible/latest/reference_appendices/config.html#envvar-ANSIBLE_VERBOSITY

DO NOT MERGE

3af5007

increase ansible verbosity for debugging kubevirt dynamic inventory

VannTen force-pushed the ci/cleanup_with_k8s_gc branch from 1c02b25 to 3af5007 Compare November 4, 2024 13:31

CI - Workaround instable IPs in kubevirt VMIs

13f0332

k8s-ci-robot added the `size/XL` (Denotes a PR that changes 500-999 lines, ignoring generated files.) label and removed the `size/L` (Denotes a PR that changes 100-499 lines, ignoring generated files.) label Nov 4, 2024
