Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI] Use Kubernetes GC to clean kubevirt VMs (packet-* jobs) #11530

Open
wants to merge 10 commits into
base: master
Choose a base branch
from

Conversation

VannTen
Copy link
Contributor

@VannTen VannTen commented Sep 13, 2024

What type of PR is this?
/kind feature

What this PR does / why we need it:
We regularly have CI flakes where the job failed to delete k8s namespace in the CI cluster.
It's not much, but it's a little hiccup in the PR process which I'd like to eliminate.

I'm not sure what the exact reason is, probably some race between the jobs and the time between fetching the list of namespace and the deletion.
Regardless, a simpler way to delete the VMs is to let them be dependants (in the kubernetes sense) of the job pod. This way, once the job pod is deleted, kubernetes garbage collection in the CI cluster will take care of removing the associated VMs

Special notes for your reviewer:
PR on the ci infra kubespray/kspray-infra#1 (private repo, maintainers have access)

Does this PR introduce a user-facing change?:

NONE

/label tide/merge-method-merge

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/feature Categorizes issue or PR as related to a new feature. tide/merge-method-merge Denotes a PR that should use a standard merge by tide when it merges. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Sep 13, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: VannTen

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 13, 2024
@VannTen
Copy link
Contributor Author

VannTen commented Sep 13, 2024

/ok-to-test

@k8s-ci-robot k8s-ci-robot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Sep 13, 2024
@VannTen
Copy link
Contributor Author

VannTen commented Sep 13, 2024

/cc @ant31

@VannTen
Copy link
Contributor Author

VannTen commented Sep 13, 2024

/retest
(now that I fixed the gitlab-runner config)

@VannTen
Copy link
Contributor Author

VannTen commented Sep 13, 2024 via email

@VannTen VannTen force-pushed the ci/cleanup_with_k8s_gc branch 2 times, most recently from fd43216 to d70f8e2 Compare September 20, 2024 13:45
@VannTen
Copy link
Contributor Author

VannTen commented Sep 20, 2024

/label ci-full

(To test it works correctly for everything)

@k8s-ci-robot
Copy link
Contributor

@VannTen: The label(s) /label ci-full cannot be applied. These labels are supported: api-review, tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash, team/katacoda, refactor. Is this label configured under labels -> additional_labels or labels -> restricted_labels in plugin.yaml?

In response to this:

/label ci-full

(To test it works correctly for everything)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@VannTen
Copy link
Contributor Author

VannTen commented Sep 20, 2024 via email

@ant31
Copy link
Contributor

ant31 commented Sep 23, 2024

I think the PR to add them via /label is still in review

For now you can add them manually
image

@VannTen
Copy link
Contributor Author

VannTen commented Sep 23, 2024

I'll do that once the initial set of tests pass then 👍

@VannTen VannTen force-pushed the ci/cleanup_with_k8s_gc branch 3 times, most recently from 81ac78a to d1ca52f Compare October 4, 2024 07:18
@VannTen
Copy link
Contributor Author

VannTen commented Oct 19, 2024 via email

@VannTen
Copy link
Contributor Author

VannTen commented Oct 19, 2024 via email

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 19, 2024
@tico88612
Copy link
Member

/retest-failed

@tico88612
Copy link
Member

@ant31 Sometimes, /retest-failed cannot retest the partially failed jobs.
Is there any other way to solve this than /retest? (/retest will refresh the status, but it wastes time.)

image

@tico88612
Copy link
Member

tico88612 commented Oct 21, 2024

The version of Cilium used by packet_debian11-custom-cni is outdated. (1.13.0)
Maybe fixed by #11654

@ant31
Copy link
Contributor

ant31 commented Oct 25, 2024

/retest-failed

@ant31 ant31 removed the ci-full Run every available tests label Oct 30, 2024
@ant31
Copy link
Contributor

ant31 commented Oct 30, 2024

I guess ci-full is broken on all PR. Let's fix it on a different PR.

@ant31
Copy link
Contributor

ant31 commented Oct 30, 2024

/retest

@VannTen
Copy link
Contributor Author

VannTen commented Nov 4, 2024

Retrying a test with the error
test-vm-5mz4g | UNREACHABLE! => {
"changed": false,
"msg": "Failed to connect to the host via ssh: ssh: Could not resolve hostname test-vm-5mz4g: Name or service not known",
"unreachable": true
}
got rid of the error.

Which isn't good because it seems to be random/racey

Same with:
TASK [Wait until SSH is available] *********************************************
fatal: [test-vm-7ll6t -> localhost]: FAILED! => {"changed": false, "elapsed": 240, "msg": "Timeout when waiting for localhost:22"}
fatal: [test-vm-pcmvv -> localhost]: FAILED! => {"changed": false, "elapsed": 240, "msg": "Timeout when waiting for localhost:22"}

In that case the VM have IPs 🤔

This allows a single source of truth for the virtual machines in a
kubevirt ci-run.

`etcd_member_name` should be correctly handled in kubespray-defaults for
testing the recover cases.
We should not rollback our test setup during upgrade test.
The only reason to do that would be for incompatible changes in the test
inventory, and we already checkout master for those (${CI_JOB_NAME}.yml)

Also do some cleanup by removing unnecessary intermediary variables
The new CI does not define k8s_cluster group, so it relies on
kubernetes-sigs#11559.

This does not work for upgrade testing (which use the previous release).
We can revert this commit after 2.27.0
increase ansible verbosity for debugging kubevirt dynamic inventory
@VannTen
Copy link
Contributor Author

VannTen commented Nov 4, 2024

It looks like the ip of VirtualMachineInstance are disappearing and I think the kubevirt dynamic inventory interpret that as using the name for ansible_host.

This might be related to kubevirt/kubevirt#12698 , the guest OS on which this happens match the one in the issue description (alma / rocky) (and opensuse, not in that issue).

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 4, 2024
@VannTen
Copy link
Contributor Author

VannTen commented Nov 4, 2024

By doing a kubectl get vmis -A --watch during a CI runs, I see stuff like this

gitlab-runner   test-vm-8g65n                                       25s   Running      10.11.195.176   k3         True
gitlab-runner   test-vm-8g65n                                       25s   Running                      k3         True

So this really looks like the IP momentarily dissapear

@ant31
Copy link
Contributor

ant31 commented Nov 5, 2024

we probably want to add some delay here and there and have some kind of retry at this step

@VannTen
Copy link
Contributor Author

VannTen commented Nov 5, 2024 via email

@ant31
Copy link
Contributor

ant31 commented Nov 5, 2024

maybe upgrading kubevirt ?

@VannTen
Copy link
Contributor Author

VannTen commented Nov 5, 2024 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. kind/feature Categorizes issue or PR as related to a new feature. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note-none Denotes a PR that doesn't merit a release note. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. tide/merge-method-merge Denotes a PR that should use a standard merge by tide when it merges.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants