[CI] Use Kubernetes GC to clean kubevirt VMs (packet-* jobs) #11530
base: master
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: VannTen. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
/ok-to-test
/cc @ant31
/retest
/retest
Force-pushed from fd43216 to d70f8e2
/label ci-full (to test it works correctly for everything)
@VannTen: The label(s) In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Force-pushed from d70f8e2 to bbfd93a
@ant31
What's the process to add the ci-{extended,full} labels? Can't find them with a quick grep.
Force-pushed from b67b0b2 to 25bb5d0
I'll do that once the initial set of tests passes, then 👍
Force-pushed from 25bb5d0 to 32a0dfc
Force-pushed from 81ac78a to d1ca52f
The SSH DNS errors are weird; it should not use the name of the VMI but the IP, which I think comes from the kubevirt dynamic inventory.
/retest-failed
This is to test whether it's something racy or systematic on those tests.
/hold
/retest-failed
@ant31 Sometimes,
The version of Cilium used by
/retest-failed
I guess ci-full is broken on all PRs. Let's fix it in a different PR.
/retest
Retrying a test with the error … which isn't good, because it seems to be random/racy. Same with: … In that case the VMs have IPs 🤔
Force-pushed from e732e60 to 1c02b25
This allows a single source of truth for the virtual machines in a kubevirt CI run (a sketch of such an inventory source follows this commit list). `etcd_member_name` should be correctly handled in kubespray-defaults for testing the recovery cases.
We should not roll back our test setup during the upgrade test. The only reason to do that would be incompatible changes in the test inventory, and we already check out master for those (${CI_JOB_NAME}.yml). Also do some cleanup by removing unnecessary intermediary variables.
The new CI does not define the k8s_cluster group, so it relies on kubernetes-sigs#11559. This does not work for upgrade testing (which uses the previous release). We can revert this commit after 2.27.0.
increase ansible verbosity for debugging kubevirt dynamic inventory
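As referenced in the first commit message of this push, the idea is that the cluster itself, queried through a dynamic inventory, becomes the single source of truth for the VMs of a CI run. The snippet below is only a rough sketch of what such an inventory source can look like, assuming the kubevirt.core collection's inventory plugin is used; the namespace is a placeholder, not the actual CI value.

```yaml
# inventory.kubevirt.yml -- illustrative sketch, not the actual CI file.
# The plugin lists the VirtualMachineInstances in the given namespace and
# builds the Ansible inventory from them, so no static host list has to be
# kept in sync with what actually runs in the cluster.
plugin: kubevirt.core.kubevirt
connections:
  - namespaces:
      - kubespray-ci-12345   # placeholder: the per-job CI namespace
```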
Force-pushed from 1c02b25 to 3af5007
It looks like the IPs of the VirtualMachineInstances are disappearing, and I think the kubevirt dynamic inventory interprets that as falling back to the name for ansible_host. This might be related to kubevirt/kubevirt#12698; the guest OSes on which this happens match the ones in the issue description (Alma / Rocky) (and openSUSE, which is not in that issue).
By doing a
So this really looks like the IP momentarily disappearing.
We probably want to add some delay here and there and have some kind of retry at this step.
That's what the last commit does. But if the retry works and then the IP disappears, we're back where we started ^. Can't think of an effective workaround for now 🤔
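For illustration of the kind of retry being discussed (a hedged sketch, not the actual commit), one option is to poll the VMI until it reports an interface IP before the play goes on to SSH. It assumes the kubernetes.core collection is available; `vmi_name` and `ci_namespace` are placeholder variables.

```yaml
# Sketch only: wait until the VMI reports at least one interface IP, so the
# dynamic inventory does not fall back to the VMI name for ansible_host.
- name: Wait for the VirtualMachineInstance to report an IP address
  kubernetes.core.k8s_info:
    api_version: kubevirt.io/v1
    kind: VirtualMachineInstance
    name: "{{ vmi_name }}"           # placeholder
    namespace: "{{ ci_namespace }}"  # placeholder
  register: vmi_info
  retries: 30
  delay: 10
  until: >-
    vmi_info.resources | length > 0 and
    (vmi_info.resources[0].status.interfaces | default([])
     | selectattr('ipAddress', 'defined') | list | length > 0)
```

As noted above, this only helps if the IP stays present once it has shown up; it does nothing against the IP vanishing again mid-run.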
Maybe upgrading kubevirt?
Maybe. But the linked bug still being open is not making me very hopeful... (there is always a chance we're not hitting that specifically, but I'm not counting on it).
What type of PR is this?
/kind feature
What this PR does / why we need it:
We regularly have CI flakes where the job fails to delete the k8s namespace in the CI cluster.
It's not much, but it's a little hiccup in the PR process which I'd like to eliminate.
I'm not sure what the exact reason is; probably some race between the jobs, in the window between fetching the list of namespaces and the deletion.
Regardless, a simpler way to delete the VMs is to make them dependents (in the Kubernetes sense) of the job pod. This way, once the job pod is deleted, Kubernetes garbage collection in the CI cluster will take care of removing the associated VMs.
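To make the mechanism concrete, this is roughly what the dependency looks like on the VM object; a hedged sketch, not the actual manifest from the CI templates. The pod name, UID, and namespace are placeholders that the job would fill in at creation time, and for a namespaced owner like a Pod the owner must live in the same namespace as the dependent object.

```yaml
# Sketch: a VirtualMachine owned by the CI job pod. When the pod goes away,
# the Kubernetes garbage collector deletes the VM (and its VMI) as well,
# so the job needs no explicit cleanup step.
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: instance-1                # placeholder
  namespace: kubespray-ci-12345   # placeholder: same namespace as the job pod
  ownerReferences:
    - apiVersion: v1
      kind: Pod
      name: ci-job-pod            # placeholder: the CI job pod's name
      uid: 0f7c9a1e-...           # placeholder: must match the live pod's UID
# spec (runStrategy, template, ...) omitted for brevity
```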
Special notes for your reviewer:
PR on the CI infra: kubespray/kspray-infra#1 (private repo, maintainers have access)
Does this PR introduce a user-facing change?:
/label tide/merge-method-merge