-
Notifications
You must be signed in to change notification settings - Fork 523
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CNV-47438: ipam, virt: graduate persistent ips feature gate to GA #2058
CNV-47438: ipam, virt: graduate persistent ips feature gate to GA #2058
Conversation
Hello @maiqueb! Some important instructions when contributing to openshift/api: |
69d8740
to
8717515
Compare
/test verify |
@JoelSpeed o/ We have added periodic lanes to trigger the test for this feature gate in this PR, which we plan on GAing for 4.18. The tests are quite thorough, and were added in this other PR. Unfortunately, while the tests themselves pass, it seems the monitor tests fail constantly due to the monitor tests. We have been trying to get this sorted out, but everytime we look, another component of OpenShift Virt throws a new event / alarm / fails to adhere to conformance testing - OpenShift virt uses their own conformance testing framework, and the hypershift lanes we have in origin do not run monitor tests afaiu. This is a list of items we have gone through in the last weeks:
As a result, we plan on having the OpenShift Virt QE manually sign-off the feature is stable enough. This is a heads-up so you know our intentions. Do you foresee any issues with this path ? How does the process look like for the QE signoff ? Just setting |
Looking at prow I can only see two runs of the periodics you've added, that, doesn't seem right? Both have failed and seem to be complaining that there are no tests to run? We normally ask to see test results in sippy, could you see if you can find any test data in sippy for the tests you've added? TRT may be able to help you if you can't work this out |
I can't see any evidence of this, please link |
Surely; if you open the |
I'd say this is on 4.19, where I am running the cluster without TP feature set.
|
Ahh I see now, right, so these tests are passing, but don't show up in sippy, which means we can't actually track or visualise them. I've started some threads in slack in #forum-ocp-release-oversight. I'd like to understand why this isn't in sippy, once we have it in sippy, we can visualise your tests, and, looking at the results, it may actually satisfy the requirements to pass the verify naturally |
Oh nice. Hopefully :D |
@maiqueb: This pull request references CNV-47438 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.19.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
/jira refresh |
@maiqueb: This pull request references CNV-47438 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.19.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
Update from CNV QE: The scenario I ran for verifying persistent IP after restart:
For migration I ran almost the same scenario, with the difference of stopping the iperf3 session in both client and server before migrating the VM, and restarted it (using the same persistent IP address) after the migration was completed. |
/test verify |
/hold If we're facing SCC pinning failures on components, that's a very serious problem that in 4.19 can cause non-functional clusters with very high bandwidth costs as components attempt to create pods with unexpected SCCs that don't conform to PSA. This happens in customer environments with custom SCCs, so passing CI doesn't mean that the bug is "safe". It just means it is latent. This has happened in the past and was extremely costly. Get PRs open to fix the SCC pinning annotations. |
While the desire is to GA the feature in 4.18, if it breaks on 4.19, we've simply created a problem for ourselves. Let's get a PR up adding the annotations for the failing workloads in https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_release/58873/rehearse-58873-pull-ci-openshift-ovn-kubernetes-release-4.19-e2e-aws-ovn-virt-techpreview/1864239266813448192 |
We've got the ball rolling on these violations with kubevirt/cluster-network-addons-operator#1931 and more will be coming. These changes are important and lack of this pinning carries some risk in 4.18 and substantial risk in 4.19. Given 4.18 timeframes and a commitment to address the problem in 4.19, we can promote this feature once the rest of the testing functions. /hold cancel |
/retest |
The periodic this morning failed on a persistent IPs test, can you please investigate and let me know here why this failed this time? https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-ovn-kubernetes-release-4.19-periodics-e2e-aws-ovn-virt-techpreview-periodic/1864460089788731392 |
The TL;DR; is we're hitting this bug: https://issues.redhat.com/browse/OCPBUGS-42609 on the primary UDN feature. The failing test is using the following parameters:
We create a server pod - e2e-test-network-segmentation-e2e-jljp9/testpod-ip-10-0-38-150.ec2.internal - which is supposed to have 2 network attachments:
From the CNI result (which we can see on the OVN-K logs) we can see that only the default cluster network attachment was added:
The VM is trying to reach the server pod using the primary UDN, which was not plumbed into the pod. As indicated in the bug, this race can happen. I think I'll add some code to the tests after provisioning the NAD / UDN, and before provisioning any workload to assure that an OVN logical entity for the UDN was created. This means that the NAD/UDN was processed, and that while the network itself might not be ready, it will eventually be. A FWIW - the network segmentation feature gate is also prone to this race, and I expect to see this happening on their lanes as well. |
/hold holding so that we can ensure https://issues.redhat.com/browse/OCPBUGS-42609 is resolved so that the feature works reliably before we make it accessible by default. |
/hold cancel it's an issue with UDN enabled, but with something called a secondary network it seems to work reliably? It sure would be nice if everything worked at the same time.... |
/test verify |
/payload-aggregate periodic-ci-openshift-ovn-kubernetes-release-4.19-periodics-e2e-aws-ovn-virt-techpreview-periodic 10 |
@JoelSpeed: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/b1380e20-b875-11ef-9100-1cf11bd6ba9c-0 |
2/10 jobs from the aggregate earlier had a failure specifically rated to this feature, see https://prow.ci.openshift.org/view/gs/test-platform-results/logs/openshift-api-2058-per[…]-e2e-aws-ovn-virt-techpreview-periodic/1867157953270779904 and https://prow.ci.openshift.org/view/gs/test-platform-results/logs/openshift-api-2058-per[…]-e2e-aws-ovn-virt-techpreview-periodic/1867157940671090688 |
/payload-aggregate periodic-ci-openshift-ovn-kubernetes-release-4.19-periodics-e2e-aws-ovn-virt-periodic 10 |
@JoelSpeed: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/a5b3c130-cd55-11ef-9453-12bb6b8577d4-0 |
Signed-off-by: Miguel Duarte Barroso <[email protected]>
8717515
to
b701a1f
Compare
/payload-aggregate periodic-ci-openshift-ovn-kubernetes-release-4.19-periodics-e2e-aws-ovn-virt-periodic 10 |
@JoelSpeed: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/37cb4550-cd9e-11ef-850c-8b451ba1c95d-0 |
@maiqueb: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
/payload-aggregate periodic-ci-openshift-ovn-kubernetes-release-4.19-periodics-e2e-aws-ovn-virt-periodic 10 |
@JoelSpeed: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/d5a5f9b0-cf37-11ef-9b28-47182ea2adf8-0 |
Most of the reported failures are from the missing SCC / cluster broken. Would it make sense to temporarily add the |
Based on the aggregate run and current sippy results, I can see that the parts of this feature that do not rely on the There is an outstanding IOU from the virt folks to fix their SCC annotations, but we are not blocking this feature on that. /override ci/prow/verify /lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: JoelSpeed, maiqueb The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@JoelSpeed: Overrode contexts on behalf of JoelSpeed: ci/prow/verify In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
/label acknowledge-critical-fixes-only |
c1a063b
into
openshift:master
/cherry-pick release-4.18 |
@maiqueb: #2058 failed to apply on top of branch "release-4.18":
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
[ART PR BUILD NOTIFIER] Distgit: ose-cluster-config-api |
This PR graduates the
PersistentIPsForVirtualization
feature gate to GA.