Commit 416a11f

PUDN, static ips: add operational aspects, and upgrade strategy sections
Signed-off-by: Miguel Duarte Barroso <[email protected]>
1 parent 7ac84db commit 416a11f

1 file changed (+11, -115 lines changed)

enhancements/network/routed-ingress-support-for-primary-udn-attached-vms-with-static-ips.md

Lines changed: 11 additions & 115 deletions
@@ -698,133 +698,29 @@ The deprecation strategy is described in the OVN-Kubernetes
 
 ## Upgrade / Downgrade Strategy
 
-If applicable, how will the component be upgraded and downgraded? Make sure this
-is in the test plan.
-
-Consider the following in developing an upgrade/downgrade strategy for this
-enhancement:
-- What changes (in invocations, configurations, API use, etc.) is an existing
-cluster required to make on upgrade in order to keep previous behavior?
-- What changes (in invocations, configurations, API use, etc.) is an existing
-cluster required to make on upgrade in order to make use of the enhancement?
-
-Upgrade expectations:
-- Each component should remain available for user requests and
-workloads during upgrades. Ensure the components leverage best practices in handling [voluntary
-disruption](https://kubernetes.io/docs/concepts/workloads/pods/disruptions/). Any exception to
-this should be identified and discussed here.
-- Micro version upgrades - users should be able to skip forward versions within a
-minor release stream without being required to pass through intermediate
-versions - i.e. `x.y.N->x.y.N+2` should work without requiring `x.y.N->x.y.N+1`
-as an intermediate step.
-- Minor version upgrades - you only need to support `x.N->x.N+1` upgrade
-steps. So, for example, it is acceptable to require a user running 4.3 to
-upgrade to 4.5 with a `4.3->4.4` step followed by a `4.4->4.5` step.
-- While an upgrade is in progress, new component versions should
-continue to operate correctly in concert with older component
-versions (aka "version skew"). For example, if a node is down, and
-an operator is rolling out a daemonset, the old and new daemonset
-pods must continue to work correctly even while the cluster remains
-in this partially upgraded state for some time.
-
-Downgrade expectations:
-- If an `N->N+1` upgrade fails mid-way through, or if the `N+1` cluster is
-misbehaving, it should be possible for the user to rollback to `N`. It is
-acceptable to require some documented manual steps in order to fully restore
-the downgraded cluster to its previous state. Examples of acceptable steps
-include:
-- Deleting any CVO-managed resources added by the new version. The
-CVO does not currently delete resources that no longer exist in
-the target version.
+N/A
 
 ## Version Skew Strategy
 
 N/A
 
 ## Operational Aspects of API Extensions
 
-Describe the impact of API extensions (mentioned in the proposal section, i.e. CRDs,
-admission and conversion webhooks, aggregated API servers, finalizers) here in detail,
-especially how they impact the OCP system architecture and operational aspects.
-
-- For conversion/admission webhooks and aggregated apiservers: what are the SLIs (Service Level
-Indicators) an administrator or support can use to determine the health of the API extensions
-
-Examples (metrics, alerts, operator conditions)
-- authentication-operator condition `APIServerDegraded=False`
-- authentication-operator condition `APIServerAvailable=True`
-- openshift-authentication/oauth-apiserver deployment and pods health
-
-- What impact do these API extensions have on existing SLIs (e.g. scalability, API throughput,
-API availability)
+The proposed `IPPool` CRD must be provisioned by the admin (or the source
+cluster introspection tool) before the VMs are migrated into OpenShift virt,
+otherwise, they will lose the IP addresses they had on the source cluster.
 
-Examples:
-- Adds 1s to every pod update in the system, slowing down pod scheduling by 5s on average.
-- Fails creation of ConfigMap in the system when the webhook is not available.
-- Adds a dependency on the SDN service network for all resources, risking API availability in case
-of SDN issues.
-- Expected use-cases require less than 1000 instances of the CRD, not impacting
-general API throughput.
+The gateway for the network must be configured in the cluster UDN CR at
+creation time, as any other cluster UDN parameter.
 
-- How is the impact on existing SLIs to be measured and when (e.g. every release by QE, or
-automatically in CI) and by whom (e.g. perf team; name the responsible person and let them review
-this enhancement)
-
-- Describe the possible failure modes of the API extensions.
-- Describe how a failure or behaviour of the extension will impact the overall cluster health
-(e.g. which kube-controller-manager functionality will stop working), especially regarding
-stability, availability, performance and security.
-- Describe which OCP teams are likely to be called upon in case of escalation with one of the failure modes
-and add them as reviewers to this enhancement.
+Hence, some planning and preparation are required from the admin before the
+VM owner starts importing VMs into the OpenShift Virt cluster via MTV.
 
 ## Support Procedures
 
-Describe how to
-- detect the failure modes in a support situation, describe possible symptoms (events, metrics,
-alerts, which log output in which component)
-
-Examples:
-- If the webhook is not running, kube-apiserver logs will show errors like "failed to call admission webhook xyz".
-- Operator X will degrade with message "Failed to launch webhook server" and reason "WehhookServerFailed".
-- The metric `webhook_admission_duration_seconds("openpolicyagent-admission", "mutating", "put", "false")`
-will show >1s latency and alert `WebhookAdmissionLatencyHigh` will fire.
-
-- disable the API extension (e.g. remove MutatingWebhookConfiguration `xyz`, remove APIService `foo`)
-
-- What consequences does it have on the cluster health?
-
-Examples:
-- Garbage collection in kube-controller-manager will stop working.
-- Quota will be wrongly computed.
-- Disabling/removing the CRD is not possible without removing the CR instances. Customer will lose data.
-Disabling the conversion webhook will break garbage collection.
-
-- What consequences does it have on existing, running workloads?
-
-Examples:
-- New namespaces won't get the finalizer "xyz" and hence might leak resource X
-when deleted.
-- SDN pod-to-pod routing will stop updating, potentially breaking pod-to-pod
-communication after some minutes.
-
-- What consequences does it have for newly created workloads?
-
-Examples:
-- New pods in namespace with Istio support will not get sidecars injected, breaking
-their networking.
-
-- Does functionality fail gracefully and will work resume when re-enabled without risking
-consistency?
-
-Examples:
-- The mutating admission webhook "xyz" has FailPolicy=Ignore and hence
-will not block the creation or updates on objects when it fails. When the
-webhook comes back online, there is a controller reconciling all objects, applying
-labels that were not applied during admission webhook downtime.
-- Namespaces deletion will not delete all objects in etcd, leading to zombie
-objects when another namespace with the same name is created.
+TODO
 
 ## Infrastructure Needed [optional]
 
-Use this section if you need things from the project. Examples include a new
-subproject, repos requested, github details, and/or testing infrastructure.
+We'll need a virt-aware lane with CNV (and MTV) installed so we can e2e test
+the features.
