
RHOAI 2.14 Upstream Sync #355

Merged

Conversation

@VaishnaviHire commented Sep 16, 2024

Jira Issues:

Description

How Has This Been Tested?

Screenshot or short clip

Merge criteria

  • You have read the contributors guide.
  • Commit messages are meaningful - have a clear and concise summary and detailed explanation of what was changed and why.
  • Pull Request contains a description of the solution, a link to the JIRA issue, and to any dependent or related Pull Request.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work.

zdtsw and others added 21 commits September 16, 2024 12:57
- regenerate code and manifests
- check in new external CRD

Signed-off-by: Wen Zhou <[email protected]>
(cherry picked from commit 1f34806)
* deps: uplift crd-ref-docs to 0.1.0

- their release process changed and now ships separate os/arch binaries

Signed-off-by: Wen Zhou <[email protected]>

* update: rewrite after review

Signed-off-by: Wen Zhou <[email protected]>

---------

Signed-off-by: Wen Zhou <[email protected]>
(cherry picked from commit 3ce4d58)
Signed-off-by: Wen Zhou <[email protected]>
(cherry picked from commit 742b09b)
…o#1205)

* fix: add check for all platform offerings on components

- wait for the component's deployment to be ready before returning to the caller
- reduce logging
- prolong timeouts for workbench, trustyai, dsp, kserve

Signed-off-by: Wen Zhou <[email protected]>

* fix: wrong log

Signed-off-by: Wen Zhou <[email protected]>

* test: increase timeout and reorder slightly

Signed-off-by: Wen Zhou <[email protected]>

---------

Signed-off-by: Wen Zhou <[email protected]>
Co-authored-by: ajaypratap003 <[email protected]>
(cherry picked from commit ab7e04e)
* test: update e2e testcases

- rename files
- add check on FTer CRD
- use label from labels.ODHAppPrefix
- update misleading log message
- re-order test cases
- rename DSC CR from e2e-test to e2e-test-dsc

Signed-off-by: Wen Zhou <[email protected]>

* fix: missing "/"

Signed-off-by: Wen Zhou <[email protected]>

---------

Signed-off-by: Wen Zhou <[email protected]>
(cherry picked from commit a611427)
* tests: e2e: do not print "Trying again" if deployments cannot be listed

The message is misleading: the function actually returns an error, so it
will not retry.

Signed-off-by: Yauheni Kaliuta <[email protected]>

* tests: e2e: testApplicationCreation: enable immediate for PollUntilContextTimeout

After a recent update the check was moved after the DSC Ready check. There
is no point in waiting an extra retry interval for the deployment check.

This decreases execution time.

Signed-off-by: Yauheni Kaliuta <[email protected]>

* tests: e2e: remove check for dependent operator

The original intention of the check was to return success if no
deployments were found for a component due to a missing operator (probably
with a bug in the actual check).

In any case, the scenario is no longer valid: dependent operators are
installed.

Signed-off-by: Yauheni Kaliuta <[email protected]>

* tests: e2e: testApplicationCreation: early return from the loop

Simplify the deployments readiness check loop with an early return when a
problematic deployment is found, and a successful return as the default.

The len() == 0 check is needed, otherwise it would declare success for
no deployments. It will be amended with an enabled/disabled check in a
followup.

Signed-off-by: Yauheni Kaliuta <[email protected]>

---------

Signed-off-by: Yauheni Kaliuta <[email protected]>
(cherry picked from commit fc005e3)
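The early-return readiness pattern described in the commit above can be sketched with a simplified `deployment` stand-in (the real test uses the Kubernetes apps/v1 types and client; this is an illustrative, stdlib-only sketch):

```go
package main

import (
	"errors"
	"fmt"
)

// deployment is a minimal stand-in for the appsv1.Deployment status fields
// the readiness check cares about.
type deployment struct {
	name          string
	readyReplicas int32
	replicas      int32
}

// checkDeploymentsReady returns an error for the first problematic
// deployment found and succeeds by default. The len() == 0 guard is
// needed, otherwise an empty list would be declared a success.
func checkDeploymentsReady(deployments []deployment) error {
	if len(deployments) == 0 {
		return errors.New("no deployments found")
	}
	for _, d := range deployments {
		if d.readyReplicas != d.replicas {
			return fmt.Errorf("deployment %s not ready: %d/%d replicas",
				d.name, d.readyReplicas, d.replicas)
		}
	}
	return nil // success is the default path
}

func main() {
	fmt.Println(checkDeploymentsReady([]deployment{{"dashboard", 2, 2}})) // <nil>
	fmt.Println(checkDeploymentsReady(nil))
	fmt.Println(checkDeploymentsReady([]deployment{{"dashboard", 0, 2}}))
}
```

The early return keeps the loop flat: each problem case exits immediately, and the happy path falls through to the final `return nil`.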
…l and network impact (opendatahub-io#1190)

* feat: improve Operator performance by using caching to reduce API calls and network impact
(work based on opendatahub-io#1189)
- secret: default application namespace + other default ones + istio cert
- configmap: all
- namespace: all
- ingresscontroller: the "default" one
- deployment: default application namespaces + default namespaces

Signed-off-by: Wen Zhou <[email protected]>

* revert: back out changes for webhook

Signed-off-by: Wen Zhou <[email protected]>

* update: move the namespaces for the cache into functions and only add the ones the platform needs

Signed-off-by: Wen Zhou <[email protected]>

* update: rename functions

Signed-off-by: Wen Zhou <[email protected]>

---------

Signed-off-by: Wen Zhou <[email protected]>
(cherry picked from commit 5759f5e)
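The cache scoping described in that commit amounts to a per-resource namespace selector. A stdlib-only sketch of the idea (the real implementation configures the controller-runtime cache; the kind names are from the commit message, the namespace names are illustrative placeholders):

```go
package main

import "fmt"

// cacheScope records which namespaces are cached per resource kind.
// An empty slice means cluster-wide, matching the commit description:
// configmaps and namespaces cluster-wide, secrets and deployments only
// in selected namespaces. Namespace values here are illustrative.
var cacheScope = map[string][]string{
	"ConfigMap":  {}, // all namespaces
	"Namespace":  {}, // all
	"Secret":     {"opendatahub", "istio-system"},
	"Deployment": {"opendatahub", "redhat-ods-applications"},
}

// cached reports whether objects of the given kind in ns would be
// served from the cache under this scoping; uncached kinds and
// out-of-scope namespaces fall through to direct API calls.
func cached(kind, ns string) bool {
	scope, ok := cacheScope[kind]
	if !ok {
		return false
	}
	if len(scope) == 0 {
		return true // cluster-wide cache for this kind
	}
	for _, n := range scope {
		if n == ns {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(cached("ConfigMap", "anywhere")) // true
	fmt.Println(cached("Secret", "user-ns"))     // false
}
```

Narrowing the cache this way trades a little lookup logic for fewer watches and less memory, which is the performance gain the commit is after.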
…atahub-io#1213)

* tests: e2e: do not wait for deployments of not-Managed components

The test retries if there are no deployments even when the component is
disabled, although that is a normal situation.

Make it retry for a disabled (explicitly Removed) component only if
there are deployments; if it is not Managed, return success
instantly.

This also means that the "disabled" check is inverted: it is no longer an
error.

It decreases execution time.

Signed-off-by: Yauheni Kaliuta <[email protected]>

* tests: e2e: increase resourceCreationTimeout

After some recent changes, it looks like it takes longer for the
DataScienceCluster CR to reach Ready status.

Signed-off-by: Yauheni Kaliuta <[email protected]>

---------

Signed-off-by: Yauheni Kaliuta <[email protected]>
(cherry picked from commit d57bb12)
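The inverted "disabled" check above reduces to a pure decision function. A simplified sketch (the real test uses `operatorv1.ManagementState` and the apps/v1 client; this stand-in only captures the retry logic):

```go
package main

import "fmt"

type managementState string

const (
	managed managementState = "Managed"
	removed managementState = "Removed"
)

// shouldRetry implements the described logic: a component that is not
// Managed succeeds instantly unless old deployments still linger (then
// retry until they are gone); a Managed component retries until its
// deployments exist and are all ready.
func shouldRetry(state managementState, deploymentCount int, allReady bool) bool {
	if state != managed {
		// disabled component: only retry while old deployments remain
		return deploymentCount > 0
	}
	return deploymentCount == 0 || !allReady
}

func main() {
	fmt.Println(shouldRetry(removed, 0, false)) // false: instant success for a Removed component
	fmt.Println(shouldRetry(managed, 0, false)) // true: keep waiting for deployments
}
```

Returning success immediately for the Removed case is what removes the pointless polling and shortens the e2e run.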
…#1218)

- trustyai has introduced overlays for odh and rhoai
- we can switch to using these instead of base

Signed-off-by: Wen Zhou <[email protected]>
(cherry picked from commit a990dea)
- move resourceCreationTimeout and resourceRetryInterval out of tc
- rename resourceRetryInterval to generalRetryInterval
- create different timeouts instead of using resourceCreationTimeout for all cases
- move the DSC ready check out of testDSCCreation into validateDSCReady, which calls waitDSCReady
- remove the 1 min sleep in testAllApplicationCreation
- rename testAllApplicationCreation to testAllComponentCreation
- rename testApplicationCreation to testComponentCreation
- do not run the check func immediately in testComponentCreation
- rename testApplicationDeletion to testComponentDeletion
- run the check func immediately in testComponentDeletion
- rename waitForControllerDeployment to waitForOperatorDeployment
- set new timeouts, see the const block
- make the check on CRDs run in parallel

Signed-off-by: Wen Zhou <[email protected]>
(cherry picked from commit 199f08f)
… to lose the cache for istio-system ns (opendatahub-io#1223)

Signed-off-by: Wen Zhou <[email protected]>
(cherry picked from commit c49b501)
…opendatahub-io#1224)

- if the DSC CR has the old format with an empty devFlags: {}, it can cause the downstream image not to be replaced, running with the quay.io one instead

Signed-off-by: Wen Zhou <[email protected]>
(cherry picked from commit 1dbe269)
* feat: enable Trusty's Bias Feature Flag

- for the upgrade case: if the flag was previously true, it should be set to false after the new version is installed

Signed-off-by: Wen Zhou <[email protected]>

* - rename cluster.GetRelease to cluster.GetReleaseFromCSV()
- introduce upgrade.GetReleaseFromCR()
- create upgradeODCCR() to call unsetOwnerReference and updateODCBiasMetrics
- updateODCBiasMetrics() only patches if the previous release is lower than 2.14, e.g. 2.13.1 or 2.8.3

Signed-off-by: Wen Zhou <[email protected]>

---------

Signed-off-by: Wen Zhou <[email protected]>
(cherry picked from commit 562efd1)
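The "only patch if the previously deployed release is lower than 2.14" guard boils down to a version comparison. A stdlib sketch of that check (the operator itself would typically use a semver library; this is illustrative):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// lowerThan reports whether version a is strictly lower than b,
// comparing dot-separated numeric fields, so "2.13.1" < "2.14" and
// "2.8.3" < "2.14" as in the commit message. Missing fields count as 0.
func lowerThan(a, b string) bool {
	as, bs := strings.Split(a, "."), strings.Split(b, ".")
	for i := 0; i < len(as) || i < len(bs); i++ {
		av, bv := 0, 0
		if i < len(as) {
			av, _ = strconv.Atoi(as[i])
		}
		if i < len(bs) {
			bv, _ = strconv.Atoi(bs[i])
		}
		if av != bv {
			return av < bv
		}
	}
	return false
}

func main() {
	// Patch bias metrics only when upgrading from a pre-2.14 release.
	for _, prev := range []string{"2.13.1", "2.8.3", "2.14.0"} {
		fmt.Println(prev, lowerThan(prev, "2.14"))
	}
}
```

Note the field-by-field numeric comparison: a plain string comparison would wrongly rank "2.8.3" above "2.14".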
…pendatahub-io#1228)

* feat(tracker): feature tracker can have ownerReferences

Introduces the ability to define an owner reference for FeatureTracker (an
internal custom resource). This enables garbage collection of resources
belonging to a given Feature when its owner, such as DSC or DSCI, has been
removed.

When Features are grouped through `FeatureHandler`s, the
ownership is defined under the hood, linking them to DSCI or DSC,
depending on the type used.

In addition, this PR simplifies how Feature relies on a `client.Client`
instance. Instead of creating its own instance, it expects one to be
passed as part of the builder call chain, and creates a default one otherwise.
With that we can rely on a client configured for controllers, with the
right schemas, caching rules, etc. As a consequence, the retry
logic for status updates needed to be adjusted. Besides a conflict, it needs
to be triggered on a NotFound error, as the shared client's cache might not
yet have an instance of a FeatureTracker created just before the condition
is reported. It was working before because Feature
had its own simple client instance without caching.

> [!IMPORTANT]
>
> With DSCI and DSC becoming owners of particular FeatureTrackers
> `func (c *Component) Cleanup`, called in the finalizer block, becomes
> somewhat speculative. The only use case where a custom function is
> invoked is unpatching the SMCP authorization provider. This was based
> on an early assumption that we might want to plug into a customer's
> existing SMCP instance. It's unclear to me if this is still the long-term
> thinking, so we might want to revisit the need for this function.
>
> From the completeness point of view, if we allow creating/manipulating
> resources programmatically as part of the Feature DSL, we should be able
> to register cleanup counterpart functions, especially if we cannot simply
> remove them (as this is handled through ownership of the associated
> FeatureTracker).
>
> There is, however, a need to perform cleanup when a particular
> component is "Removed" (no longer managed). Right now this is
> handled inside its Reconcile function.

Fixes [RHOAIENG-8629](https://issues.redhat.com/browse/RHOAIENG-8629)

Relates to opendatahub-io#1192 (comment)

Co-authored-by: Bartosz Majsak <[email protected]>

* fix: no need to return dsci

---------

Co-authored-by: Cameron Garrison <[email protected]>
(cherry picked from commit 10dd554)
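The garbage collection this PR enables relies on standard Kubernetes ownerReferences. A minimal stand-in sketch of the linkage (the real code sets `metav1.OwnerReference` on the FeatureTracker; types and names here are illustrative placeholders):

```go
package main

import "fmt"

// ownerReference mirrors the metav1.OwnerReference fields that matter
// for garbage collection.
type ownerReference struct {
	apiVersion, kind, name, uid string
}

// featureTracker is a stand-in for the internal FeatureTracker CR.
type featureTracker struct {
	name            string
	ownerReferences []ownerReference
}

// setOwner links the tracker to its DSC/DSCI owner. Once set, deleting
// the owner lets Kubernetes garbage-collect the tracker and,
// transitively, the resources the tracker itself owns.
func setOwner(ft *featureTracker, owner ownerReference) {
	ft.ownerReferences = append(ft.ownerReferences, owner)
}

func main() {
	ft := &featureTracker{name: "dsci-servicemesh"} // name illustrative
	setOwner(ft, ownerReference{
		apiVersion: "dscinitialization.opendatahub.io/v1",
		kind:       "DSCInitialization",
		name:       "default-dsci",
		uid:        "1234", // placeholder; real refs carry the owner's UID
	})
	fmt.Println(len(ft.ownerReferences)) // 1
}
```

This is why the custom `Cleanup` hook becomes speculative: once ownership is declared, deletion cascades without bespoke teardown code.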
…ub-io#1232)

* upgrade: rename GetReleaseFromCR to GetDeployedRelease

The name GetReleaseFromCR is more about the implementation than the task it
performs. The function fetches the release from the DSC/DSCI, meaning the
latest release the data science cluster was deployed with. Consider it the
"deployed release".

Signed-off-by: Yauheni Kaliuta <[email protected]>

* cluster: rename GetReleaseFromCSV back to GetRelease

The name GetReleaseFromCSV is more about the implementation than the task it
performs. The function fetches the actual running operator release, and the
CSV is only one way to obtain it.

Signed-off-by: Yauheni Kaliuta <[email protected]>

---------

Signed-off-by: Yauheni Kaliuta <[email protected]>
(cherry picked from commit b4c880a)
…1222)

- we do not support other values for now; this prevents a user from changing it to a different value, which would cause both old and new instances to run

Signed-off-by: Wen Zhou <[email protected]>
(cherry picked from commit ff7d131)
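The immutability guard described above is a compare-old-vs-new validation. A stdlib sketch of the core check (the operator enforces this in its validating webhook; the function name is illustrative):

```go
package main

import (
	"errors"
	"fmt"
)

// validateAppNamespaceUnchanged rejects updates that change the
// application namespace, preventing old and new component instances
// from running side by side in different namespaces.
func validateAppNamespaceUnchanged(oldNS, newNS string) error {
	if oldNS != newNS {
		return errors.New("applicationsNamespace is immutable")
	}
	return nil
}

func main() {
	fmt.Println(validateAppNamespaceUnchanged("opendatahub", "opendatahub")) // <nil>
	fmt.Println(validateAppNamespaceUnchanged("opendatahub", "other-ns"))
}
```

In a real admission webhook the old and new values come from the update request's old and new objects; the comparison itself is this simple.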
Passing name and namespace is redundant, as they are always the same as in
the newSecret passed to this func.

This change simplifies the function signature and uses the
`client.ObjectKeyFromObject` utility function to fetch the existing secret
instance from the cluster.

In addition, error variables have been renamed to avoid shadowing.

(cherry picked from commit fc789a9)
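The simplification amounts to deriving the lookup key from the object itself instead of threading name/namespace through as extra parameters. A stand-in sketch (the real code uses controller-runtime's `client.ObjectKeyFromObject` and `client.Client`; the types here are illustrative):

```go
package main

import "fmt"

type objectKey struct{ namespace, name string }

// secret is a minimal stand-in for corev1.Secret.
type secret struct {
	namespace, name string
	data            map[string][]byte
}

// objectKeyFromObject mirrors client.ObjectKeyFromObject: the key is
// always the object's own namespace/name, which is why passing them
// separately was redundant.
func objectKeyFromObject(s *secret) objectKey {
	return objectKey{namespace: s.namespace, name: s.name}
}

// updateSecret fetches the existing secret by the key derived from
// newSecret, rather than taking name and namespace as extra arguments.
func updateSecret(store map[objectKey]*secret, newSecret *secret) error {
	key := objectKeyFromObject(newSecret)
	existing, ok := store[key]
	if !ok {
		return fmt.Errorf("secret %s/%s not found", key.namespace, key.name)
	}
	existing.data = newSecret.data
	return nil
}

func main() {
	store := map[objectKey]*secret{}
	s := &secret{namespace: "opendatahub", name: "tls-cert"}
	store[objectKeyFromObject(s)] = s
	fmt.Println(updateSecret(store, &secret{namespace: "opendatahub", name: "tls-cert"})) // <nil>
}
```

Deriving the key from the object removes a class of bugs where the passed name/namespace could drift from the object they were supposed to describe.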
…1235)

Ensures each test uses a different DSCI instance as a test fixture.
Without this, tests fail (as they do in the latest incubation commit):
opendatahub-io#1222 introduced a validation check to ensure AppNamespace is
immutable, but opendatahub-io#1228 tries to update the DSCI, violating the new constraint.

With this change each test uses a dedicated DSCI.

(cherry picked from commit df1add0)
Including the client in the function signature makes its usage
explicit and more idiomatic. This approach clarifies that the
client is intended for use within the function. Decoupling the
client from the Feature struct enhances code readability.

It's a follow-up to PR
[opendatahub-io#1228](opendatahub-io#1228 (comment)).

(cherry picked from commit fde5d69)
@zdtsw commented Sep 17, 2024

thought https://issues.redhat.com/browse/RHOAIENG-349 is already on main from #337

@VaishnaviHire (Author)

@ykaliuta Any idea on why the webhook tests might be failing?

@VaishnaviHire (Author)

thought https://issues.redhat.com/browse/RHOAIENG-349 is already on main from #337

It was linked in the PR that I cherry-picked, opendatahub-io#1205. These changes do not seem to be included.

@ykaliuta

@ykaliuta Any idea on why the webhook tests might be failing?

Hmm, interesting, the test should use the repository's manifests, where it is enabled.

@ykaliuta

@ykaliuta Any idea on why the webhook tests might be failing?

Hmm, interesting, the test should use the repository's manifests, where it is enabled.

It's actually failing on getting DSCI/DSC. Wondering if there is a timing issue, since it's fetched by Kind.

@ykaliuta

It's actually failing on getting DSCI/DSC. Wondering if there is a timing issue, since it's fetched by Kind.

ok, requireInstalled() has some unexpected changes.

@@ -386,7 +383,7 @@ func (tc *testContext) testOwnerrefrences() error {
 // Test Dashboard component
 if tc.testDsc.Spec.Components.Dashboard.ManagementState == operatorv1.Managed {
 appDeployments, err := tc.kubeClient.AppsV1().Deployments(tc.applicationsNamespace).List(tc.ctx, metav1.ListOptions{
-LabelSelector: labels.ODH.Component("rhods-dashboard"),
+LabelSelector: labels.ODH.Component(tc.testDsc.Spec.Components.Dashboard.GetComponentName()),
we need to keep the old one

@@ -442,7 +439,8 @@ func (tc *testContext) testDefaultCertsAvailable() error {
 func (tc *testContext) testUpdateComponentReconcile() error {
 // Test Updating Dashboard Replicas
 appDeployments, err := tc.kubeClient.AppsV1().Deployments(tc.applicationsNamespace).List(tc.ctx, metav1.ListOptions{
-LabelSelector: labels.ODH.Component("rhods-dashboard"),
+LabelSelector: labels.ODH.Component(tc.testDsc.Spec.Components.Dashboard.GetComponentName()),
same here need to keep the old one

@@ -493,7 +491,8 @@ func (tc *testContext) testUpdateDSCComponentEnabled() error {

 if tc.testDsc.Spec.Components.Dashboard.ManagementState == operatorv1.Managed {
 appDeployments, err := tc.kubeClient.AppsV1().Deployments(tc.applicationsNamespace).List(tc.ctx, metav1.ListOptions{
-LabelSelector: labels.ODH.Component("rhods-dashboard"),
+LabelSelector: labels.ODH.Component(tc.testDsc.Spec.Components.Dashboard.GetComponentName()),
and here too

docs/DESIGN.md (Outdated)
@@ -62,6 +62,8 @@ To deploy ODH components seamlessly, ODH operator will watch two CRDs:
 name: knative-serving
 modelmeshserving:
 managementState: Managed
+modelregistry:
can skip this one in this PR

@@ -141,10 +141,10 @@ spec:
 kueue:
 managementState: Managed
 trainingoperator:
-managementState: Removed
+managementState: Managed
can skip change in this file

 workbenches:
 managementState: Managed
 trustyai:
-managementState: Removed
+managementState: Managed
same here

@VaishnaviHire (Author)

Thanks @zdtsw @ykaliuta, the tests have passed now.


openshift-ci bot commented Sep 18, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: VaishnaviHire, zdtsw

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [VaishnaviHire,zdtsw]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit 78d42fd into red-hat-data-services:main Sep 18, 2024
10 checks passed
@VaishnaviHire (Author)

We also need to update the manifests in midstream.

@zdtsw commented Sep 18, 2024

We also need to update the manifests in midstream.

I will fix this today.


update: done in PR 2750
