Commit

Adding kfto component to odh operator (opendatahub-io#944)
* Adding kfto component to odh operator

Signed-off-by: ted chang <[email protected]>

* update: add KFTO default into sample config + linter + branch

Signed-off-by: Wen Zhou <[email protected]>

---------

Signed-off-by: ted chang <[email protected]>
Signed-off-by: Wen Zhou <[email protected]>
Co-authored-by: Wen Zhou <[email protected]>
tedhtchang and zdtsw committed Apr 23, 2024
1 parent 2f45093 commit 4230f6e
Showing 17 changed files with 313 additions and 38 deletions.
2 changes: 2 additions & 0 deletions README.md
@@ -284,6 +284,8 @@ spec:
managementState: Managed
ray:
managementState: Managed
trainingoperator:
managementState: Managed
workbenches:
managementState: Managed
```
4 changes: 4 additions & 0 deletions apis/datasciencecluster/v1/datasciencecluster_types.go
@@ -32,6 +32,7 @@ import (
"github.com/opendatahub-io/opendatahub-operator/v2/components/kueue"
"github.com/opendatahub-io/opendatahub-operator/v2/components/modelmeshserving"
"github.com/opendatahub-io/opendatahub-operator/v2/components/ray"
"github.com/opendatahub-io/opendatahub-operator/v2/components/trainingoperator"
"github.com/opendatahub-io/opendatahub-operator/v2/components/trustyai"
"github.com/opendatahub-io/opendatahub-operator/v2/components/workbenches"
)
@@ -75,6 +76,9 @@ type Components struct {

// TrustyAI component configuration.
TrustyAI trustyai.TrustyAI `json:"trustyai,omitempty"`

// Training Operator component configuration.
TrainingOperator trainingoperator.TrainingOperator `json:"trainingoperator,omitempty"`
}

// DataScienceClusterStatus defines the observed state of DataScienceCluster.
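
The new `TrainingOperator` field is what lets a `trainingoperator` key in `spec.components` reach the component. The following is a minimal sketch of that decoding path using only `encoding/json`. The `Component` type here is a simplified stand-in that models just `managementState` as a plain string (an assumption; the real `components.Component` is not shown in this diff), and `decodeComponents` is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Component is a simplified stand-in for components.Component.
// Assumption: only the managementState field is modeled, as a plain string.
type Component struct {
	ManagementState string `json:"managementState,omitempty"`
}

// TrainingOperator embeds Component with an empty JSON name, as in the diff,
// so the embedded fields are promoted into the trainingoperator object.
type TrainingOperator struct {
	Component `json:""`
}

// Components is a cut-down version of the struct extended by this commit.
type Components struct {
	TrainingOperator TrainingOperator `json:"trainingoperator,omitempty"`
}

// decodeComponents (hypothetical helper) parses a JSON fragment of spec.components.
func decodeComponents(raw string) (Components, error) {
	var c Components
	err := json.Unmarshal([]byte(raw), &c)
	return c, err
}

func main() {
	c, err := decodeComponents(`{"trainingoperator": {"managementState": "Managed"}}`)
	if err != nil {
		panic(err)
	}
	fmt.Println(c.TrainingOperator.ManagementState)
}
```

Because the embedded struct carries no JSON name, `managementState` sits directly under `trainingoperator` rather than under a nested `component` key.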
1 change: 1 addition & 0 deletions apis/datasciencecluster/v1/zz_generated.deepcopy.go

Generated file; diff not rendered by default.

@@ -420,6 +420,49 @@ spec:
pattern: ^(Managed|Unmanaged|Force|Removed)$
type: string
type: object
trainingoperator:
description: Training Operator component configuration.
properties:
devFlags:
description: Add developer fields
properties:
manifests:
description: List of custom manifests for the given component
items:
properties:
contextDir:
default: ""
description: contextDir is the relative path to
the folder containing manifests in a repository
type: string
sourcePath:
default: ""
description: 'sourcePath is the subpath within contextDir
where kustomize builds start. Examples include
any sub-folder or path: `base`, `overlays/dev`,
`default`, `odh` etc.'
type: string
uri:
default: ""
description: uri is the URI point to a git repo
with tag/branch. e.g. https://github.com/org/repo/tarball/<tag/branch>
type: string
type: object
type: array
type: object
managementState:
description: "Set to one of the following values: \n - \"Managed\"
: the operator is actively managing the component and trying
to keep it active. It will only upgrade the component if
it is safe to do so \n - \"Removed\" : the operator is actively
managing the component and will not install it, or if it
is installed, the operator will try to remove it"
enum:
- Managed
- Removed
pattern: ^(Managed|Unmanaged|Force|Removed)$
type: string
type: object
trustyai:
description: TrustyAI component configuration.
properties:
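
The `managementState` schema above carries two constraints: an `enum` listing only `Managed` and `Removed`, and a `pattern` that also matches `Unmanaged` and `Force`. A rough sketch of how the two combine follows, assuming (as in OpenAPI schema validation generally) that a value must satisfy both. The function and variable names are hypothetical and this is not the API server's validation code.

```go
package main

import (
	"fmt"
	"regexp"
)

// statePattern is the pattern copied from the CRD schema.
var statePattern = regexp.MustCompile(`^(Managed|Unmanaged|Force|Removed)$`)

// stateEnum is the enum copied from the CRD schema.
var stateEnum = []string{"Managed", "Removed"}

// validManagementState reports whether s satisfies both the pattern and the enum.
func validManagementState(s string) bool {
	if !statePattern.MatchString(s) {
		return false
	}
	for _, allowed := range stateEnum {
		if s == allowed {
			return true
		}
	}
	return false
}

func main() {
	for _, s := range []string{"Managed", "Removed", "Unmanaged", "managed"} {
		fmt.Printf("%q valid: %v\n", s, validManagementState(s))
	}
}
```

Under that assumption the enum is the stricter constraint, so `Unmanaged` and `Force` would be rejected for this component even though the pattern accepts them.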
3 changes: 3 additions & 0 deletions bundle/manifests/rhods-operator.clusterserviceversion.yaml
@@ -49,6 +49,9 @@ metadata:
"ray": {
"managementState": "Managed"
},
"trainingoperator": {
"managementState": "Removed"
},
"workbenches": {
"managementState": "Managed"
}
42 changes: 22 additions & 20 deletions components/component.go
@@ -103,26 +103,28 @@ func (c *Component) UpdatePrometheusConfig(_ client.Client, enable bool, compone
Namespace string `yaml:"namespace"`
} `yaml:"metadata"`
Data struct {
- PrometheusYML string `yaml:"prometheus.yml"`
- OperatorRules string `yaml:"operator-recording.rules"`
- DeadManSnitchRules string `yaml:"deadmanssnitch-alerting.rules"`
- CFRRules string `yaml:"codeflare-recording.rules"`
- CRARules string `yaml:"codeflare-alerting.rules"`
- DashboardRRules string `yaml:"rhods-dashboard-recording.rules"`
- DashboardARules string `yaml:"rhods-dashboard-alerting.rules"`
- DSPRRules string `yaml:"data-science-pipelines-operator-recording.rules"`
- DSPARules string `yaml:"data-science-pipelines-operator-alerting.rules"`
- MMRRules string `yaml:"model-mesh-recording.rules"`
- MMARules string `yaml:"model-mesh-alerting.rules"`
- OdhModelRRules string `yaml:"odh-model-controller-recording.rules"`
- OdhModelARules string `yaml:"odh-model-controller-alerting.rules"`
- RayARules string `yaml:"ray-alerting.rules"`
- WorkbenchesRRules string `yaml:"workbenches-recording.rules"`
- WorkbenchesARules string `yaml:"workbenches-alerting.rules"`
- KserveRRules string `yaml:"kserve-recording.rules"`
- KserveARules string `yaml:"kserve-alerting.rules"`
- KueueRRules string `yaml:"kueue-recording.rules"`
- KueueARules string `yaml:"kueue-alerting.rules"`
+ PrometheusYML string `yaml:"prometheus.yml"`
+ OperatorRules string `yaml:"operator-recording.rules"`
+ DeadManSnitchRules string `yaml:"deadmanssnitch-alerting.rules"`
+ CFRRules string `yaml:"codeflare-recording.rules"`
+ CRARules string `yaml:"codeflare-alerting.rules"`
+ DashboardRRules string `yaml:"rhods-dashboard-recording.rules"`
+ DashboardARules string `yaml:"rhods-dashboard-alerting.rules"`
+ DSPRRules string `yaml:"data-science-pipelines-operator-recording.rules"`
+ DSPARules string `yaml:"data-science-pipelines-operator-alerting.rules"`
+ MMRRules string `yaml:"model-mesh-recording.rules"`
+ MMARules string `yaml:"model-mesh-alerting.rules"`
+ OdhModelRRules string `yaml:"odh-model-controller-recording.rules"`
+ OdhModelARules string `yaml:"odh-model-controller-alerting.rules"`
+ RayARules string `yaml:"ray-alerting.rules"`
+ WorkbenchesRRules string `yaml:"workbenches-recording.rules"`
+ WorkbenchesARules string `yaml:"workbenches-alerting.rules"`
+ KserveRRules string `yaml:"kserve-recording.rules"`
+ KserveARules string `yaml:"kserve-alerting.rules"`
+ KueueRRules string `yaml:"kueue-recording.rules"`
+ KueueARules string `yaml:"kueue-alerting.rules"`
+ TrainingOperatorRRules string `yaml:"trainingoperator-recording.rules"`
+ TrainingOperatorARules string `yaml:"trainingoperator-alerting.rules"`
} `yaml:"data"`
}
var configMap ConfigMap
108 changes: 108 additions & 0 deletions components/trainingoperator/trainingoperator.go
@@ -0,0 +1,108 @@
// Package trainingoperator provides utility functions to configure the Training Operator as part of the stack,
// which makes managing distributed compute infrastructure in the cloud easy and intuitive for data scientists.
package trainingoperator

import (
"context"
"fmt"
"path/filepath"

operatorv1 "github.com/openshift/api/operator/v1"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"sigs.k8s.io/controller-runtime/pkg/client"

dsciv1 "github.com/opendatahub-io/opendatahub-operator/v2/apis/dscinitialization/v1"
"github.com/opendatahub-io/opendatahub-operator/v2/components"
"github.com/opendatahub-io/opendatahub-operator/v2/pkg/deploy"
"github.com/opendatahub-io/opendatahub-operator/v2/pkg/monitoring"
)

var (
ComponentName = "trainingoperator"
TrainingOperatorPath = deploy.DefaultManifestPath + "/" + ComponentName + "/rhoai"
)

// Verifies that TrainingOperator implements ComponentInterface.
var _ components.ComponentInterface = (*TrainingOperator)(nil)

// TrainingOperator struct holds the configuration for the TrainingOperator component.
// +kubebuilder:object:generate=true
type TrainingOperator struct {
components.Component `json:""`
}

func (r *TrainingOperator) OverrideManifests(_ string) error {
// If devflags are set, update default manifests path
if len(r.DevFlags.Manifests) != 0 {
manifestConfig := r.DevFlags.Manifests[0]
if err := deploy.DownloadManifests(ComponentName, manifestConfig); err != nil {
return err
}
// If overlay is defined, update paths
defaultKustomizePath := "openshift"
if manifestConfig.SourcePath != "" {
defaultKustomizePath = manifestConfig.SourcePath
}
TrainingOperatorPath = filepath.Join(deploy.DefaultManifestPath, ComponentName, defaultKustomizePath)
}

return nil
}

func (r *TrainingOperator) GetComponentName() string {
return ComponentName
}

func (r *TrainingOperator) ReconcileComponent(ctx context.Context, cli client.Client, owner metav1.Object, dscispec *dsciv1.DSCInitializationSpec, _ bool) error {
var imageParamMap = map[string]string{
"odh-training-operator-controller-image": "RELATED_IMAGE_ODH_TRAINING_OPERATOR_IMAGE",
"namespace": dscispec.ApplicationsNamespace,
}

enabled := r.GetManagementState() == operatorv1.Managed
monitoringEnabled := dscispec.Monitoring.ManagementState == operatorv1.Managed
platform, err := deploy.GetPlatform(cli)
if err != nil {
return err
}

if enabled {
if r.DevFlags != nil {
// Download manifests and update paths
if err = r.OverrideManifests(string(platform)); err != nil {
return err
}
}
if (dscispec.DevFlags == nil || dscispec.DevFlags.ManifestsUri == "") && (r.DevFlags == nil || len(r.DevFlags.Manifests) == 0) {
if err := deploy.ApplyParams(TrainingOperatorPath, imageParamMap, true); err != nil {
return err
}
}
}
// Deploy Training Operator
if err := deploy.DeployManifestsFromPath(cli, owner, TrainingOperatorPath, dscispec.ApplicationsNamespace, ComponentName, enabled); err != nil {
return err
}

// CloudService Monitoring handling
if platform == deploy.ManagedRhods {
if enabled {
// First check that the service is up, so Prometheus won't fire alerts while it is still starting up.
if err := monitoring.WaitForDeploymentAvailable(ctx, cli, ComponentName, dscispec.ApplicationsNamespace, 20, 2); err != nil {
return fmt.Errorf("deployment for %s is not ready to serve: %w", ComponentName, err)
}
fmt.Printf("deployment for %s is done, updating monitoring rules\n", ComponentName)
}
if err := r.UpdatePrometheusConfig(cli, enabled && monitoringEnabled, ComponentName); err != nil {
return err
}
if err = deploy.DeployManifestsFromPath(cli, owner,
filepath.Join(deploy.DefaultManifestPath, "monitoring", "prometheus", "apps"),
dscispec.Monitoring.Namespace,
"prometheus", true); err != nil {
return err
}
}

return nil
}
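
The path handling in `OverrideManifests` can be isolated into a small sketch: when dev-flag manifests are supplied, the kustomize sub-path falls back to `openshift` unless `sourcePath` is set. The value of `defaultManifestPath` below is an assumption (the diff only references `deploy.DefaultManifestPath`), and `kustomizePath` is a hypothetical helper, not a function in the operator.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// defaultManifestPath stands in for deploy.DefaultManifestPath.
// Assumption: the actual value is not shown in this diff.
const defaultManifestPath = "/opt/manifests"

// componentName matches ComponentName in the new package.
const componentName = "trainingoperator"

// kustomizePath (hypothetical helper) returns the directory kustomize would
// build from after dev-flag manifests have been downloaded.
func kustomizePath(sourcePath string) string {
	sub := "openshift" // fallback used by OverrideManifests
	if sourcePath != "" {
		sub = sourcePath
	}
	return filepath.Join(defaultManifestPath, componentName, sub)
}

func main() {
	fmt.Println(kustomizePath(""))             // no sourcePath given
	fmt.Println(kustomizePath("overlays/dev")) // explicit sourcePath
}
```

Note that this fallback applies only when dev flags are present; without them the package-level `TrainingOperatorPath` keeps its initial `/rhoai` suffix.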
40 changes: 40 additions & 0 deletions components/trainingoperator/zz_generated.deepcopy.go

Generated file; diff not rendered by default.

@@ -421,6 +421,49 @@ spec:
pattern: ^(Managed|Unmanaged|Force|Removed)$
type: string
type: object
trainingoperator:
description: Training Operator component configuration.
properties:
devFlags:
description: Add developer fields
properties:
manifests:
description: List of custom manifests for the given component
items:
properties:
contextDir:
default: ""
description: contextDir is the relative path to
the folder containing manifests in a repository
type: string
sourcePath:
default: ""
description: 'sourcePath is the subpath within contextDir
where kustomize builds start. Examples include
any sub-folder or path: `base`, `overlays/dev`,
`default`, `odh` etc.'
type: string
uri:
default: ""
description: uri is the URI point to a git repo
with tag/branch. e.g. https://github.com/org/repo/tarball/<tag/branch>
type: string
type: object
type: array
type: object
managementState:
description: "Set to one of the following values: \n - \"Managed\"
: the operator is actively managing the component and trying
to keep it active. It will only upgrade the component if
it is safe to do so \n - \"Removed\" : the operator is actively
managing the component and will not install it, or if it
is installed, the operator will try to remove it"
enum:
- Managed
- Removed
pattern: ^(Managed|Unmanaged|Force|Removed)$
type: string
type: object
trustyai:
description: TrustyAI component configuration.
properties:
2 changes: 2 additions & 0 deletions config/samples/datasciencecluster_v1_datasciencecluster.yaml
@@ -32,6 +32,8 @@ spec:
managementState: "Managed"
kueue:
managementState: "Managed"
trainingoperator:
managementState: "Removed"
ray:
managementState: "Managed"
workbenches:
10 changes: 7 additions & 3 deletions docs/DESIGN.md
@@ -47,17 +47,21 @@ To deploy ODH components seamlessly, ODH operator will watch two CRDs:
spec:
components:
codeflare:
- managementState: Removed
+ managementState: Managed
dashboard:
managementState: Managed
datasciencepipelines:
managementState: Managed
kserve:
managementState: Managed
modelmeshserving:
- managementState: Removed
+ managementState: Managed
ray:
- managementState: Removed
+ managementState: Managed
+ kueue:
+ managementState: Managed
+ trainingoperator:
+ managementState: Managed
workbenches:
managementState: Managed
```
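
How a `managementState` value for `trainingoperator` turns into actions can be summarized from the branching in `ReconcileComponent`. The sketch below is a simplified model of that control flow, not the operator's code: the `plan` type, its field names, and `reconcilePlan` are hypothetical, and `customManifests` collapses the two dev-flag checks (a DSCI `ManifestsUri` or component `devFlags` manifests) into one boolean.

```go
package main

import "fmt"

// plan lists what one reconcile pass would do. The type and its field names
// are hypothetical; they only summarize the control flow in the diff.
type plan struct {
	deployEnabled    bool // manifests deployed with enabled=true (false means removal)
	applyImageParams bool // image and namespace params written into default manifests
	updateMonitoring bool // Prometheus config and manifests are touched at all
	enableRules      bool // component rules switched on in the Prometheus config
}

// reconcilePlan mirrors the branching in ReconcileComponent.
// managedPlatform stands for platform == deploy.ManagedRhods.
func reconcilePlan(state string, monitoringManaged, managedPlatform, customManifests bool) plan {
	enabled := state == "Managed"
	return plan{
		deployEnabled:    enabled,
		applyImageParams: enabled && !customManifests,
		updateMonitoring: managedPlatform,
		enableRules:      managedPlatform && enabled && monitoringManaged,
	}
}

func main() {
	fmt.Printf("%+v\n", reconcilePlan("Managed", true, true, false))
	fmt.Printf("%+v\n", reconcilePlan("Removed", true, true, false))
}
```

In this model `Removed` still runs the deploy and monitoring steps, but with the enabled flag off, which is how the component and its rules get cleaned up.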
3 changes: 2 additions & 1 deletion docs/api-overview.md
@@ -50,6 +50,7 @@ _Appears in:_
- [Kueue](#kueue)
- [ModelMeshServing](#modelmeshserving)
- [Ray](#ray)
- [TrainingOperator](#trainingoperator)
- [TrustyAI](#trustyai)
- [Workbenches](#workbenches)

@@ -356,7 +357,7 @@ _Appears in:_
| `kueue` _[Kueue](#kueue)_ | Kueue component configuration. | | |
| `codeflare` _[CodeFlare](#codeflare)_ | CodeFlare component configuration.<br />If CodeFlare Operator has been installed in the cluster, it should be uninstalled first before enabled component. | | |
| `ray` _[Ray](#ray)_ | Ray component configuration. | | |
| `trustyai` _[TrustyAI](#trustyai)_ | TrustyAI component configuration. | | |
| `trainingoperator` _[TrainingOperator](#trainingoperator)_ | Training Operator component configuration. | | |


#### ControlPlaneSpec