Merge pull request #616 from jasonnovichRunAI/v2.16-RUN-14041-Workloa…
…ds-rewrite

V2.16-RUN-14041-Workloads-rewrite
jasonnovichRunAI authored Jan 2, 2024
2 parents 3e4628b + 6c197ae commit 23ec975
Showing 2 changed files with 223 additions and 0 deletions.
125 changes: 125 additions & 0 deletions docs/admin/workloads/README.md
@@ -0,0 +1,125 @@
---
title: Workloads Overview
summary: This article describes the Workloads page.
authors:
- Jason Novich
date: 2023-Dec-26
---

## Overview

Run:ai *Workloads* is specifically designed and optimized for AI and data science workloads, enhancing Kubernetes management of containerized applications. Run:ai augments Kubernetes workloads with additional resources crucial for AI pipelines (for example, compute resources, network, and storage).

Run:ai is an open platform that supports three types of workloads, each with a different set of features:

1. Run:ai native workloads (trainings, workspaces, deployments)
    1. Submit via UI/CLI
    2. Actions - Delete/stop/connect
    3. Policies (defaults and enforcing rules)
    4. Scheduling rules
    5. RBAC
2. Framework integrations
    1. Smart gang scheduling (workload aware)
    2. Specific workload aware visibility (GPU Utilization, workload view, dashboards)
3. Typical Kubernetes workloads
    1. Scheduling capabilities:
        1. Fairness
        2. Nodepools
        3. Bin packing/spread
    2. Core capabilities:
        1. Fractions
        2. Overprovisioning

To enable the *Workloads* view, press *Jobs* and then press *Try Workloads*. To return to the *Jobs* view, press *Go Back To Jobs View*.

## Workloads Monitoring

Run:ai makes it easy to run machine learning workloads effectively on Kubernetes. Run:ai provides both a UI and an API that offer a simpler, more efficient way for data scientists and engineers to manage machine learning workloads. The new UI is not just a cosmetic change; it is the gateway to several enhancements that improve the workload management experience.

### API Documentation

Access the platform [API documentation](https://app.run.ai/api/docs){target=_blank} for more information on using the API to manage workloads.

## Workloads View

The Workloads view provides a more advanced UI than the previous Jobs UI. In the new table format, you can:

* Change the layout of the *Workloads* table by pressing *Columns* to add or remove columns.
* Download the table to a CSV file by pressing *More*, then pressing *Download as CSV*.
* Search for a workload by pressing *Search* and entering the name of the workload.
* Perform advanced workload management.
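
Once downloaded, the CSV file can be processed with ordinary tooling. The following is a minimal Python sketch that counts workloads per status. The column names `Workload` and `Status` and the sample rows are assumptions made for illustration, so check them against the header row of your own export.

```python
import csv
import io
from collections import Counter


def count_by_status(csv_text: str, status_column: str = "Status") -> Counter:
    """Count rows per value of the status column in an exported table."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[status_column] for row in reader)


# Hypothetical export contents; a real file may use different column names.
sample = """Workload,Status
train-a,Running
train-b,Pending
workspace-c,Running
"""

counts = count_by_status(sample)
for status, number in sorted(counts.items()):
    print(f"{status}: {number}")
```

To use it on a real export, read the file contents with `open(path, newline="")` and pass the text to `count_by_status`, adjusting `status_column` if the header differs.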

To create new workloads, press [*New Workload*](submitting-workloads.md).

## Managing Workloads

You can manage a workload by selecting one from the table. Once selected, you can:

* Delete a workload
* Connect to a workload
* Stop a workload
* Activate a workload
* Show details, which provides in-depth information about the selected workload, including:

    * Event history: workload status over time. Use the filter to search through the history for specific events.
    * Metrics: metrics for GPU utilization, CPU usage, GPU memory usage, and CPU memory usage. Use the date selector to choose the time period for the metrics.
    * Logs: logs of the current status. Use the Download button to download the logs.

### Workloads Status

The *Status* column shows the current status of the workload. The following table describes the statuses presented:

| **Phase Name** | **Description** | **Entry Condition** | **Exit Condition** |
| --- | --- | --- | --- |
| **Creating** | Workload setup is being initiated in the cluster. Resources and pods are being provisioned. | A workload is submitted. | A pod group is created (handling of multi-pod groups TBD). |
| **Pending** | Workload is queued and awaiting resource allocation. | A pod group exists. | All pods are scheduled (handling of multi-pod groups TBD). |
| **Initializing** | Workload is setting up: retrieving images, starting containers, and preparing pods. | All pods are scheduled (handling of multi-pod groups TBD). | All pods are initialized or a failure to initialize is detected. |
| **Running** | Workload is currently in progress with all pods operational. | All pods initialized (all containers in pods are ready). | Job completion or failure. |
| **Degraded** | The workload is underperforming: pods may not align with specifications, network services might be incomplete, or persistent volumes could be detached. Refer to logs for detailed information. | Pending: All pods are running but with issues. Running: All pods are running with no issues. | Running: All resources are OK. Completed: Job finished with fewer resources. Failed: Job failure or user-defined rules. |
| **Deleting** | Workload and its associated resources are being decommissioned from the cluster. | Decision made to delete the workload. | Resources are fully deleted. |
| **Stopped** | Workload is on hold; resources are intact but inactive. | An operational decision is made to stop the workload without deleting resources. | Transitioning back to the initializing phase or to deletion. |
| **Failed** | Workload encountered errors: image retrieval failed or containers experienced a crash. Consult logs for specifics. | An error occurs preventing the successful completion of the job. | Terminal state. |
| **Completed** | Workload has successfully finished its execution. | The job has finished processing without errors. | Terminal state. |
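
The entry and exit conditions in the table can be read as a small state machine. The sketch below is one possible Python encoding, not part of the Run:ai API: the transition map is an interpretation of the table, the move from Initializing to Failed is an assumption (the table only says that a failure to initialize is detected), and transitions into Stopped and Deleting are left out because the table does not say which statuses they can be entered from.

```python
from enum import Enum


class Status(Enum):
    CREATING = "Creating"
    PENDING = "Pending"
    INITIALIZING = "Initializing"
    RUNNING = "Running"
    DEGRADED = "Degraded"
    DELETING = "Deleting"
    STOPPED = "Stopped"
    FAILED = "Failed"
    COMPLETED = "Completed"


# Transitions read off the table's entry and exit conditions (an interpretation).
TRANSITIONS = {
    Status.CREATING: {Status.PENDING},
    # Pending -> Degraded is listed in the Degraded row's entry condition.
    Status.PENDING: {Status.INITIALIZING, Status.DEGRADED},
    # Assumption: a detected initialization failure leads to Failed.
    Status.INITIALIZING: {Status.RUNNING, Status.FAILED},
    Status.RUNNING: {Status.COMPLETED, Status.FAILED, Status.DEGRADED},
    Status.DEGRADED: {Status.RUNNING, Status.COMPLETED, Status.FAILED},
    Status.STOPPED: {Status.INITIALIZING, Status.DELETING},
    Status.DELETING: set(),  # resources are removed; nothing follows
    Status.FAILED: set(),
    Status.COMPLETED: set(),
}

# The table marks only these two statuses as terminal states.
TERMINAL = {Status.FAILED, Status.COMPLETED}


def can_transition(current: Status, target: Status) -> bool:
    """Return True if the table allows moving from current to target."""
    return target in TRANSITIONS.get(current, set())


def is_terminal(status: Status) -> bool:
    """Return True for statuses the table labels as terminal."""
    return status in TERMINAL


print(can_transition(Status.RUNNING, Status.COMPLETED))
print(can_transition(Status.COMPLETED, Status.RUNNING))
```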

### Successful flows

A successful workload moves through the following statuses:

```mermaid
flowchart LR
A(Creating) --> B(Pending)
B-->C(Initializing)
C-->D(Running)
D-->E(Completed)
```
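
As an illustration, a status history gathered by a monitoring script could be checked against this flow. The sketch below is a hypothetical helper, not a Run:ai function. It assumes the history is a list of status names in the order they were observed, and it tolerates consecutive repeats such as those produced by polling.

```python
from itertools import groupby
from typing import Sequence

SUCCESSFUL_FLOW = ["Creating", "Pending", "Initializing", "Running", "Completed"]


def is_on_successful_flow(history: Sequence[str]) -> bool:
    """Return True if history is a prefix of the successful flow.

    Consecutive duplicates are collapsed first, so a workload observed as
    Pending several times in a row still counts as being on the flow.
    """
    collapsed = [status for status, _ in groupby(history)]
    return collapsed == SUCCESSFUL_FLOW[: len(collapsed)]


print(is_on_successful_flow(["Creating", "Pending", "Pending", "Initializing"]))
print(is_on_successful_flow(["Creating", "Running"]))
```

A history that skips a status or visits one outside the chart, such as Degraded or Failed, is reported as off the successful flow.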

#### Single pod

#### Distributed

### Error flows

## Run:ai Native Workloads

To get the full experience of Run:ai’s environment and platform, use the following types of workloads:

* [Workspaces](../../Researcher/user-interface/workspaces/overview.md#getting-familiar-with-workspaces)
* [Trainings](../../Researcher/user-interface/trainings.md#trainings) (Only available when using the *Jobs* view)
* [Distributed trainings](../../Researcher/user-interface/trainings.md#trainings)
* [Deployment](../admin-ui-setup/deployments.md#viewing-and-submitting-deployments)

## Supported integrations

To work with other platforms and other types of workloads, use the integrations listed below.

1. [Airflow](https://docs.run.ai/v2.13/admin/integration/airflow/)
2. [MLflow](https://docs.run.ai/v2.13/admin/integration/mlflow/)
3. [Kubeflow](https://docs.run.ai/v2.13/admin/integration/kubeflow/)
4. [Seldon Core](https://docs.run.ai/v2.13/admin/integration/seldon/)
5. [Spark](https://docs.run.ai/v2.13/admin/integration/spark/)
6. [Ray](https://docs.run.ai/v2.13/admin/integration/ray/)
7. [KubeVirt (VM)](https://docs.run.ai/v2.13/admin/integration/kubevirt/)

## Standard Kubernetes Workloads

You can still enjoy the Run:ai platform when you submit standard Kubernetes workloads. Feel free to download or build your own.
98 changes: 98 additions & 0 deletions docs/admin/workloads/submitting-workloads.md
@@ -0,0 +1,98 @@
---
title: Submitting Workloads
summary: This article describes how to submit a workload using the workloads V2 form.
authors:
- Jason Novich
date: 2023-Dec-26
---

## How to Submit a Workload

To submit a workload using the UI:

1. In the left menu press *Workloads*.
2. Press *New Workload*, and select *Workspace* or *Training*.

=== "Workspace"

    1. In the *Projects* pane, select a project. Use the search box to find projects that are not listed. If you can't find the project, see your system administrator.
    2. In the *Templates* pane, select a template from the list. Use the search box to find templates that are not listed. If you can't find the specific template you need, create a new one, or see your system administrator.
    3. Enter a `Workspace` name, and press *Continue*.
    4. In the *Environment* pane, select or [create a new environment](../../Researcher/user-interface/workspaces/create/create-env.md). Use the search box to find environments that are not listed.
    5. In the *Compute resource* pane, select resources for your workspace or [create a new compute resource](../../Researcher/user-interface/workspaces/create/create-compute.md). Use the search box to find resources that are not listed. Press *More settings* to use **Node Affinity** to limit the resources to a specific node.
    6. Open the *Volume* pane, and press *Volume* to add a volume to your workspace.

        1. Select the *Storage class* from the dropdown.
        2. Select the *Access mode* from the dropdown.
        3. Enter a claim size, and select the units.
        4. Select a *Volume system* mode from the dropdown.
        5. Enter the *Container path* for the volume target location.
        6. Select a *Volume persistency*.

    7. In the *Data sources* pane, press *add a new data source*. For more information, see [Creating a new data source](../../Researcher/user-interface/workspaces/create/create-ds.md). When complete, press *Create Data Source*.
    8. (Optional) In the *General* pane, add special settings for your workspace:

        1. Press *Auto-deletion* to delete the workspace automatically when it either completes or fails. You can configure the timeframe in days, hours, minutes, and seconds. If the timeframe is set to 0, the workspace will be deleted immediately after it completes or fails.
        2. Press *Annotation* to add a name and value to annotate the workspace. Repeat this step to add multiple annotations.
        3. Press *Label* to add a name and value to label the workspace. Repeat this step to add multiple labels.

    9. When complete, press *Create workspace*.

=== "Training"

    1. In the *Projects* pane, select the destination project. Use the search box to find projects that are not listed. If you can't find the project, you can create your own, or see your system administrator.
    2. In the *Multi-node* pane, choose `Single node` for single-node training, or `Multi-node (distributed)` for distributed training. When you choose `Multi-node`, select a framework that is listed, then select the `multi-node` training configuration by selecting either `Workers & master` or `Workers only`.
    3. In the *Templates* pane, select a template from the list. Use the search box to find templates that are not listed. If you can't find the specific template you need, see your system administrator.
    4. In the *Training name* pane, enter a name for the *Training*, then press *Continue*.
    5. In the *Environment* pane, select or [create a new environment](../../Researcher/user-interface/workspaces/create/create-env.md). Use the search box to find environments that are not listed. Press *More settings* to add an `Environment variable` or to edit the *Command* and *Arguments* fields for the environment you selected.
    6. In the *Compute resource* pane:

        1. Select the number of workers for your training.
        2. Select *Compute resources* for your training or [create a new compute resource](../../Researcher/user-interface/workspaces/create/create-compute.md). Use the search box to find resources that are not listed. Press *More settings* to use **Node Affinity** to limit the resources to a specific node.

        !!! Note
            The number of compute resources for the workers is based on the number of workers selected.

    7. (Optional) Open the *Volume* pane, and press *Volume* to add a volume to your training.

        1. Select the *Storage class* from the dropdown.
        2. Select the *Access mode* from the dropdown.
        3. Enter a claim size, and select the units.
        4. Select a *Volume system* mode from the dropdown.
        5. Enter the *Container path* for the volume target location.
        6. Select a *Volume persistency*.

    8. (Optional) In the *Data sources* pane, press *add a new data source*. For more information, see [Creating a new data source](../../Researcher/user-interface/workspaces/create/create-ds.md). When complete, press *Create Data Source*.
    9. (Optional) In the *General* pane, add special settings for your training:

        1. Press *Auto-deletion* to delete the training automatically when it either completes or fails. You can configure the timeframe in days, hours, minutes, and seconds. If the timeframe is set to 0, the training will be deleted immediately after it completes or fails.
        2. Press *Annotation* to add a name and value to annotate the training. Repeat this step to add multiple annotations.
        3. Press *Label* to add a name and value to label the training. Repeat this step to add multiple labels.

    10. If you selected `Workers & master`, press *Continue* to `Configure the master` and go to the next step. If not, press *Create training*.

    11. If you do not want a different setup for the master, press *Create training*. If you would like a different setup for the master, toggle the switch to enable it.

        1. In the *Environment* pane, select or [create a new environment](../../Researcher/user-interface/workspaces/create/create-env.md). Use the search box to find environments that are not listed. Press *More settings* to add an `Environment variable` or to edit the *Command* and *Arguments* fields for the environment you selected.
        2. In the *Compute resource* pane, select *Compute resources* for your training or [create a new compute resource](../../Researcher/user-interface/workspaces/create/create-compute.md). Use the search box to find resources that are not listed. Press *More settings* to use **Node Affinity** to limit the resources to a specific node.
        3. (Optional) Open the *Volume* pane, and press *Volume* to add a volume to your training.

            1. Select the *Storage class* from the dropdown.
            2. Select the *Access mode* from the dropdown.
            3. Enter a claim size, and select the units.
            4. Select a *Volume system* mode from the dropdown.
            5. Enter the *Container path* for the volume target location.
            6. Select a *Volume persistency*.

        4. (Optional) In the *Data sources* pane, press *add a new data source*. For more information, see [Creating a new data source](../../Researcher/user-interface/workspaces/create/create-ds.md). When complete, press *Create Data Source*.
        5. (Optional) In the *General* pane, add special settings for your training:

            1. Press *Auto-deletion* to delete the training automatically when it either completes or fails. You can configure the timeframe in days, hours, minutes, and seconds. If the timeframe is set to 0, the training will be deleted immediately after it completes or fails.
            2. Press *Annotation* to add a name and value to annotate the training. Repeat this step to add multiple annotations.
            3. Press *Label* to add a name and value to label the training. Repeat this step to add multiple labels.

    12. When your training configuration is complete, press *Create training*.
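
The *Auto-deletion* timeframe is entered as separate day, hour, minute, and second values. The following Python sketch shows how those four fields combine into a single delay and how a zero timeframe corresponds to immediate deletion. The function names are hypothetical and only illustrate the arithmetic; they are not part of the Run:ai UI or API.

```python
from datetime import timedelta


def auto_deletion_delay(days: int = 0, hours: int = 0,
                        minutes: int = 0, seconds: int = 0) -> timedelta:
    """Combine the four form fields into one delay, rejecting negatives."""
    fields = {"days": days, "hours": hours, "minutes": minutes, "seconds": seconds}
    for name, value in fields.items():
        if value < 0:
            raise ValueError(f"{name} must not be negative, got {value}")
    return timedelta(**fields)


def deletes_immediately(delay: timedelta) -> bool:
    """A timeframe of 0 means deletion right after completion or failure."""
    return delay == timedelta(0)


delay = auto_deletion_delay(days=1, hours=2, minutes=30)
print(delay.total_seconds())  # 86400 + 7200 + 1800 = 95400.0
print(deletes_immediately(auto_deletion_delay()))
```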

## Workload Policies

As an administrator, you can set *Policies* on Workloads. Policies allow administrators to *impose restrictions* and set *default values* for Researcher Workloads. For more information, see [Workload Policies](policies.md).
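
Conceptually, a policy combines the two ideas above: defaults fill in values the Researcher did not provide, and restrictions reject values that fall outside what the administrator allows. The Python sketch below illustrates that idea only. The field names (`gpus`, `image`), the maximum-value style of restriction, and the function itself are assumptions for illustration and do not reflect the actual Run:ai policy syntax.

```python
def apply_policy(submitted: dict, defaults: dict, limits: dict) -> dict:
    """Fill in default values, then enforce maximum-value restrictions.

    submitted: values the Researcher entered (hypothetical field names)
    defaults:  values used when a field was not submitted
    limits:    maximum allowed value per field
    """
    # Submitted values take precedence over defaults.
    spec = {**defaults, **submitted}
    for field, maximum in limits.items():
        if field in spec and spec[field] > maximum:
            raise ValueError(
                f"{field}={spec[field]} exceeds the allowed maximum of {maximum}"
            )
    return spec


defaults = {"gpus": 1, "image": "example/image:latest"}  # hypothetical defaults
limits = {"gpus": 4}                                     # hypothetical restriction

print(apply_policy({"gpus": 2}, defaults, limits))
```

A submission that requests more than the limit, for example `{"gpus": 8}`, raises `ValueError` instead of being adjusted silently.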
