Commit
Merge pull request #1125 from run-ai/workloads-support-218
Merge pull request #1124 from run-ai/workloads-support
yarongol committed Sep 23, 2024
2 parents e83a30b + 12a8230 commit f0892ba
Showing 15 changed files with 83 additions and 15 deletions.
2 changes: 1 addition & 1 deletion docs/Researcher/best-practices/convert-to-unattended.md
@@ -97,7 +97,7 @@ Please refer to [Command-Line Interface, runai submit](../cli-reference/runai-su

### Use CLI Policies

Different run configurations can vary significantly and are tedious to write out on the command line each time. To make life easier, our CLI lets administrators set policies for these configurations so that pre-configured settings are applied when submitting a Workload. Please refer to [Configure Command-Line Interface Policies](../../platform-admin/workloads/policies/policies.md).
Different run configurations can vary significantly and are tedious to write out on the command line each time. To make life easier, our CLI lets administrators set policies for these configurations so that pre-configured settings are applied when submitting a Workload. Please refer to [Configure Policies](../../platform-admin/workloads/policies/overview.md).

## Attached Files

2 changes: 1 addition & 1 deletion docs/Researcher/cli-reference/runai-submit.md
@@ -449,4 +449,4 @@ Note that the submit call may use a *policy* to provide defaults to any of the a
## See Also

* See any of the Quickstart documents [here](../Walkthroughs/quickstart-overview.md).
* See [policy configuration](../../platform-admin/workloads/policies/policies.md) for a description on how policies work.
* See [policy configuration](../../platform-admin/workloads/policies/overview.md) for a description on how policies work.
67 changes: 67 additions & 0 deletions docs/Researcher/workloads/workload-support.md
@@ -0,0 +1,67 @@

Workloads are the basic unit of work in Run:ai. Researchers and Engineers use workloads at every stage of their AI [Project](../../platform-admin/aiinitiatives/org/projects.md) lifecycle. A workload can be used to build, train, or deploy a model. Run:ai supports all types of Kubernetes workloads; Researchers can work with any workload in their organization, but gain the most value from Run:ai native workloads.

Run:ai offers three native types of workloads:

* Workspaces: for building the model
* Training: for model training tasks and data preparation
* Inference: for deploying and serving the model

Run:ai native workloads can be created via the Run:ai user interface, the [API](https://api-docs.run.ai/2.18/tag/Workloads), or the [Command-line interface](../../Researcher/cli-reference/Introduction.md).
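As an illustration of API-based submission, the sketch below assembles a request body for a hypothetical workspace workload. The endpoint path, payload schema, and field names (`projectId`, `gpuDevicesRequest`, and so on) are assumptions made for this example only — consult the Workloads API reference linked above for the actual schema.

```python
import json

# Hypothetical sketch of preparing a workspace submission for the Run:ai
# REST API. Field names and structure are illustrative assumptions, not
# the documented schema.

API_BASE = "https://<company>.run.ai"  # assumption: your tenant's URL


def build_workspace_payload(name: str, project_id: str,
                            image: str, gpus: int) -> dict:
    """Assemble an illustrative workspace-submission body."""
    return {
        "name": name,
        "projectId": project_id,
        "spec": {
            "image": image,
            "compute": {"gpuDevicesRequest": gpus},
        },
    }


payload = build_workspace_payload(
    "my-workspace", "proj-123", "jupyter/base-notebook", 1)
print(json.dumps(payload, indent=2))
```

In practice this body would be sent with an authenticated `POST` to the Workloads endpoint of your tenant; the UI and CLI build an equivalent request for you.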

## Levels of support

Different workload types have different levels of support, so it is important to understand which capabilities you need before selecting a workload type. The table below details the level of support for each workload type in Run:ai. Run:ai native workloads are fully supported with all of Run:ai's advanced features and capabilities, while third-party workloads are only partially supported. The list of capabilities can change between Run:ai versions.

| Functionality | Workload Type | | | | |
| ----- | :---: | :---: | :---: | :---: | ----- |
| | Run:ai workloads | | | | Third-party workloads |
| | Training - Standard | Workspace | Inference | Training - distributed | All K8s workloads |
| [Fairness](../../Researcher/scheduling/the-runai-scheduler.md#fairness) | v | v | v | v | v |
| [Priority and preemption](../../Researcher/scheduling/the-runai-scheduler.md#allocation--preemption) | v | v | v | v | v |
| [Over quota](../../Researcher/scheduling/the-runai-scheduler.md#over-quota-priority) | v | v | v | v | v |
| [Node pools](../../platform-admin/aiinitiatives/resources/node-pools.md) | v | v | v | v | v |
| Bin packing / Spread | v | v | v | v | v |
| Fractions | v | v | v | v | v |
| Dynamic fractions | v | v | v | v | v |
| Node level scheduler | v | v | v | v | v |
| GPU swap | v | v | v | v | v |
| Elastic scaling | NA | NA | v | v | v |
| [Gang scheduling](../../Researcher/scheduling/the-runai-scheduler.md#distributed-training) | v | v | v | v | v |
| [Monitoring](../../admin/maintenance/alert-monitoring.md) | v | v | v | v | v |
| [RBAC](../../admin/authentication/authentication-overview.md#role-based-access-control-rbac-in-runai) | v | v | v | v | |
| Workload awareness | v | v | v | v | |
| [Workload submission](../../Researcher/workloads/managing-workloads.md) | v | v | v | v | |
| Workload actions (stop/run) | v | v | v | | |
| [Policies](../../platform-admin/workloads/policies/overview.md) | v | v | v | v | |
| [Scheduling rules](../../platform-admin/aiinitiatives/org/scheduling-rules.md) | v | v | v | | |

!!! Note
__Workload awareness__

Specific workload-aware visibility, so that different pods are identified and treated as a single workload (for example GPU utilization, workload view, dashboards).

## Workload scopes

Workloads must be created under a [project](../../platform-admin/aiinitiatives/org/projects.md). A project is the fundamental organizational unit in a Run:ai account. To manage workloads, you must first create a project or have one created by an administrator.

## Policies and rules

[Policies and rules](../../platform-admin/workloads/policies/overview.md) let administrators set default values and impose restrictions on workloads, giving them greater control, ensuring compliance with organizational policies, and optimizing resource usage and utilization.
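The idea of defaults plus restrictions can be sketched as a small function: the policy fills in values the Researcher omitted and rejects values outside the allowed range. This models the concept only — the policy names, fields, and limits below are invented for illustration and do not reflect Run:ai's actual policy engine or schema.

```python
# Conceptual model of a workload policy: administrator-defined defaults
# fill in omitted fields, and rules reject out-of-range values.
# All names and limits here are illustrative assumptions.

POLICY = {
    "defaults": {"gpu": 1, "imagePullPolicy": "IfNotPresent"},
    "rules": {"gpu": {"min": 0, "max": 4}},
}


def apply_policy(request: dict, policy: dict) -> dict:
    # Defaults apply only where the request left a field unset.
    merged = {**policy["defaults"], **request}
    # Restrictions reject values outside the administrator-set range.
    for field, rule in policy["rules"].items():
        value = merged.get(field)
        if value is not None and not (rule["min"] <= value <= rule["max"]):
            raise ValueError(f"{field}={value} violates policy {rule}")
    return merged


print(apply_policy({"gpu": 2}, POLICY))
```

A request for 2 GPUs passes and inherits the default `imagePullPolicy`; a request for 8 GPUs would be rejected before submission.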

## Workload statuses

The following table describes the different phases in a workload life cycle.

| Phase | Description | Entry condition | Exit condition |
| :---- | :---- | :---- | :---- |
| Creating | Workload setup is initiated in the cluster; resources and pods are being provisioned | A workload is submitted | A multi-pod group is created |
| Pending | Workload is queued and awaiting resource allocation. | A pod group exists | All pods are scheduled |
| Initializing | Workload is retrieving images, starting containers, and preparing pods | All pods are scheduled (handling of multi-pod groups TBD) | All pods are initialized or a failure to initialize is detected |
| Running | Workload is currently in progress with all pods operational | All pods initialized (all containers in pods are ready) | workload completion or failure |
| Degraded | Pods may not align with specifications, network services might be incomplete, or persistent volumes may be detached. Check your logs for specific details. | Pending: All pods are running, but with issues. Running: All pods are running, but with issues. | Running: All resources are OK. Completed: Workload finished with fewer resources. Failed: Workload failure or user-defined rules |
| Deleting | Workload and its associated resources are being decommissioned from the cluster | Deleting the workload. | Resources are fully deleted |
| Stopped | The workload is on hold and resources are intact but inactive | Stopping the workload without deleting resources | Transitioning back to the initializing phase or proceeding to deleting the workload |
| Failed | Image retrieval failed or containers experienced a crash. Check your logs for specific details. | An error occurs preventing the successful completion of the workload | Terminal State |
| Completed | Workload has successfully finished its execution | The workload has finished processing without errors | Terminal State |
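The entry and exit conditions above can be read as a state machine. The sketch below encodes the phase transitions implied by the table — it is a conceptual model for reasoning about the life cycle, not code from the Run:ai scheduler, and the exact transition set (for example, which phases can become Degraded) is inferred from the table's wording.

```python
# A small state machine mirroring the workload life cycle table above.
# Transitions are inferred from the table's entry/exit conditions;
# this is a conceptual sketch, not the Run:ai scheduler's logic.

TRANSITIONS = {
    "Creating":     {"Pending"},
    "Pending":      {"Initializing", "Degraded"},
    "Initializing": {"Running", "Failed"},
    "Running":      {"Completed", "Failed", "Degraded", "Stopped", "Deleting"},
    "Degraded":     {"Running", "Completed", "Failed"},
    "Stopped":      {"Initializing", "Deleting"},
    "Deleting":     set(),
    "Completed":    set(),  # terminal state
    "Failed":       set(),  # terminal state
}


def advance(phase: str, next_phase: str) -> str:
    """Move to the next phase, rejecting transitions the table does not allow."""
    if next_phase not in TRANSITIONS[phase]:
        raise ValueError(f"illegal transition {phase} -> {next_phase}")
    return next_phase


# Walk the happy path: submit, queue, initialize, run, complete.
phase = "Creating"
for nxt in ("Pending", "Initializing", "Running", "Completed"):
    phase = advance(phase, nxt)
print(phase)  # Completed
```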

2 changes: 1 addition & 1 deletion docs/admin/authentication/non-root-containers.md
@@ -50,7 +50,7 @@ then verify that you cannot run `su` to become root within the container.
### Setting a Cluster-Wide Default


The two flags are voluntary; they are not enforced by the system. It is, however, possible to enforce them using [Policies](../../platform-admin/workloads/policies/policies.md). Policies allow an Administrator to force compliance on both the User Interface and the Command-line interface.
The two flags are voluntary; they are not enforced by the system. It is, however, possible to enforce them using [Policies](../../platform-admin/workloads/policies/overview.md). Policies allow an Administrator to force compliance on both the User Interface and the Command-line interface.


## Passing user identity
2 changes: 1 addition & 1 deletion docs/developer/cluster-api/workload-overview-dev.md
@@ -87,4 +87,4 @@ Each workload type has a matching kind of workload policy. For example, an `Inte

A Policy of each type can be defined _per-project_. There is also a _global_ policy that applies to any project that does not have a per-project policy.

For further details on policies, see [Policies](../../platform-admin/workloads/policies/policies.md).
For further details on policies, see [Policies](../../platform-admin/workloads/policies/overview.md).
2 changes: 1 addition & 1 deletion docs/home/whats-new-2-13.md
@@ -116,7 +116,7 @@ The association between workspaces and node pools is done using *Compute resourc

**Policies**
<!-- RUN-10588/10590 Allow workload policy to prevent the use of a new pvc -->
* Improved policy support by adding `DEFAULTS` in the `items` section in the policy. The `DEFAULTS` section sets the default behavior for items declared in this section. For example, this can be used to limit the submission of workloads only to existing PVCs. For more information and an example, see Policies, [Complex values](../platform-admin/workloads/policies/policies.md#complex-values).
* Improved policy support by adding `DEFAULTS` in the `items` section in the policy. The `DEFAULTS` section sets the default behavior for items declared in this section. For example, this can be used to limit the submission of workloads only to existing PVCs. For more information and an example, see Policies, [Complex values](../platform-admin/workloads/policies/old-policies.md#complex-values).

<!-- RUN-8904/8960 - Cluster wide PVC in workspaces -->
* Added support for making a PVC data source available to all projects. In the *New data source* form, when creating a new PVC data source, select *All* from the *Project* pane.
4 changes: 2 additions & 2 deletions docs/home/whats-new-2-15.md
@@ -78,8 +78,8 @@ date: 2023-Dec-3
#### Policies

* <!-- RUN-12698/RUN-12699 -->During Workspaces and Training creation, assets that do not comply with policies cannot be selected. These assets are greyed out, and their cards display a button that shows which policies are non-compliant.
* <!-- RUN-10622/RUN-10625 Policy blocks workloads that attempt to store data on the node-->Added configuration options to *Policies* in order to prevent the submission of workloads that use data sources of type `host path`. This prevents data from being stored on the node, so that data is not lost when a node is deleted. For configuration information, see [Prevent Data Storage on the Node](../platform-admin/workloads/policies/policies.md#prevent-data-storage-on-the-node).
* <!-- RUN-10575/RUN-10579 Add numeric rules in the policy to GPU memory, CPU memory & CPU -->Improved flexibility when creating policies which provide the ability to allocate a `min` and a `max` value for CPU and GPU memory. For configuration information, see [GPU and CPU memory limits](../platform-admin/workloads/policies/policies.md#gpu-and-cpu-memory-limits) in *Configuring policies*.
* <!-- RUN-10622/RUN-10625 Policy blocks workloads that attempt to store data on the node-->Added configuration options to *Policies* in order to prevent the submission of workloads that use data sources of type `host path`. This prevents data from being stored on the node, so that data is not lost when a node is deleted. For configuration information, see [Prevent Data Storage on the Node](../platform-admin/workloads/policies/old-policies.md#prevent-data-storage-on-the-node).
* <!-- RUN-10575/RUN-10579 Add numeric rules in the policy to GPU memory, CPU memory & CPU -->Improved flexibility when creating policies which provide the ability to allocate a `min` and a `max` value for CPU and GPU memory. For configuration information, see [GPU and CPU memory limits](../platform-admin/workloads/policies/old-policies.md#gpu-and-cpu-memory-limits) in *Configuring policies*.

#### Nodes and Node Pools

2 changes: 1 addition & 1 deletion docs/platform-admin/workloads/assets/data-volumes.md
@@ -117,6 +117,6 @@ You can attach a data volume to a workload during submission in the same way oth

Researchers can list available data volumes within their permitted scopes for easy selection.

For more information on using a data volume when submitting a workload, see [Submitting Workloads](submitting-workloads.md).
For more information on using a data volume when submitting a workload, see [Submitting Workloads](../submitting-workloads.md).

You can also add a data volume to your workload when submitting a workload via the API. For more information, see [Workloads](https://app.run.ai/api/docs#tag/Workloads).
2 changes: 1 addition & 1 deletion docs/platform-admin/workloads/policies/overview.md
@@ -10,7 +10,7 @@ A workload policy is an end-to-end solution for AI managers and administrators t

Run:ai provides two policy technologies.

[**YAML-Based policies**](policies.md) are the older policies. These policies:
[**YAML-Based policies**](old-policies.md) are the older policies. These policies:

* Require access to Kubernetes to view or change.
* Contact Run:ai support to convert the old policies to the new V2 policies format.
4 changes: 2 additions & 2 deletions docs/platform-admin/workloads/submitting-workloads.md
@@ -180,9 +180,9 @@ To submit a workload using the UI:

## Workload Policies

As an administrator, you can set *Policies* on Workloads. Policies allow administrators to *impose restrictions* and set *default values* for Researcher Workloads. For more information see [Workload Policies](../workloads/policies/policies.md).
As an administrator, you can set *Policies* on Workloads. Policies allow administrators to *impose restrictions* and set *default values* for Researcher Workloads. For more information see [Workload Policies](../workloads/policies/overview.md).

## Worklaod Ownership Protection
## Workload Ownership Protection

Workload ownership protection in Run:ai ensures that only the user who created a workload can delete or modify it. This feature safeguards important jobs and configurations from accidental or unauthorized modification by users who did not originally create the workload.

2 changes: 1 addition & 1 deletion graveyard/whats-new-2-14.md
@@ -38,7 +38,7 @@ TODO Add RBAC old--new conversion table here. -->

### Policy improvements

* <!-- RUN-10575/RUN-10579 Add numeric rules in the policy to GPU memory, CPU memory & CPU -->Improved flexibility when creating policies which provides the ability to allocate a `min` and a `max` value for CPU and GPU memory. For configuration information, see [GPU and CPU memory limits](../platform-admin/workloads/policies.md#gpu-and-cpu-memory-limits) in *Configuring policies*.
* <!-- RUN-10575/RUN-10579 Add numeric rules in the policy to GPU memory, CPU memory & CPU -->Improved flexibility when creating policies which provides the ability to allocate a `min` and a `max` value for CPU and GPU memory. For configuration information, see [GPU and CPU memory limits](../platform-admin/workloads/old-policies.md#gpu-and-cpu-memory-limits) in *Configuring policies*.

### Resource costing

2 changes: 1 addition & 1 deletion graveyard/whats-new-2022.md
@@ -23,7 +23,7 @@
* __CPU and CPU memory quotas__ can now be configured for projects and departments. These are hard quotas which means that the total amount of the requested resource for all workloads associated with a project/department cannot exceed the set limit. To enable this feature please call Run:ai customer support.
* __Workloads__. We have revamped the way Run:ai submits Jobs. Run:ai now submits [Workloads](../platform-admin/workloads/submitting-workloads.md). The change includes:
* New [Cluster API](../developer/cluster-api/workload-overview-dev.md). The older [API](../developer/deprecated/researcher-rest-api/overview.md) has been deprecated and remains for backward compatibility. The API creates all the resources required for the run, including volumes, services, and the like. It also deletes all resources when the workload itself is deleted.
* Administrative templates have been replaced with [Policies](../platform-admin/workloads/policies.md). Policies apply across all ways to submit jobs: command-line, API, and user interface.
* Administrative templates have been replaced with [Policies](../platform-admin/workloads/policies/overview.md). Policies apply across all ways to submit jobs: command-line, API, and user interface.
* `runai delete` has been changed in favor of `runai delete job`
* Self-hosted installation: The default OpenShift installation is now set to work with a __configured__ Openshift IdP. See [creation of backend values](../admin/runai-setup/self-hosted/ocp/backend.md) for more information. In addition, the default for OpenShift is now HTTPS.
* To send logs to Run:ai customer support there is a utility to package all logs into one tar file. Version 2.5 brings a new method that __automatically sends all new logs to Run:ai support__ servers for a set amount of time. See [collecting logs](../index.md#collect-logs-to-send-to-support) for more information.
2 changes: 1 addition & 1 deletion graveyard/workload-overview-admin.md
@@ -150,4 +150,4 @@ To submit a workload using the UI:

## Workload Policies

As an administrator, you can set *Policies* on Workloads. Policies allow administrators to *impose restrictions* and set *default values* for Researcher Workloads. For more information see [Workload Policies](../workloads/policies/policies.md).
As an administrator, you can set *Policies* on Workloads. Policies allow administrators to *impose restrictions* and set *default values* for Researcher Workloads. For more information see [Workload Policies](../workloads/policies/old-policies.md).
3 changes: 2 additions & 1 deletion mkdocs.yml
@@ -300,7 +300,7 @@ nav:
- 'Policies Examples' : 'platform-admin/workloads/policies/policy-examples.md'
- 'Policies Reference' : 'platform-admin/workloads/policies/policy-reference.md'
- 'Older Policies' :
- 'Policies V1' : 'platform-admin/workloads/policies/policies.md'
- 'Policies V1' : 'platform-admin/workloads/policies/old-policies.md'


- 'Best Practices' :
@@ -329,6 +329,7 @@ nav:
- 'Queue Fairness' : 'Researcher/Walkthroughs/walkthrough-queue-fairness.md'
- 'Workloads' :
- 'Managing Workloads' : 'Researcher/workloads/managing-workloads.md'
- 'Workload Support' : 'Researcher/workloads/workload-support.md'
- 'Workload Assets' :
- 'Overview' : 'Researcher/workloads/assets/overview.md'
- 'Environments' : 'Researcher/workloads/assets/environments.md'
