Merge pull request #321 from run-ai/v2.12
V2.12
jasonnovichRunAI committed Jun 18, 2023
2 parents 9c35cfb + 3a46e57 commit 1dc25e5
Showing 6 changed files with 68 additions and 212 deletions.
27 changes: 24 additions & 3 deletions docs/Researcher/cli-reference/runai-submit.md
@@ -9,7 +9,7 @@ Submit a Run:ai Job for execution.

## Examples

All examples assume a Run:ai Project has been set using `runai config project <project-name>`.
All examples assume a Run:ai Project has been set up using `runai config project <project-name>`.

Start an interactive Job:
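
For instance, a minimal sketch (the image, flags, and omission of a job name are illustrative, mirroring the examples further down):

```console
# illustrative interactive submission; image and flags follow the later examples
runai submit -i ubuntu -g 1 --interactive --attach
```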

@@ -66,12 +66,25 @@ Submit a Job without a name (automatically generates a name)
runai submit -i gcr.io/run-ai-demo/quickstart -g 1
```

Submit a job with an autogenerated name and expose it at a custom external URL:

```console
runai submit -i ubuntu --interactive --attach -g 1 --service-type=external-url --port 3745 --custom-url=<destination_url>
```

Submit a job with an autogenerated name and a system-generated external URL:

```console
runai submit -i ubuntu --interactive --attach -g 1 --service-type=external-url --port 3745
```

Submit a Job without an explicit name, using a pre-defined prefix and an incremental index suffix:

```console
runai submit --job-name-prefix -i gcr.io/run-ai-demo/quickstart -g 1
```
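
If `--job-name-prefix` takes the prefix string as its value (an assumption based on the surrounding text, not confirmed here), a hypothetical invocation with an explicit prefix might look like the following, where `my-train` is an illustrative prefix:

```console
# "my-train" is an illustrative prefix, not a documented default
runai submit --job-name-prefix my-train -i gcr.io/run-ai-demo/quickstart -g 1
```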


## Options

### Job Type
@@ -357,8 +370,16 @@ runai submit --job-name-prefix -i gcr.io/run-ai-demo/quickstart -g 1
#### -s | --service-type `<string>`

> External access type to interactive jobs. Options are: portforward (deprecated), loadbalancer, nodeport, ingress.
> External access type to interactive jobs. Options are (see the example sketch after this list):
> * `portforward` (deprecated)
> * `loadbalancer`
> * `nodeport`
> * `external-url`
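
For example, a hypothetical interactive submission exposed through `nodeport` might look like the following:

```console
# illustrative values: the image, GPU count, and port are examples only
runai submit -i gcr.io/run-ai-demo/quickstart -g 1 --interactive --service-type=nodeport --port 8888
```
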
#### --custom-url `<string>`

> An optional argument that specifies a custom URL when using the `external-url` service type. If not provided, the system generates a URL automatically.
### Access Control

#### --allow-privilege-escalation
2 changes: 2 additions & 0 deletions docs/admin/integration/deepspeed.md
@@ -1,5 +1,7 @@
# DeepSpeed Integration with Run:ai

# Working with DeepSpeed on top of Run:ai

DeepSpeed is a deep learning optimization library for PyTorch designed to reduce computing power and memory use, and to train large distributed models with better parallelism on existing computer hardware. DeepSpeed is optimized for low latency, high throughput training. It also includes the Zero Redundancy Optimizer (ZeRO) for training models with 1 trillion or more parameters. Other features include mixed precision training, single-GPU, multi-GPU, multi-node training, and custom model parallelism.

This article describes how to run a distributed workload on Kubernetes using an MPIJob with
7 changes: 4 additions & 3 deletions docs/admin/runai-setup/cluster-setup/cluster-prerequisites.md
@@ -233,7 +233,8 @@ Following are instructions on how to get the IP and set firewall settings.
If not already installed on your cluster, install the full `kube-prometheus-stack` through the [Prometheus community Operator](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack){target=_blank}.

!!! Note
If Prometheus has been installed on the cluster in the past, even if it was uninstalled (such as when upgrading from Run:ai 2.8 or lower), you will need to update Prometheus CRDs as described [here](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack#upgrading-chart){target=_blank}. For more information on the Prometheus bug see [here](https://github.com/prometheus-community/helm-charts/issues/2753){target=_blank}.
* If Prometheus has been installed on the cluster in the past, even if it was uninstalled (such as when upgrading from Run:ai 2.8 or lower), you will need to update Prometheus CRDs as described [here](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack#upgrading-chart){target=_blank}. For more information on the Prometheus bug see [here](https://github.com/prometheus-community/helm-charts/issues/2753){target=_blank}.
    * If you are running Kubernetes 1.21, you must install Prometheus stack version 45.23.0 or lower. Use the `--version` flag below. Alternatively, use Helm version 3.12 or later. For more information on the related Prometheus bug, see [here](https://github.com/prometheus-community/helm-charts/issues/3436){target=_blank}.

Then install the Prometheus stack by running:
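
For example, a minimal sketch assuming the upstream `prometheus-community` Helm chart repository (the `monitoring` namespace and `prometheus` release name are illustrative):

```console
# Add the community chart repository (skip if it is already configured)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install kube-prometheus-stack; the --version pin is only needed on Kubernetes 1.21 (see the note above)
helm install prometheus prometheus-community/kube-prometheus-stack \
    --namespace monitoring --create-namespace --version 45.23.0
```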

@@ -333,8 +334,8 @@ However, for the URL to be accessible outside the cluster you must configure you

* (Production only) __Run:ai System Nodes__: To reduce downtime and save CPU cycles on expensive GPU machines, we recommend that production deployments contain __two or more__ worker machines designated for Run:ai software. The nodes do not have to be dedicated to Run:ai, but for Run:ai purposes we would need:

* 8 CPUs
* 16GB of RAM
* 4 CPUs
* 8GB of RAM
* 50GB of Disk space

* __Shared data volume:__ Run:ai uses Kubernetes to abstract away the machine on which a container is running:
2 changes: 1 addition & 1 deletion docs/developer/metrics/metrics.md
@@ -128,4 +128,4 @@ For additional information, see Kubernetes [kube-state-metrics](https://github.c

## Create custom dashboards

To create custom dashboards based on the above metrics, please contact Run:ai customer support.
To create custom dashboards based on the above metrics, please contact Run:ai customer support.
70 changes: 35 additions & 35 deletions docs/home/whats-new-2-10.md
@@ -100,21 +100,21 @@ Cluster wide PVC is now replicated to namespaces that do not have an existing PV

## Fixed issues

|Internal ID|Description|
|-----------|--------------|
| RUN-9196 | Fixed dashboard overview displaying `running_workloads:cpu_only` rule.|
| RUN-9256 | Now supports the global configuration of memory request of memory-sensitive pods in the cluster.|
| RUN-9219 | Fixed `runai describe` on pytorch outputs "Is Distributed Workload: false".|
| RUN-9221 | Fixed CLI `runai describe` job nil pointer exception.|
| RUN-9220 | Fixed PVC duplication errors so that it does not duplicate for namespaces with the same PVC name and bound PVCs.|
| RUN-9224 | Fixed Scheduler not reporting the correct event on EFA (status history).|
| RUN-9189 | Improved Scheduler performance to reclaim action slowness in really big clusters.|
| RUN-450 | Change "edit boxes" to labels. |
| RUN-9218 | Added support for `pod-running-timeout` when using `runai port-forward`.|
| RUN-9252 | Fixed `runai port-forward` to be consistent with `runai bash` (`--target` is now `--pod`).|
| RUN-9071 | Fixed registries api call crashing the ui when returning an error.|
| RUN-8794 | Newer dashboards are now deployed for tenants using grafanlabs.|
| RUN-9212 | Fixed filter jobs by type. As a workaround, you can also you can sort by type.|
| Internal ID | Description |
| ----------- | ---------------------------------------------------------------------------------------------------------------- |
| RUN-9196 | Fixed dashboard overview displaying `running_workloads:cpu_only` rule. |
| RUN-9256 | Now supports the global configuration of memory request of memory-sensitive pods in the cluster. |
| RUN-9219    | Fixed `runai describe` on PyTorch jobs outputting "Is Distributed Workload: false". |
| RUN-9221 | Fixed CLI `runai describe` job nil pointer exception. |
| RUN-9220 | Fixed PVC duplication errors so that it does not duplicate for namespaces with the same PVC name and bound PVCs. |
| RUN-9224 | Fixed Scheduler not reporting the correct event on EFA (status history). |
| RUN-9189    | Improved Scheduler performance for slow reclaim actions in very large clusters. |
| RUN-450     | Changed "edit boxes" to labels. |
| RUN-9218 | Added support for `pod-running-timeout` when using `runai port-forward`. |
| RUN-9252 | Fixed `runai port-forward` to be consistent with `runai bash` (`--target` is now `--pod`). |
| RUN-9071    | Fixed the registries API call crashing the UI when it returned an error. |
| RUN-8794    | Newer dashboards are now deployed for tenants using Grafana Labs. |
| RUN-9212    | Fixed filtering jobs by type. As a workaround, you can also sort by type. |

---------------------
## Version 2.10.5
@@ -194,26 +194,26 @@ Added support for Ephemeral PVC in the CLI and in the job submission form. For more info

## Known issues

|Internal ID|Description|Workaround|
|-----------|--------------|--------------|
| RUN-8695 | SSO users that logged in via SAML can't login again after disabling and reenabling SSO. | |
| RUN-8680 | A user in an OCP group with roles that belong to that group should be able to submit a job from the UI. | |
| RUN-8601 | Warning when the CLI command `runai suspend` is used. | |
| RUN-8422 | Remove Knative unnecessary requests when inference is not enabled. | |
| RUN-7874 | A new job returns `malformed URL` when a project is not connected to a namespace. | |
| RUN-6301 | A job in the job list side panel shows both `pending` and `running` at the same time. | |
| Internal ID | Description | Workaround |
| ----------- | ------------------------------------------------------------------------------------------------------- | ---------- |
| RUN-8695    | SSO users that logged in via SAML can't log in again after disabling and re-enabling SSO. | |
| RUN-8680 | A user in an OCP group with roles that belong to that group should be able to submit a job from the UI. | |
| RUN-8601    | A warning is displayed when the CLI command `runai suspend` is used. | |
| RUN-8422    | Unnecessary Knative requests are made when inference is not enabled. | |
| RUN-7874 | A new job returns `malformed URL` when a project is not connected to a namespace. | |
| RUN-6301 | A job in the job list side panel shows both `pending` and `running` at the same time. | |

## Fixed issues

|Internal ID|Description|
|-----------|--------------|
| RUN-8223 | Missed foreign key to tenants table. |
| RUN-5187 | S3 can now be configured to work in airgapped environments. |
| RUN-8276 | 503 error when creating a workload (request timeout for validation webhook). |
| RUN-7266 | Allocation bug - a researcher asked for 2 GPU for Interactive Job and other jobs received the allocated GPU within the same node |
| RUN-8418 | different user when submitting via runai cli and vi ui submit form |
| RUN-6838 | When submitting a job with port out of range, the job is submitted successfully however the submission actually fails. |
| RUN-8196 | Nodepools aren't visible in 2.9 UI. |
| RUN-7435 | Run:ai CLI submit doesn't parse correctly environment variables that end with a '='. |
| RUN-8192 | The UI shows a deleted job in the Current Jobs tab. |
| RUN-7776 | User does not exist in the UI due to pagination limitation. |
| Internal ID | Description |
| ----------- | -------------------------------------------------------------------------------------------------------------------------------- |
| RUN-8223 | Missed foreign key to tenants table. |
| RUN-5187 | S3 can now be configured to work in airgapped environments. |
| RUN-8276 | 503 error when creating a workload (request timeout for validation webhook). |
| RUN-7266    | Allocation bug: a researcher asked for 2 GPUs for an interactive Job, and other jobs on the same node received the allocated GPUs. |
| RUN-8418    | A different user is shown when submitting via the runai CLI versus the UI submit form. |
| RUN-6838    | When submitting a job with a port out of range, the job appears to be submitted successfully; however, the submission actually fails. |
| RUN-8196    | Node pools aren't visible in the 2.9 UI. |
| RUN-7435    | Run:ai CLI submit doesn't correctly parse environment variables that end with '='. |
| RUN-8192    | The UI shows a deleted job in the Current Jobs tab. |
| RUN-7776    | A user does not appear in the UI due to a pagination limitation. |
172 changes: 2 additions & 170 deletions docs/home/whats-new-2-8.md
@@ -1,175 +1,7 @@
# Run:ai Version 2.8

## Version 2.8.21

### Release date

May 2022

#### Fixed issues

|Internal ID|Description|
|-----------|--------------|
| RUN-9832 | Fixed an issue where the node pool controller removes unschedulable jobs from nodes. |
| RUN-10047| Fixed an issue where the node pool column is missing in the control-plane. |

## Version 2.8.20

### Release date

May 2022

#### Fixed issues

|Internal ID|Description|
|-----------|--------------|
| RUN-9485 | Fixed an issue where jobs that have failed appear as if they are running. |

## Version 2.8.19

### Release date

May 2022

#### Fixed issues

|Internal ID|Description|
|-----------|--------------|
| RUN-9354 | Fixed an issue where the `RUNAI_GPU_MEMORY_LIMIT` environment variable is set and not applied. |

## Version 2.8.18

### Release date

May 2022

#### Fixed issues

|Internal ID|Description|
|-----------|--------------|
| RUN-9113 | Fixed an issue in the Scheduler where pods were scheduled to nodes without enough CPU resources. |

## Version 2.8.17

### Release date

May 2022

<!-- RUN-6345 -->
Added the `Node Pool` column to the `Jobs`, `Inference`, and `Workspaces` tables in the *UI*. This feature is only available when using Control Plane 2.9 or later.

#### Fixed issues

|Internal ID|Description|
|-----------|--------------|
| RUN-8709 | Fixed an issue to make S3 storage work in airgapped environments. |
| RUN-8276 | Fixed an issue when creating a workload leads to a 503 error by increasing the timeout. |

## Version 2.8.16

### Release date


#### Fixed issues

|Internal ID|Description|
|-----------|--------------|
| RUN-8246 | Fixed an issue with large files uploading via Jupyter notebooks. |
| RUN-8366 | Fixed an issue where the scheduler is slow when many podgroups are configured. |

## Version 2.8.15

### Release date

#### Fixed issues

|Internal ID|Description|
|-----------|--------------|
| RUN-7686 | Fixed issues with syncing node pools from the cluster in OpenShift environments. |

## Version 2.8.14

### Release date

#### Fixed issues

|Internal ID|Description|
|-----------|--------------|
| RUN-7776 | Fixed a *UI* issue not displaying more than 100 users. |
| RUN-7726 | Increased the number of allowed API requests to the API server from the researcher service to prevent performance throttling. |
| RUN-7106 | Fixed the *UI* not showing workloads in the cluster when its stopped due to marking the `podgroup` as `not in cluster`. |
| RUN-6995 | Fixed an issue where Group Mapping from an SSO Group to the Researcher Manager Role was not working. |

## Version 2.8.13

### Release date

<!-- RUN-6732 -->
Added support for the scheduling of Kubeflow PyTorch jobs.

#### Fixed issues

|Internal ID|Description|
|-----------|--------------|
| RUN-7240 | Fixed inability to submit fractional jobs on non-default node pools. |
| RUN-7205 | Fixed an issue where `configmaps` aren't deleted for deployments even after the relevant pods are removed. |
| RUN-6832 | Fixed prometheus deployment not discovering the `servicemonitors` within projects. |
| RUN-6800 | Fixed incorrect Prometheus permissions for querying job metrics. |
| RUN-6766 | Fixed an issue mounting s3 file systems. |
| RUN-6538 | Fixed an issue in the Scheduler where the pod was restarted due to an `out of memory` error. |
| RUN-6109 | Fixed an issue in the *UI* that prevents the quick creation of sequential jobs. |
| RUN-5527 | Fixed an issue where idle allocated GPU metrics are not displayed for MIG workloads in OpenShift. |
| RUN-5489 | Fixed an issue when installing Run:ai cluster components that require root access. |

## Version 2.8.12

### Release date

#### Fixed issues

|Internal ID|Description|
|-----------|--------------|
| RUN-6216 | Fixed an issue with the multi cluster overview dashboard so that the allocated GPU in the table of each cluster is correct. |

## Version 2.8.11

### Release date

<!--RUN-6392 -->
Changed the option to generate Jupyter arguments from using `startNotebook` to any command.

#### Fixed issues

|Internal ID|Description|
|-----------|--------------|
| RUN-6718 | Fixed an issue where some graphs were showing the wrong date. |
| RUN-6667 | Fixed an issue where the Run:ai scheduler was crashing in a reclaim action. |
| RUN-6604 | Fixed an issue where a new MIG request is issued without the device size. |
| RUN-6536 | Fixed a crash in the *cli* related to the policy for `allow-privilege-escalation`. |
| RUN-6460 | Fixed an issue using a Jupyter notebook to mount an S3 bucket and not permitting read/write access.|
| RUN-6400 | Fixed an issue on EKS (Amazon Kubernetes Server), where every *CLI* command response starts with an error. |
| RUN-6399 | Fixed an issue where `requestedGPU` is always 0 for MPI jobs displayed in the distributed workloads Job list.|
| RUN-6359 | Fixed an issue with `UnexpectedAdmissionError` on a job using a fractional GPU. |
| RUN-6309 | Fixed an issue where the dynamic MIG Manager didn't connect to a cluster role in OpenShift environments. |
| RUN-5492 | Fixed an issue where `runai-container-toolkit` doesn't need root permissions. |
| RUN-5444 | Fixed an issue where the Dynamic MIG feature was not working with A-100 and 80GB of memory. |
| RUN-5226 | Fixed an issue where there is more than 1 NVIDIA MIG workload, the `nvidia-smi` command sent to one of the workloads will result with no devices.|

## Version 2.8.9

### Release date

#### Fixed issues

|Internal ID|Description|
|-----------|--------------|
|RUN-6519 | Fixed an issue with the scheduler where it was not able to detect PV and PVCs. |

## Version 2.8.0

### Release date

November 2022
## Release Date
November 2022

## Release Content
<!--
