runai submit-dist tf

Description

Version 2.10 and later.

Submit a distributed TensorFlow training run:ai job to run.

Note

To use distributed training you need to have installed the < insert TensorFlow operator here > as specified < insert pre-requisites link here >.

Syntax notes:

  • Options with a value type of stringArray mean that you can add multiple values. You can either separate values with a comma or add the flag twice.

Examples

runai submit-dist tf --name distributed-job --workers=2 -g 1 \
     -i <image_name>

Options

Distributed

--clean-pod-policy < string >

The CleanPodPolicy controls deletion of pods when a job terminates. The policy can be one of the following values:

  • Running—only pods still running when a job completes (for example, parameter servers) will be deleted immediately. Completed pods will not be deleted so that the logs will be preserved. (Default)
  • All—all (including completed) pods will be deleted immediately when the job finishes.
  • None—no pods will be deleted when the job completes.

--workers <int>

Number of worker replicas for the distributed training Job.

Naming and Shortcuts

--job-name-prefix <string>

The prefix to use to automatically generate a Job name with an incremental index. When a Job name is omitted Run:ai will generate a Job name. The optional --job-name-prefix flag creates Job names with the provided prefix.

--name <string>

The name of the Job.

--template <string>

Load default values from a workload.

Container Definition

--add-capability <stringArray>

Add linux capabilities to the container.

-a | --annotation <stringArray>

Set annotations variables in the container.

--attach

Default is false. If set to true, wait for the Pod to start running. When the pod starts running, attach to the Pod. The flag is equivalent to the command runai attach.

The --attach flag also sets --tty and --stdin to true.

--command

Overrides the image's entry point with the command supplied after '--'. When not using the --command flag, the entry point will not be overridden and the string after -- will be appended as arguments to the entry point command.

Example:

--command -- run.sh 1 54 will start the container and run run.sh 1 54

-- script.py 10000 will append script.py 10000 as arguments to the entry point command (e.g. python)
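
As an illustration, a complete submission that overrides the entry point might look like the following sketch (the training script and its arguments are placeholders, not part of this reference):

runai submit-dist tf --name distributed-job --workers=2 -g 1 \
     -i <image_name> --command -- python train.py --epochs 10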

--create-home-dir

Create a temporary home directory for the user in the container. Data saved in this directory will not be saved when the container exits. For more information see non root containers.

-e <stringArray> | --environment

Define environment variables to be set in the container. To set multiple values add the flag multiple times (-e BATCH_SIZE=50 -e LEARNING_RATE=0.2).

--image <string> | -i <string>

Image to use when creating the container for this Job

--image-pull-policy <string>

Pulling policy of the image when starting a container. Options are:

  • Always (default): force image pulling to check whether local image already exists. If the image already exists locally and has the same digest, then the image will not be downloaded.
  • IfNotPresent: the image is pulled only if it is not already present locally.
  • Never: the image is assumed to exist locally. No attempt is made to pull the image.

For more information see Kubernetes documentation.

-l | --label <stringArray>

Set labels variables in the container.

--preferred-pod-topology-key <string>

If possible, all pods of this job will be scheduled onto nodes that have a label with this key and identical values.

--required-pod-topology-key <string>

Enforce scheduling pods of this job onto nodes that have a label with this key and identical values.

--stdin

Keep stdin open for the container(s) in the pod, even if nothing is attached.

-t | --tty

Allocate a pseudo-TTY.

--working-dir <string>

Starts the container with the specified directory as the current directory.

Resource Allocation

--cpu <double>

CPU units to allocate for the Job (0.5, 1, etc.). The Job will receive at least this amount of CPU. Note that the Job will not be scheduled unless the system can guarantee this amount of CPUs to the Job.

--cpu-limit <double>

Limitations on the number of CPUs consumed by the Job (for example 0.5, 1). The system guarantees that this Job will not be able to consume more than this amount of CPUs.

--extended-resource

Request access to an extended resource. Syntax: <resource-name>=<resource_quantity>.

-g | --gpu <float>

GPU units to allocate for the Job (0.5, 1).

--gpu-memory

GPU memory to allocate for this Job (1G, 20M, etc.). The Job will receive this amount of memory. Note that the Job will not be scheduled unless the system can guarantee this amount of GPU memory to the Job.

--memory <string>

CPU memory to allocate for this Job (1G, 20M, etc.). The Job will receive at least this amount of memory. Note that the Job will not be scheduled unless the system can guarantee this amount of memory to the Job.

--memory-limit

CPU memory to allocate for this Job (1G, 20M, etc.). The system guarantees that this Job will not be able to consume more than this amount of memory. The Job will receive an error when trying to allocate more memory than this limit.

--mig-profile <string>

MIG profile to allocate for the job (1g.5gb, 2g.10gb, 3g.20gb, 4g.20gb, 7g.40gb)

Job Lifecycle

--backoff-limit <int>

The number of times the Job will be retried before failing. The default is 6. This flag will only work with training workloads (when the --interactive flag is not specified).

Storage

--git-sync <stringArray>

Clone a git repository into the container running the Job. The parameter should follow the syntax: source=REPOSITORY,branch=BRANCH_NAME,rev=REVISION,username=USERNAME,password=PASSWORD,target=TARGET_DIRECTORY_TO_CLONE.

--large-shm

Mount a large /dev/shm device.

--mount-propagation

Enable HostToContainer mount propagation for all container volumes

--nfs-server <string>

Use this flag to specify a default NFS host for --volume flag. Alternatively, you can specify NFS host for each volume individually (see --volume for details).

--pvc [Storage_Class_Name]:Size:Container_Mount_Path:[ro]

--pvc Pvc_Name:Container_Mount_Path:[ro]

Mount a persistent volume claim into a container.

Note

This option is being deprecated from version 2.10 and above. To mount existing or newly created Persistent Volume Claim (PVC), use the parameters --pvc-exists and --pvc-new.

The 2 syntax types of this command are mutually exclusive. You can either use the first or second form, but not a mixture of both.

Storage_Class_Name is a storage class name that can be obtained by running kubectl get storageclasses.storage.k8s.io. This parameter may be omitted if there is a single storage class in the system, or you are using the default storage class.

Size is the volume size you want to allocate. See Kubernetes documentation for how to specify volume sizes

Container_Mount_Path. A path internal to the container where the storage will be mounted

Pvc_Name. The name of a pre-existing Persistent Volume Claim to mount into the container

Examples:

--pvc :3Gi:/tmp/john:ro - Allocate 3GB from the default Storage class. Mount it to /tmp/john as read-only

--pvc my-storage:3Gi:/tmp/john:ro - Allocate 3GB from the my-storage storage class. Mount it to /tmp/john as read-only

--pvc :3Gi:/tmp/john - Allocate 3GB from the default storage class. Mount it to /tmp/john as read-write

--pvc my-pvc:/tmp/john - Use a Persistent Volume Claim named my-pvc. Mount it to /tmp/john as read-write

--pvc my-pvc-2:/tmp/john:ro - Use a Persistent Volume Claim named my-pvc-2. Mount it to /tmp/john as read-only

--pvc-exists <string>

Mount a persistent volume. You must include a claimname and path.

  • claim name—The name of the persistent volume claim. Can be obtained by running

kubectl get pvc

  • path—the path internal to the container where the storage will be mounted

Use the format:

claimname=<CLAIM_NAME>,path=<PATH>
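
For example (the claim name and mount path below are hypothetical):

--pvc-exists claimname=my-claim,path=/data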

--pvc-new <string>

Mount a persistent volume claim (PVC). If the PVC does not exist, it will be created based on the parameters entered. If a PVC exists, it will be used with its defined attributes and the parameters in the command will be ignored.

  • claim name—The name of the persistent volume claim.
  • storage class—A storage class name that can be obtained by running

kubectl get storageclasses.storage.k8s.io.

storageclass may be omitted if there is a single storage class in the system, or you are using the default storage class.

  • size—The volume size you want to allocate for the PVC when creating it. See Kubernetes documentation to specify volume sizes.
  • accessmode—The description of the desired volume capabilities for the PVC.
  • ro—Mount the PVC with read-only access.
  • ephemeral—The PVC will be created as volatile temporary storage which is only present during the running lifetime of the job.

Use the format:

storageclass=<storageclass>,size=<size>,path=<path>,ro,accessmode=<accessmode>

--s3 <string>

Mount an S3 compatible storage into the container running the job. The parameter should follow the syntax:

bucket=BUCKET,key=KEY,secret=SECRET,url=URL,target=TARGET_PATH

All the fields, except url=URL, are mandatory. Default for url is

url=https://s3.amazon.com

-v | --volume 'Source:Container_Mount_Path:[ro]:[nfs-host]'

Volumes to mount into the container.

Examples:

-v /raid/public/john/data:/root/data:ro

Mount /root/data to local path /raid/public/john/data for read-only access.

-v /public/data:/root/data::nfs.example.com

Mount /root/data to NFS path /public/data on NFS server nfs.example.com for read-write access.

Network

--address <string>

Comma separated list of IP addresses to listen to when running with --service-type portforward (default: localhost)

--host-ipc

Use the host's ipc namespace. Controls whether the pod containers can share the host IPC namespace. IPC (POSIX/SysV IPC) namespace provides separation of named shared memory segments, semaphores, and message queues. Shared memory segments are used to accelerate inter-process communication at memory speed, rather than through pipes or the network stack.

For further information see docker run reference documentation.

--host-network

Use the host's network stack inside the container. For further information see docker run reference documentation.

--port <stringArray>

Expose ports from the Job container.

-s | --service-type <string>

External access type to interactive jobs. Options are: portforward, loadbalancer, nodeport, ingress.

Access Control

--allow-privilege-escalation

Allow the job to gain additional privileges after start.

--run-as-user

Run in the context of the current user running the Run:ai command rather than the root user. While the default container user is root (same as in Docker), this command allows you to submit a Job running under your Linux user. This would manifest itself in access to operating system resources, in the owner of new folders created under shared directories, etc. Alternatively, if your cluster is connected to Run:ai via SAML, you can map the container to use the Linux UID/GID which is stored in the organization's directory. For more information see non root containers.

Scheduling

--node-pools <string>

Instructs the scheduler to run this workload using a specific set of nodes which are part of a Node Pool. You can specify one or more node pools to form a prioritized list of node pools that the scheduler will use to find one node pool that can satisfy the workload's specification. To use this feature your Administrator will need to label nodes as explained here: Limit a Workload to a Specific Node Group, or use existing node labels, then create a node-pool and assign the label to the node-pool. This flag can be used in conjunction with node-type and Project-based affinity. In this case, the flag is used to refine the list of allowable node groups set from a node-pool. For more information see: Working with Projects.

--node-type <string>

Allows defining specific Nodes (machines) or a group of Nodes on which the workload will run. To use this feature your Administrator will need to label nodes as explained here: Limit a Workload to a Specific Node Group.

--toleration <string>

Specify one or more toleration criteria, to ensure that the workload is not scheduled onto an inappropriate node. This is done by matching the workload tolerations to the taints defined for each node. For further details see Kubernetes Taints and Tolerations Guide.

The format of the string:

operator=Equal|Exists,key=KEY,[value=VALUE],[effect=NoSchedule|NoExecute|PreferNoSchedule],[seconds=SECONDS]
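
For example, the following sketch (the taint key is hypothetical) tolerates nodes tainted with a NoSchedule taint:

--toleration operator=Exists,key=dedicated,effect=NoSchedule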

Global Flags

--loglevel (string)

Set the logging level. One of: debug | info | warn | error (default "info")

--project | -p (string)

Specify the Project to which the command applies. Run:ai Projects are used by the scheduler to calculate resource eligibility. By default, commands apply to the default Project. To change the default Project use runai config project <project-name>.

--help | -h

Show help text.

Output

The command will attempt to submit a distributed TensorFlow Job. You can follow up on the Job by running runai list jobs or runai describe job <job-name>.

See Also


Last update: 2023-07-16
Created: 2023-03-07

runai submit-dist mpi

Description

Submit a Distributed Training (MPI) Run:ai Job to run.

Note

To use distributed training you need to have installed the Kubeflow MPI Operator as specified here

Syntax notes:

  • Options with a value type of stringArray mean that you can add multiple values. You can either separate values with a comma or add the flag twice.

Examples

You can start an unattended mpi training Job of name dist1, based on Project team-a using a quickstart-distributed image:

runai submit-dist mpi --name dist1 --workers=2 -g 1 \
     -i gcr.io/run-ai-demo/quickstart-distributed:v0.3.0 -e RUNAI_SLEEP_SECS=60

(see: distributed training Quickstart).

Options

Distributed

--clean-pod-policy < string >

The CleanPodPolicy controls deletion of pods when a job terminates. The policy can be one of the following values:

  • Running—only pods still running when a job completes (for example, parameter servers) will be deleted immediately. Completed pods will not be deleted so that the logs will be preserved. (Default)
  • All—all (including completed) pods will be deleted immediately when the job finishes.
  • None—no pods will be deleted when the job completes.

--workers < int >

Number of worker replicas for the distributed training Job.

--slots-per-worker < int >

Number of slots to allocate for each worker.
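
As a sketch, the distributed flags above can be combined in a single submission (the slot count is illustrative; the image is the one used in the Examples section):

runai submit-dist mpi --name dist1 --workers=2 --slots-per-worker=1 -g 1 \
     -i gcr.io/run-ai-demo/quickstart-distributed:v0.3.0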

Naming and Shortcuts

--job-name-prefix <string>

The prefix to use to automatically generate a Job name with an incremental index. When a Job name is omitted Run:ai will generate a Job name. The optional --job-name-prefix flag creates Job names with the provided prefix.

--name <string>

The name of the Job.

--template <string>

Load default values from a workload.

Container Definition

--add-capability <stringArray>

Add linux capabilities to the container.

-a | --annotation <stringArray>

Set annotations variables in the container.

--attach

Default is false. If set to true, wait for the Pod to start running. When the pod starts running, attach to the Pod. The flag is equivalent to the command runai attach.

The --attach flag also sets --tty and --stdin to true.

--command

Overrides the image's entry point with the command supplied after '--'. When not using the --command flag, the entry point will not be overridden and the string after -- will be appended as arguments to the entry point command.

Example:

--command -- run.sh 1 54 will start the container and run run.sh 1 54

-- script.py 10000 will append script.py 10000 as arguments to the entry point command (e.g. python)

--create-home-dir

Create a temporary home directory for the user in the container. Data saved in this directory will not be saved when the container exits. For more information see non root containers.

-e <stringArray> | --environment

Define environment variables to be set in the container. To set multiple values add the flag multiple times (-e BATCH_SIZE=50 -e LEARNING_RATE=0.2).

--image <string> | -i <string>

Image to use when creating the container for this Job

--image-pull-policy <string>

Pulling policy of the image when starting a container. Options are:

  • Always (default): force image pulling to check whether local image already exists. If the image already exists locally and has the same digest, then the image will not be downloaded.
  • IfNotPresent: the image is pulled only if it is not already present locally.
  • Never: the image is assumed to exist locally. No attempt is made to pull the image.

For more information see Kubernetes documentation.

-l | --label <stringArray>

Set labels variables in the container.

--preferred-pod-topology-key <string>

If possible, all pods of this job will be scheduled onto nodes that have a label with this key and identical values.

--required-pod-topology-key <string>

Enforce scheduling pods of this job onto nodes that have a label with this key and identical values.

--stdin

Keep stdin open for the container(s) in the pod, even if nothing is attached.

-t | --tty

Allocate a pseudo-TTY.

--working-dir <string>

Starts the container with the specified directory as the current directory.

Resource Allocation

--cpu <double>

CPU units to allocate for the Job (0.5, 1, etc.). The Job will receive at least this amount of CPU. Note that the Job will not be scheduled unless the system can guarantee this amount of CPUs to the Job.

--cpu-limit <double>

Limitations on the number of CPUs consumed by the Job (for example 0.5, 1). The system guarantees that this Job will not be able to consume more than this amount of CPUs.

--extended-resource

Request access to an extended resource. Syntax: <resource-name>=<resource_quantity>.

-g | --gpu <float>

GPU units to allocate for the Job (0.5, 1).

--gpu-memory

GPU memory to allocate for this Job (1G, 20M, etc.). The Job will receive this amount of memory. Note that the Job will not be scheduled unless the system can guarantee this amount of GPU memory to the Job.

--memory <string>

CPU memory to allocate for this Job (1G, 20M, etc.). The Job will receive at least this amount of memory. Note that the Job will not be scheduled unless the system can guarantee this amount of memory to the Job.

--memory-limit

CPU memory to allocate for this Job (1G, 20M, etc.). The system guarantees that this Job will not be able to consume more than this amount of memory. The Job will receive an error when trying to allocate more memory than this limit.

--mig-profile <string>

MIG profile to allocate for the job (1g.5gb, 2g.10gb, 3g.20gb, 4g.20gb, 7g.40gb)

Job Lifecycle

--backoff-limit <int>

The number of times the Job will be retried before failing. The default is 6. This flag will only work with training workloads (when the --interactive flag is not specified).

Storage

--git-sync <stringArray>

Clone a git repository into the container running the Job. The parameter should follow the syntax: source=REPOSITORY,branch=BRANCH_NAME,rev=REVISION,username=USERNAME,password=PASSWORD,target=TARGET_DIRECTORY_TO_CLONE.
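
For illustration only (the repository URL, branch, and target directory are placeholders; the remaining fields follow the syntax above):

--git-sync source=https://github.com/example/repo.git,branch=main,target=/workspace/repo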

--large-shm

Mount a large /dev/shm device.

--mount-propagation

Enable HostToContainer mount propagation for all container volumes

--nfs-server <string>

Use this flag to specify a default NFS host for --volume flag. Alternatively, you can specify NFS host for each volume individually (see --volume for details).

--pvc [Storage_Class_Name]:Size:Container_Mount_Path:[ro]

--pvc Pvc_Name:Container_Mount_Path:[ro]

Mount a persistent volume claim into a container.

Note

This option is being deprecated from version 2.10 and above. To mount existing or newly created Persistent Volume Claim (PVC), use the parameters --pvc-exists and --pvc-new.

The 2 syntax types of this command are mutually exclusive. You can either use the first or second form, but not a mixture of both.

Storage_Class_Name is a storage class name that can be obtained by running kubectl get storageclasses.storage.k8s.io. This parameter may be omitted if there is a single storage class in the system, or you are using the default storage class.

Size is the volume size you want to allocate. See Kubernetes documentation for how to specify volume sizes

Container_Mount_Path. A path internal to the container where the storage will be mounted

Pvc_Name. The name of a pre-existing Persistent Volume Claim to mount into the container

Examples:

--pvc :3Gi:/tmp/john:ro - Allocate 3GB from the default Storage class. Mount it to /tmp/john as read-only

--pvc my-storage:3Gi:/tmp/john:ro - Allocate 3GB from the my-storage storage class. Mount it to /tmp/john as read-only

--pvc :3Gi:/tmp/john - Allocate 3GB from the default storage class. Mount it to /tmp/john as read-write

--pvc my-pvc:/tmp/john - Use a Persistent Volume Claim named my-pvc. Mount it to /tmp/john as read-write

--pvc my-pvc-2:/tmp/john:ro - Use a Persistent Volume Claim named my-pvc-2. Mount it to /tmp/john as read-only

--pvc-exists <string>

Mount a persistent volume. You must include a claimname and path.

  • claim name—The name of the persistent volume claim. Can be obtained by running

kubectl get pvc

  • path—the path internal to the container where the storage will be mounted

Use the format:

claimname=<CLAIM_NAME>,path=<PATH>

--pvc-new <string>

Mount a persistent volume claim (PVC). If the PVC does not exist, it will be created based on the parameters entered. If a PVC exists, it will be used with its defined attributes and the parameters in the command will be ignored.

  • claim name—The name of the persistent volume claim.
  • storage class—A storage class name that can be obtained by running

kubectl get storageclasses.storage.k8s.io.

storageclass may be omitted if there is a single storage class in the system, or you are using the default storage class.

  • size—The volume size you want to allocate for the PVC when creating it. See Kubernetes documentation to specify volume sizes.
  • accessmode—The description of the desired volume capabilities for the PVC.
  • ro—Mount the PVC with read-only access.
  • ephemeral—The PVC will be created as volatile temporary storage which is only present during the running lifetime of the job.

Use the format:

storageclass=<storageclass>,size=<size>,path=<path>,ro,accessmode=<accessmode>

--s3 <string>

Mount an S3 compatible storage into the container running the job. The parameter should follow the syntax:

bucket=BUCKET,key=KEY,secret=SECRET,url=URL,target=TARGET_PATH

All the fields, except url=URL, are mandatory. Default for url is

url=https://s3.amazon.com
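
For example (bucket name, credentials, and target path are placeholders; url is omitted so the default is used):

--s3 bucket=my-bucket,key=MY_ACCESS_KEY,secret=MY_SECRET_KEY,target=/data/s3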

-v | --volume 'Source:Container_Mount_Path:[ro]:[nfs-host]'

Volumes to mount into the container.

Examples:

-v /raid/public/john/data:/root/data:ro

Mount /root/data to local path /raid/public/john/data for read-only access.

-v /public/data:/root/data::nfs.example.com

Mount /root/data to NFS path /public/data on NFS server nfs.example.com for read-write access.

Network

--address <string>

Comma separated list of IP addresses to listen to when running with --service-type portforward (default: localhost)

--host-ipc

Use the host's ipc namespace. Controls whether the pod containers can share the host IPC namespace. IPC (POSIX/SysV IPC) namespace provides separation of named shared memory segments, semaphores, and message queues. Shared memory segments are used to accelerate inter-process communication at memory speed, rather than through pipes or the network stack.

For further information see docker run reference documentation.

--host-network

Use the host's network stack inside the container. For further information see docker run reference documentation.

--port <stringArray>

Expose ports from the Job container.

-s | --service-type <string>

External access type to interactive jobs. Options are: portforward, loadbalancer, nodeport, ingress.

Access Control

--allow-privilege-escalation

Allow the job to gain additional privileges after start.

--run-as-user

Run in the context of the current user running the Run:ai command rather than the root user. While the default container user is root (same as in Docker), this command allows you to submit a Job running under your Linux user. This would manifest itself in access to operating system resources, in the owner of new folders created under shared directories, etc. Alternatively, if your cluster is connected to Run:ai via SAML, you can map the container to use the Linux UID/GID which is stored in the organization's directory. For more information see non root containers.

Scheduling

--node-pools <string>

Instructs the scheduler to run this workload using a specific set of nodes which are part of a Node Pool. You can specify one or more node pools to form a prioritized list of node pools that the scheduler will use to find one node pool that can satisfy the workload's specification. To use this feature your Administrator will need to label nodes as explained here: Limit a Workload to a Specific Node Group, or use existing node labels, then create a node-pool and assign the label to the node-pool. This flag can be used in conjunction with node-type and Project-based affinity. In this case, the flag is used to refine the list of allowable node groups set from a node-pool. For more information see: Working with Projects.

--node-type <string>

Allows defining specific Nodes (machines) or a group of Nodes on which the workload will run. To use this feature your Administrator will need to label nodes as explained here: Limit a Workload to a Specific Node Group.

--toleration <string>

Specify one or more toleration criteria, to ensure that the workload is not scheduled onto an inappropriate node. This is done by matching the workload tolerations to the taints defined for each node. For further details see Kubernetes Taints and Tolerations Guide.

The format of the string:

operator=Equal|Exists,key=KEY,[value=VALUE],[effect=NoSchedule|NoExecute|PreferNoSchedule],[seconds=SECONDS]

Global Flags

--loglevel (string)

Set the logging level. One of: debug | info | warn | error (default "info")

--project | -p (string)

Specify the Project to which the command applies. Run:ai Projects are used by the scheduler to calculate resource eligibility. By default, commands apply to the default Project. To change the default Project use runai config project <project-name>.

--help | -h

Show help text.

Output

The command will attempt to submit an mpi Job. You can follow up on the Job by running runai list jobs or runai describe job <job-name>.

See Also


Last update: 2023-07-16
Created: 2020-07-19

runai submit-dist pytorch

Description

Version 2.10 and later.

Submit a distributed PyTorch training run:ai job to run.

Note

To use distributed training you need to have installed the < insert pytorch operator here > as specified < insert pre-requisites link here >.

Syntax notes:

  • Options with a value type of stringArray mean that you can add multiple values. You can either separate values with a comma or add the flag twice.

Examples

runai submit-dist pytorch --name distributed-job --workers=2 -g 1 \
     -i <image_name>

Options

Distributed

--clean-pod-policy < string >

The CleanPodPolicy controls deletion of pods when a job terminates. The policy can be one of the following values:

  • Running—only pods still running when a job completes (for example, parameter servers) will be deleted immediately. Completed pods will not be deleted so that the logs will be preserved. (Default)
  • All—all (including completed) pods will be deleted immediately when the job finishes.
  • None—no pods will be deleted when the job completes.

--max-replicas < int >

Maximum number of replicas for elastic PyTorch job.

--min-replicas < int >

Minimum number of replicas for elastic PyTorch job.
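
Together, the elastic flags might be used as in the following sketch (the replica counts are illustrative):

runai submit-dist pytorch --name distributed-job --min-replicas=2 --max-replicas=4 -g 1 \
     -i <image_name>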

--workers <int>

Number of worker replicas for the distributed training Job.

Naming and Shortcuts

--job-name-prefix <string>

The prefix to use to automatically generate a Job name with an incremental index. When a Job name is omitted Run:ai will generate a Job name. The optional --job-name-prefix flag creates Job names with the provided prefix.

--name <string>

The name of the Job.

--template <string>

Load default values from a workload.

Container Definition

--add-capability <stringArray>

Add linux capabilities to the container.

-a | --annotation <stringArray>

Set annotations variables in the container.

--attach

Default is false. If set to true, wait for the Pod to start running. When the pod starts running, attach to the Pod. The flag is equivalent to the command runai attach.

The --attach flag also sets --tty and --stdin to true.

--command

Overrides the image's entry point with the command supplied after '--'. When not using the --command flag, the entry point will not be overridden and the string after -- will be appended as arguments to the entry point command.

Example:

--command -- run.sh 1 54 will start the container and run run.sh 1 54

-- script.py 10000 will append script.py 10000 as arguments to the entry point command (e.g. python)

--create-home-dir

Create a temporary home directory for the user in the container. Data saved in this directory will not be saved when the container exits. For more information see non root containers.

-e <stringArray> | --environment

Define environment variables to be set in the container. To set multiple values add the flag multiple times (-e BATCH_SIZE=50 -e LEARNING_RATE=0.2).

--image <string> | -i <string>

Image to use when creating the container for this Job

--image-pull-policy <string>

Pulling policy of the image when starting a container. Options are:

  • Always (default): force image pulling to check whether local image already exists. If the image already exists locally and has the same digest, then the image will not be downloaded.
  • IfNotPresent: the image is pulled only if it is not already present locally.
  • Never: the image is assumed to exist locally. No attempt is made to pull the image.

For more information see Kubernetes documentation.

-l | --label <stringArray>

Set labels variables in the container.

--preferred-pod-topology-key <string>

If possible, all pods of this job will be scheduled onto nodes that have a label with this key and identical values.

--required-pod-topology-key <string>

Enforce scheduling pods of this job onto nodes that have a label with this key and identical values.

--stdin

Keep stdin open for the container(s) in the pod, even if nothing is attached.

-t | --tty

Allocate a pseudo-TTY.

--working-dir <string>

Starts the container with the specified directory as the current directory.

Resource Allocation

--cpu <double>

CPU units to allocate for the Job (0.5, 1, etc.). The Job will receive at least this amount of CPU. Note that the Job will not be scheduled unless the system can guarantee this amount of CPUs to the Job.

--cpu-limit <double>

Limitations on the number of CPUs consumed by the Job (for example 0.5, 1). The system guarantees that this Job will not be able to consume more than this amount of CPUs.

--extended-resource

Request access to an extended resource. Syntax: <resource-name>=<resource_quantity>.

-g | --gpu <float>

GPU units to allocate for the Job (0.5, 1).

--gpu-memory

GPU memory to allocate for this Job (1G, 20M, etc.). The Job will receive this amount of memory. Note that the Job will not be scheduled unless the system can guarantee this amount of GPU memory to the Job.

--memory <string>

CPU memory to allocate for this Job (1G, 20M, etc.). The Job will receive at least this amount of memory. Note that the Job will not be scheduled unless the system can guarantee this amount of memory to the Job.

--memory-limit

CPU memory to allocate for this Job (1G, 20M, etc.). The system guarantees that this Job will not be able to consume more than this amount of memory. The Job will receive an error when trying to allocate more memory than this limit.

--mig-profile <string>

MIG profile to allocate for the job (1g.5gb, 2g.10gb, 3g.20gb, 4g.20gb, 7g.40gb)
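
For example, to request one of the listed MIG profiles (a sketch only; whether this can be combined with other GPU flags depends on your cluster setup):

runai submit-dist pytorch --name distributed-job --workers=2 --mig-profile 2g.10gb \
     -i <image_name>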

Job Lifecycle

--backoff-limit <int>

The number of times the Job will be retried before failing. The default is 6. This flag will only work with training workloads (when the --interactive flag is not specified).

Storage

--git-sync <stringArray>

Clone a git repository into the container running the Job. The parameter should follow the syntax: source=REPOSITORY,branch=BRANCH_NAME,rev=REVISION,username=USERNAME,password=PASSWORD,target=TARGET_DIRECTORY_TO_CLONE.

--large-shm

Mount a large /dev/shm device.

--mount-propagation

Enable HostToContainer mount propagation for all container volumes

--nfs-server <string>

Use this flag to specify a default NFS host for --volume flag. Alternatively, you can specify NFS host for each volume individually (see --volume for details).

--pvc [Storage_Class_Name]:Size:Container_Mount_Path:[ro]

--pvc Pvc_Name:Container_Mount_Path:[ro]

Mount a persistent volume claim into a container.

Note

This option is being deprecated from version 2.10 and above. To mount existing or newly created Persistent Volume Claim (PVC), use the parameters --pvc-exists and --pvc-new.

The 2 syntax types of this command are mutually exclusive. You can either use the first or second form, but not a mixture of both.

Storage_Class_Name is a storage class name that can be obtained by running kubectl get storageclasses.storage.k8s.io. This parameter may be omitted if there is a single storage class in the system, or you are using the default storage class.

Size is the volume size you want to allocate. See Kubernetes documentation for how to specify volume sizes

Container_Mount_Path. A path internal to the container where the storage will be mounted

Pvc_Name. The name of a pre-existing Persistent Volume Claim to mount into the container

Examples:

--pvc :3Gi:/tmp/john:ro - Allocate 3GB from the default Storage class. Mount it to /tmp/john as read-only

--pvc my-storage:3Gi:/tmp/john:ro - Allocate 3GB from the my-storage storage class. Mount it to /tmp/john as read-only

--pvc :3Gi:/tmp/john - Allocate 3GB from the default storage class. Mount it to /tmp/john as read-write

--pvc my-pvc:/tmp/john - Use a Persistent Volume Claim named my-pvc. Mount it to /tmp/john as read-write

--pvc my-pvc-2:/tmp/john:ro - Use a Persistent Volume Claim named my-pvc-2. Mount it to /tmp/john as read-only

--pvc-exists <string>

Mount a persistent volume. You must include a claimname and path.

  • claim name—The name of the persistent volume claim. Can be obtained by running

kubectl get pvc

  • path—the path internal to the container where the storage will be mounted

Use the format:

claimname=<CLAIM_NAME>,path=<PATH>

--pvc-new <string>

Mount a persistent volume claim (PVC). If the PVC does not exist, it will be created based on the parameters entered. If a PVC exists, it will be used with its defined attributes and the parameters in the command will be ignored.

  • claim name—The name of the persistent volume claim.
  • storage class—A storage class name that can be obtained by running

kubectl get storageclasses.storage.k8s.io.

storageclass may be omitted if there is a single storage class in the system, or you are using the default storage class.

  • size—The volume size you want to allocate for the PVC when creating it. See Kubernetes documentation to specify volume sizes.
  • accessmode—The description of the desired volume capabilities for the PVC.
  • ro—Mount the PVC with read-only access.
  • ephemeral—The PVC will be created as volatile temporary storage which is only present during the running lifetime of the job.

Use the format:

storageclass=<storageclass>,size=<size>,path=<path>,ro,accessmode=<accessmode>

--s3 <string>

Mount an S3 compatible storage into the container running the job. The parameter should follow the syntax:

bucket=BUCKET,key=KEY,secret=SECRET,url=URL,target=TARGET_PATH

All the fields, except url=URL, are mandatory. Default for url is

url=https://s3.amazon.com

-v | --volume 'Source:Container_Mount_Path:[ro]:[nfs-host]'

Volumes to mount into the container.

Examples:

-v /raid/public/john/data:/root/data:ro

Mount /root/data to local path /raid/public/john/data for read-only access.

-v /public/data:/root/data::nfs.example.com

Mount /root/data to NFS path /public/data on NFS server nfs.example.com for read-write access.

Network

--address <string>

Comma separated list of IP addresses to listen to when running with --service-type portforward (default: localhost)

--host-ipc

Use the host's ipc namespace. Controls whether the pod containers can share the host IPC namespace. IPC (POSIX/SysV IPC) namespace provides separation of named shared memory segments, semaphores, and message queues. Shared memory segments are used to accelerate inter-process communication at memory speed, rather than through pipes or the network stack.

For further information see docker run reference documentation.

--host-network

Use the host's network stack inside the container. For further information see docker run reference documentation.

--port <stringArray>

Expose ports from the Job container.

-s | --service-type <string>

External access type to interactive jobs. Options are: portforward, loadbalancer, nodeport, ingress.

Access Control

--allow-privilege-escalation

Allow the job to gain additional privileges after start.

--run-as-user

Run in the context of the current user running the Run:ai command rather than the root user. While the default container user is root (same as in Docker), this command allows you to submit a Job running under your Linux user. This would manifest itself in access to operating system resources, in the owner of new folders created under shared directories, etc. Alternatively, if your cluster is connected to Run:ai via SAML, you can map the container to use the Linux UID/GID which is stored in the organization's directory. For more information see non root containers.

Scheduling

--node-pools <string>

Instructs the scheduler to run this workload using a specific set of nodes which are part of a Node Pool. You can specify one or more node pools to form a prioritized list of node pools that the scheduler will use to find one node pool that can satisfy the workload's specification. To use this feature your Administrator will need to label nodes as explained here: Limit a Workload to a Specific Node Group, or use existing node labels, then create a node-pool and assign the label to the node-pool. This flag can be used in conjunction with node-type and Project-based affinity. In this case, the flag is used to refine the list of allowable node groups set from a node-pool. For more information see: Working with Projects.

--node-type <string>

Allows defining specific Nodes (machines) or a group of Nodes on which the workload will run. To use this feature your Administrator will need to label nodes as explained here: Limit a Workload to a Specific Node Group.

--toleration <string>

Specify one or more toleration criteria, to ensure that the workload is not scheduled onto an inappropriate node. This is done by matching the workload tolerations to the taints defined for each node. For further details see Kubernetes Taints and Tolerations Guide.

The format of the string:

operator=Equal|Exists,key=KEY,[value=VALUE],[effect=NoSchedule|NoExecute|PreferNoSchedule],[seconds=SECONDS]

Global Flags

--loglevel (string)

Set the logging level. One of: debug | info | warn | error (default "info")

--project | -p (string)

Specify the Project to which the command applies. Run:ai Projects are used by the scheduler to calculate resource eligibility. By default, commands apply to the default Project. To change the default Project use runai config project <project-name>.

--help | -h

Show help text.

Output

The command will attempt to submit a distributed PyTorch Job. You can follow up on the Job by running runai list jobs or runai describe job <job-name>.

See Also

  • See the Quickstart document Running Distributed Training.


Last update: 2023-07-16
Created: 2023-03-07

runai submit-dist xgboost

Description

Submit a distributed XGBoost training run:ai job to run.

Syntax notes:

  • Options with a value type of stringArray mean that you can add multiple values. You can either separate values with a comma or add the flag twice.

Examples

runai submit-dist xgboost --name distributed-job --workers=2 -g 1 \
     -i <image_name>

Options

Distributed

--clean-pod-policy < string >

The CleanPodPolicy controls deletion of pods when a job terminates. The policy can be one of the following values:

  • Running—only pods still running when a job completes (for example, parameter servers) will be deleted immediately. Completed pods will not be deleted so that the logs will be preserved. (Default)
  • All—all (including completed) pods will be deleted immediately when the job finishes.
  • None—no pods will be deleted when the job completes.

--workers <int>

Number of worker replicas for the distributed training Job, as shown in the sketch below.
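
For instance, a minimal distributed XGBoost submission combining the flags in this section might look like (the image name is a placeholder):

runai submit-dist xgboost --name distributed-job --workers=2 -g 1 \
     -i <image_name> --clean-pod-policy All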

Naming and Shortcuts

--job-name-prefix <string>

The prefix to use to automatically generate a Job name with an incremental index. When a Job name is omitted Run:ai will generate a Job name. The optional --job-name-prefix flag creates Job names with the provided prefix.

--name <string>

The name of the Job.

--template <string>

Load default values from a workload.

Container Definition

--add-capability <stringArray>

Add linux capabilities to the container.

-a | --annotation <stringArray>

Set annotations variables in the container.

--attach

Default is false. If set to true, wait for the Pod to start running. When the pod starts running, attach to the Pod. The flag is equivalent to the command runai attach.

The --attach flag also sets --tty and --stdin to true.

--command

Overrides the image's entry point with the command supplied after '--'. When not using the --command flag, the entry point will not be overridden and the string after -- will be appended as arguments to the entry point command.

Example:

--command -- run.sh 1 54 will start the container and run run.sh 1 54

-- script.py 10000 will append script.py 10000 as arguments to the entry point command (e.g. python)

--create-home-dir

Create a temporary home directory for the user in the container. Data saved in this directory will not be saved when the container exits. For more information see non root containers.

-e <stringArray> | --environment

Define environment variables to be set in the container. To set multiple values add the flag multiple times (-e BATCH_SIZE=50 -e LEARNING_RATE=0.2).

--image <string> | -i <string>

Image to use when creating the container for this Job

--image-pull-policy <string>

Pulling policy of the image when starting a container. Options are:

  • Always (default): force image pulling to check whether local image already exists. If the image already exists locally and has the same digest, then the image will not be downloaded.
  • IfNotPresent: the image is pulled only if it is not already present locally.
  • Never: the image is assumed to exist locally. No attempt is made to pull the image.

For more information see Kubernetes documentation.

-l | --label <stringArray>

Set labels variables in the container.

--preferred-pod-topology-key <string>

If possible, all pods of this job will be scheduled onto nodes that have a label with this key and identical values.

--required-pod-topology-key <string>

Enforce scheduling pods of this job onto nodes that have a label with this key and identical values.

--stdin

Keep stdin open for the container(s) in the pod, even if nothing is attached.

-t | --tty

Allocate a pseudo-TTY.

--working-dir <string>

Starts the container with the specified directory as the current directory.

Resource Allocation

--cpu <double>

CPU units to allocate for the Job (0.5, 1, etc.). The Job will receive at least this amount of CPU. Note that the Job will not be scheduled unless the system can guarantee this amount of CPUs to the Job.

--cpu-limit <double>

Limitations on the number of CPUs consumed by the Job (for example 0.5, 1). The system guarantees that this Job will not be able to consume more than this amount of CPUs.

--extended-resource

Request access to an extended resource. Syntax: <resource-name>=<resource_quantity>.

-g | --gpu <float>

GPU units to allocate for the Job (0.5, 1).

--gpu-memory

GPU memory to allocate for this Job (1G, 20M, etc.). The Job will receive this amount of memory. Note that the Job will not be scheduled unless the system can guarantee this amount of GPU memory to the Job.

--memory <string>

CPU memory to allocate for this Job (1G, 20M, etc.). The Job will receive at least this amount of memory. Note that the Job will not be scheduled unless the system can guarantee this amount of memory to the Job.

--memory-limit

CPU memory to allocate for this Job (1G, 20M, etc.). The system guarantees that this Job will not be able to consume more than this amount of memory. The Job will receive an error when trying to allocate more memory than this limit.

--mig-profile <string>

MIG profile to allocate for the job (1g.5gb, 2g.10gb, 3g.20gb, 4g.20gb, 7g.40gb)

Job Lifecycle

--backoff-limit <int>

The number of times the Job will be retried before failing. The default is 6. This flag will only work with training workloads (when the --interactive flag is not specified).

Storage

--git-sync <stringArray>

Clone a git repository into the container running the Job. The parameter should follow the syntax: source=REPOSITORY,branch=BRANCH_NAME,rev=REVISION,username=USERNAME,password=PASSWORD,target=TARGET_DIRECTORY_TO_CLONE.

--large-shm

Mount a large /dev/shm device.

--mount-propagation

Enable HostToContainer mount propagation for all container volumes

--nfs-server <string>

Use this flag to specify a default NFS host for --volume flag. Alternatively, you can specify NFS host for each volume individually (see --volume for details).

--pvc [Storage_Class_Name]:Size:Container_Mount_Path:[ro]

--pvc Pvc_Name:Container_Mount_Path:[ro]

Mount a persistent volume claim into a container.

Note

This option is being deprecated from version 2.10 and above. To mount existing or newly created Persistent Volume Claim (PVC), use the parameters --pvc-exists and --pvc-new.

The 2 syntax types of this command are mutually exclusive. You can either use the first or second form, but not a mixture of both.

Storage_Class_Name is a storage class name that can be obtained by running kubectl get storageclasses.storage.k8s.io. This parameter may be omitted if there is a single storage class in the system, or you are using the default storage class.

Size is the volume size you want to allocate. See Kubernetes documentation for how to specify volume sizes

Container_Mount_Path. A path internal to the container where the storage will be mounted

Pvc_Name. The name of a pre-existing Persistent Volume Claim to mount into the container

Examples:

--pvc :3Gi:/tmp/john:ro - Allocate 3GB from the default Storage class. Mount it to /tmp/john as read-only

--pvc my-storage:3Gi:/tmp/john:ro - Allocate 3GB from the my-storage storage class. Mount it to /tmp/john as read-only

--pvc :3Gi:/tmp/john - Allocate 3GB from the default storage class. Mount it to /tmp/john as read-write

--pvc my-pvc:/tmp/john - Use a Persistent Volume Claim named my-pvc. Mount it to /tmp/john as read-write

--pvc my-pvc-2:/tmp/john:ro - Use a Persistent Volume Claim named my-pvc-2. Mount it to /tmp/john as read-only

--pvc-exists <string>

Mount a persistent volume. You must include a claimname and path.

  • claim name—The name of the persistent volume claim. Can be obtained by running

kubectl get pvc

  • path—the path internal to the container where the storage will be mounted

Use the format:

claimname=<CLAIM_NAME>,path=<PATH>

--pvc-new <string>

Mount a persistent volume claim (PVC). If the PVC does not exist, it will be created based on the parameters entered. If a PVC exists, it will be used with its defined attributes and the parameters in the command will be ignored.

  • claim name—The name of the persistent volume claim.
  • storage class—A storage class name that can be obtained by running

kubectl get storageclasses.storage.k8s.io.

storageclass may be omitted if there is a single storage class in the system, or you are using the default storage class.

  • size—The volume size you want to allocate for the PVC when creating it. See Kubernetes documentation to specify volume sizes.
  • accessmode—The description of the desired volume capabilities for the PVC.
  • ro—Mount the PVC with read-only access.
  • ephemeral—The PVC will be created as volatile temporary storage which is only present during the running lifetime of the job.

Use the format:

storageclass=<storageclass>,size=<size>,path=<path>,ro,accessmode=<accessmode>

--s3 <string>

Mount an S3 compatible storage into the container running the job. The parameter should follow the syntax:

bucket=BUCKET,key=KEY,secret=SECRET,url=URL,target=TARGET_PATH

All the fields, except url=URL, are mandatory. Default for url is

url=https://s3.amazon.com

-v | --volume 'Source:Container_Mount_Path:[ro]:[nfs-host]'

Volumes to mount into the container.

Examples:

-v /raid/public/john/data:/root/data:ro

Mount /root/data to local path /raid/public/john/data for read-only access.

-v /public/data:/root/data::nfs.example.com

Mount /root/data to NFS path /public/data on NFS server nfs.example.com for read-write access.

Network

--address <string>

Comma separated list of IP addresses to listen to when running with --service-type portforward (default: localhost)

--host-ipc

Use the host's ipc namespace. Controls whether the pod containers can share the host IPC namespace. IPC (POSIX/SysV IPC) namespace provides separation of named shared memory segments, semaphores, and message queues. Shared memory segments are used to accelerate inter-process communication at memory speed, rather than through pipes or the network stack.

For further information see docker run reference documentation.

--host-network

Use the host's network stack inside the container. For further information see docker run reference documentation.

--port <stringArray>

Expose ports from the Job container.

-s | --service-type <string>

External access type to interactive jobs. Options are: portforward, loadbalancer, nodeport, ingress.

Access Control

--allow-privilege-escalation

Allow the job to gain additional privileges after start.

--run-as-user

Run in the context of the current user running the Run:ai command rather than the root user. While the default container user is root (same as in Docker), this command allows you to submit a Job running under your Linux user. This would manifest itself in access to operating system resources, in the owner of new folders created under shared directories, etc. Alternatively, if your cluster is connected to Run:ai via SAML, you can map the container to use the Linux UID/GID which is stored in the organization's directory. For more information see non root containers.

Scheduling

--node-pools <string>

Instructs the scheduler to run this workload using a specific set of nodes which are part of a Node Pool. You can specify one or more node pools to form a prioritized list of node pools that the scheduler will use to find one node pool that can satisfy the workload's specification. To use this feature your Administrator will need to label nodes as explained here: Limit a Workload to a Specific Node Group, or use existing node labels, then create a node-pool and assign the label to the node-pool. This flag can be used in conjunction with node-type and Project-based affinity. In this case, the flag is used to refine the list of allowable node groups set from a node-pool. For more information see: Working with Projects.

--node-type <string>

Allows defining specific Nodes (machines) or a group of Nodes on which the workload will run. To use this feature your Administrator will need to label nodes as explained here: Limit a Workload to a Specific Node Group.

--toleration <string>

Specify one or more toleration criteria, to ensure that the workload is not scheduled onto an inappropriate node. This is done by matching the workload tolerations to the taints defined for each node. For further details see Kubernetes Taints and Tolerations Guide.

The format of the string:

operator=Equal|Exists,key=KEY,[value=VALUE],[effect=NoSchedule|NoExecute|PreferNoSchedule],[seconds=SECONDS]

Global Flags

--loglevel (string)

Set the logging level. One of: debug | info | warn | error (default "info")

--project | -p (string)

Specify the Project to which the command applies. Run:ai Projects are used by the scheduler to calculate resource eligibility. By default, commands apply to the default Project. To change the default Project use runai config project <project-name>.

--help | -h

Show help text.

Output

The command will attempt to submit a distributed XGBoost Job. You can follow up on the Job by running runai list jobs or runai describe job <job-name>.

See Also


Last update: 2023-07-16
Created: 2023-03-07

runai submit

Description

Submit a Run:ai Job for execution.

Syntax notes:

  • Flags of type stringArray mean that you can add multiple values. You can either separate values with a comma or add the flag twice.

Examples

All examples assume a Run:ai Project has been setup using runai config project <project-name>.

Start an interactive Job:

runai submit -i ubuntu --interactive --attach -g 1
 

Or

runai submit --name build1 -i ubuntu -g 1 --interactive -- sleep infinity 
 

(see: build Quickstart).

Externalize ports:

runai submit --name build-remote -i rastasheep/ubuntu-sshd:14.04 --interactive \
    --service-type=nodeport --port 30022:22
 

Submit a job using the system autogenerated name to an external URL:

runai submit -i ubuntu --interactive --attach -g 1 --service-type=external-url --port 3745 --custom-url=<destination_url>
 

Submit a job without a name to a system-generated URL:

runai submit -i ubuntu --interactive --attach -g 1 --service-type=external-url --port 3745
 

Submit a Job without a name with a pre-defined prefix and an incremental index suffix:

runai submit --job-name-prefix <prefix> -i gcr.io/run-ai-demo/quickstart -g 1 

Options

Job Type

--interactive

Mark this Job as interactive.

--jupyter

Run a Jupyter notebook using a default image and notebook configuration.

Job Lifecycle

--completions < int >

Number of successful pods required for this job to be completed. Used with HPO.

--parallelism < int >

Number of pods to run in parallel at any given time. Used with HPO.

--preemptible

Interactive preemptible jobs can be scheduled above guaranteed quota but may be reclaimed at any time.

--ttl-after-finish < duration >

The duration, after which a finished job is automatically deleted (e.g. 5s, 2m, 3h).

Naming and Shortcuts

--job-name-prefix <string>

The prefix to use to automatically generate a Job name with an incremental index. When a Job name is omitted Run:ai will generate a Job name. The optional --job-name-prefix flag creates Job names with the provided prefix.

--name <string>

The name of the Job.

--template <string>

Load default values from a workload.

Container Definition

--add-capability <stringArray>

Add linux capabilities to the container.

-a | --annotation <stringArray>

Set annotations variables in the container.

--attach

Default is false. If set to true, wait for the Pod to start running. When the pod starts running, attach to the Pod. The flag is equivalent to the command runai attach.

The --attach flag also sets --tty and --stdin to true.

--command

Overrides the image's entry point with the command supplied after '--'. When not using the --command flag, the entry point will not be overridden and the string after -- will be appended as arguments to the entry point command.

Example:

--command -- run.sh 1 54 will start the container and run run.sh 1 54

-- script.py 10000 will append script.py 10000 as arguments to the entry point command (e.g. python)

--create-home-dir

Create a temporary home directory for the user in the container. Data saved in this directory will not be saved when the container exits. For more information see non root containers.

-e <stringArray> | --environment

Define environment variables to be set in the container. To set multiple values add the flag multiple times (-e BATCH_SIZE=50 -e LEARNING_RATE=0.2).

--image <string> | -i <string>

Image to use when creating the container for this Job

--image-pull-policy <string>

Pulling policy of the image when starting a container. Options are:

  • Always (default): force image pulling to check whether local image already exists. If the image already exists locally and has the same digest, then the image will not be downloaded.
  • IfNotPresent: the image is pulled only if it is not already present locally.
  • Never: the image is assumed to exist locally. No attempt is made to pull the image.

For more information see Kubernetes documentation.

-l | --label <stringArray>

Set labels variables in the container.

--preferred-pod-topology-key <string>

If possible, all pods of this job will be scheduled onto nodes that have a label with this key and identical values.

--required-pod-topology-key <string>

Enforce scheduling pods of this job onto nodes that have a label with this key and identical values.

--stdin

Keep stdin open for the container(s) in the pod, even if nothing is attached.

-t | --tty

Allocate a pseudo-TTY.

--working-dir <string>

Starts the container with the specified directory as the current directory.

Resource Allocation

--cpu <double>

CPU units to allocate for the Job (0.5, 1, etc.). The Job will receive at least this amount of CPU. Note that the Job will not be scheduled unless the system can guarantee this amount of CPUs to the Job.

--cpu-limit <double>

Limitations on the number of CPUs consumed by the Job (for example 0.5, 1). The system guarantees that this Job will not be able to consume more than this amount of CPUs.

--extended-resource <stringArray>

Request access to an extended resource. Syntax: <resource_name>=<resource_quantity>.
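
A hedged example (example.vendor.com/dongle is a hypothetical extended resource name advertised by a device plugin in your cluster):

runai submit --name extended-job -i ubuntu --extended-resource example.vendor.com/dongle=1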

-g | --gpu <float>

GPU units to allocate for the Job (0.5, 1).

--gpu-memory

GPU memory to allocate for this Job (1G, 20M, etc.). The Job will receive this amount of memory. Note that the Job will not be scheduled unless the system can guarantee this amount of GPU memory to the Job.

--memory <string>

CPU memory to allocate for this Job (1G, 20M, etc.). The Job will receive at least this amount of memory. Note that the Job will not be scheduled unless the system can guarantee this amount of memory to the Job.

--memory-limit <string>

CPU memory to allocate for this Job (1G, 20M, etc.). The system guarantees that this Job will not be able to consume more than this amount of memory. The Job will receive an error when trying to allocate more memory than this limit.

--mig-profile <string>

MIG profile to allocate for the job (1g.5gb, 2g.10gb, 3g.20gb, 4g.20gb, 7g.40gb)

Job Lifecycle

--backoff-limit <int>

The number of times the Job will be retried before failing. The default is 6. This flag will only work with training workloads (when the --interactive flag is not specified).

Storage

--git-sync <stringArray>

Clone a git repository into the container running the Job. The parameter should follow the syntax: source=REPOSITORY,branch=BRANCH_NAME,rev=REVISION,username=USERNAME,password=PASSWORD,target=TARGET_DIRECTORY_TO_CLONE.
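
A hedged example (the repository URL, branch, and target directory are placeholders; add rev, username, and password as needed, for example for private repositories):

runai submit --name git-job -i python:3.9 -g 1 \
    --git-sync source=https://github.com/<org>/<repo>.git,branch=main,target=/workspace/code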

--large-shm

Mount a large /dev/shm device.

--mount-propagation

Enable HostToContainer mount propagation for all container volumes

--nfs-server <string>

Use this flag to specify a default NFS host for --volume flag. Alternatively, you can specify NFS host for each volume individually (see --volume for details).

--pvc [Storage_Class_Name]:Size:Container_Mount_Path:[ro]

--pvc Pvc_Name:Container_Mount_Path:[ro]

Mount a persistent volume claim into a container.

Note

This option is being deprecated from version 2.10 and above. To mount existing or newly created Persistent Volume Claim (PVC), use the parameters --pvc-exists and --pvc-new.

The 2 syntax types of this command are mutually exclusive. You can either use the first or second form, but not a mixture of both.

Storage_Class_Name is a storage class name that can be obtained by running kubectl get storageclasses.storage.k8s.io. This parameter may be omitted if there is a single storage class in the system, or you are using the default storage class.

Size is the volume size you want to allocate. See Kubernetes documentation for how to specify volume sizes

Container_Mount_Path. A path internal to the container where the storage will be mounted

Pvc_Name. The name of a pre-existing Persistent Volume Claim to mount into the container

Examples:

--pvc :3Gi:/tmp/john:ro - Allocate 3GB from the default Storage class. Mount it to /tmp/john as read-only

--pvc my-storage:3Gi:/tmp/john:ro - Allocate 3GB from the my-storage storage class. Mount it to /tmp/john as read-only

--pvc :3Gi:/tmp/john - Allocate 3GB from the default storage class. Mount it to /tmp/john as read-write

--pvc my-pvc:/tmp/john - Use a Persistent Volume Claim named my-pvc. Mount it to /tmp/john as read-write

--pvc my-pvc-2:/tmp/john:ro - Use a Persistent Volume Claim named my-pvc-2. Mount it to /tmp/john as read-only

--pvc-exists <string>

Mount a persistent volume. You must include a claimname and path.

  • claim name—The name of the persistent volume claim. Can be obtained by running

kubectl get pvc

  • path—the path internal to the container where the storage will be mounted

Use the format:

claimname=<CLAIM_NAME>,path=<PATH>
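
A hedged example using this format (the claim name and path are placeholders):

runai submit --name pvc-job -i ubuntu -g 1 \
    --pvc-exists claimname=my-claim,path=/data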

--pvc-new <string>

Mount a persistent volume claim (PVC). If the PVC does not exist, it will be created based on the parameters entered. If a PVC exists, it will be used with its defined attributes and the parameters in the command will be ignored.

  • claim name—The name of the persistent volume claim.
  • storage class—A storage class name that can be obtained by running

kubectl get storageclasses.storage.k8s.io.

storageclass may be omitted if there is a single storage class in the system, or you are using the default storage class.

  • size—The volume size you want to allocate for the PVC when creating it. See Kubernetes documentation to specify volume sizes.
  • accessmode—The description of the desired volume capabilities for the PVC.
  • ro—Mount the PVC with read-only access.
  • ephemeral—The PVC will be created as volatile temporary storage which is only present during the running lifetime of the job.

Use the format:

storageclass=<storageclass>,size=<size>,path=<path>,ro,accessmode-rwm
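
A hedged example following the format above (the storage class name, size, and path are placeholders; verify the exact set of accepted keys with runai submit --help):

runai submit --name pvc-new-job -i ubuntu -g 1 \
    --pvc-new storageclass=my-storage,size=3Gi,path=/data,ro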

--s3 <string>

Mount an S3 compatible storage into the container running the job. The parameter should follow the syntax:

bucket=BUCKET,key=KEY,secret=SECRET,url=URL,target=TARGET_PATH

All the fields, except url=URL, are mandatory. Default for url is

url=https://s3.amazon.com
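
A hedged example (the bucket name, credentials, and target path are placeholders; url is omitted so the default above is used):

runai submit --name s3-job -i ubuntu -g 1 \
    --s3 bucket=my-bucket,key=<ACCESS_KEY>,secret=<SECRET_KEY>,target=/data/s3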

-v | --volume 'Source:Container_Mount_Path:[ro]:[nfs-host]'

Volumes to mount into the container.

Examples:

-v /raid/public/john/data:/root/data:ro

Mount the local path /raid/public/john/data into the container at /root/data for read-only access.

-v /public/data:/root/data::nfs.example.com

Mount the NFS path /public/data from the NFS server nfs.example.com into the container at /root/data for read-write access.

Network

--address <string>

Comma separated list of IP addresses to listen to when running with --service-type portforward (default: localhost)

--host-ipc

Use the host's ipc namespace. Controls whether the pod containers can share the host IPC namespace. IPC (POSIX/SysV IPC) namespace provides separation of named shared memory segments, semaphores, and message queues. Shared memory segments are used to accelerate inter-process communication at memory speed, rather than through pipes or the network stack.

For further information see docker run reference documentation.

--host-network

Use the host's network stack inside the container. For further information see docker run reference documentation.

--port <stringArray>

Expose ports from the Job container.

-s | --service-type <string>

External access type to interactive jobs. Options are:

  • portforward (deprecated)
  • loadbalancer
  • nodeport
  • external-url

--custom-url <string>

An optional argument that specifies a custom URL when using the external URL service type. If not provided, the system will generate a URL automatically.

Access Control

--allow-privilege-escalation

Allow the job to gain additional privileges after start.

--run-as-user

Run in the context of the current user running the Run:ai command rather than the root user. While the default container user is root (same as in Docker), this command allows you to submit a Job running under your Linux user. This would manifest itself in access to operating system resources, in the owner of new folders created under shared directories, etc. Alternatively, if your cluster is connected to Run:ai via SAML, you can map the container to use the Linux UID/GID which is stored in the organization's directory. For more information see non root containers.

Scheduling

--node-pools <string>

Instructs the scheduler to run this workload using a specific set of nodes which are part of a Node Pool. You can specify one or more node pools to form a prioritized list of node pools that the scheduler will use to find one node pool that can provide the workload's specification. To use this feature your Administrator will need to label nodes as explained here: Limit a Workload to a Specific Node Group or use existing node labels, then create a node-pool and assign the label to the node-pool. This flag can be used in conjunction with node-type and Project-based affinity. In this case, the flag is used to refine the list of allowable node groups set from a node-pool. For more information see: Working with Projects.

--node-type <string>

Allows defining specific Nodes (machines) or a group of Nodes on which the workload will run. To use this feature your Administrator will need to label nodes as explained here: Limit a Workload to a Specific Node Group.

--toleration <string>

Specify one or more toleration criteria, to ensure that the workload is not scheduled onto an inappropriate node. This is done by matching the workload tolerations to the taints defined for each node. For further details see Kubernetes Taints and Tolerations Guide.

The format of the string:

operator=Equal|Exists,key=KEY,[value=VALUE],[effect=NoSchedule|NoExecute|PreferNoSchedule],[seconds=SECONDS]
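
A hedged example (the taint key and value below are hypothetical; use the taints actually defined on your nodes):

runai submit --name tolerant-job -i ubuntu -g 1 \
    --toleration "operator=Equal,key=gpu-type,value=a100,effect=NoSchedule"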

Global Flags

--loglevel (string)

Set the logging level. One of: debug | info | warn | error (default "info")

--project | -p (string)

Specify the Project to which the command applies. Run:ai Projects are used by the scheduler to calculate resource eligibility. By default, commands apply to the default Project. To change the default Project use runai config project <project-name>.

--help | -h

Show help text.

Output

The command will attempt to submit a Job. You can follow up on the Job by running runai list jobs or runai describe job <job-name>.

Note that the submit call may use a policy to provide defaults to any of the above flags.

See Also


Last update: 2023-06-19
Created: 2020-07-19

Adding, Updating and Deleting Users

Introduction

The Run:ai User Interface allows the creation of Run:ai Users. Run:ai Users can receive varying levels of access to the Administration UI and submit Jobs on the Cluster.

Tip

It is possible to connect the Run:ai user interface to the organization's directory and use single sign-on. This allows you to set Run:ai roles for users and groups from the organizational directory. For further information see single sign-on configuration.

Working with Users

You can create users, as well as update and delete users.

Create a User

Note

To be able to review, add, update and delete users, you must have Administrator access. If you do not have such access, please contact an Administrator.

Department Admin is available in version 2.10 and later.

  1. Log in to the Users area of the Run:ai User interface at company-name.run.ai.
  2. Select the Users tab for local users, or the SSO Users tab for SSO users.
  3. On the top right, select "NEW USER".
  4. Enter the user's email.
  5. Select Roles. More than one role can be selected. Available roles are:

    • Administrator—Can manage Users and install Clusters.
    • Editor—Can manage Projects and Departments.
    • Viewer—View-only access to the Run:ai User Interface.
    • Researcher—Can submit ML workloads. Setting a user as a Researcher also requires assigning the user to projects.
    • Research Manager—Can act as Researcher in all projects, including new ones to be created in the future.
    • ML Engineer—Can view and manage deployments and cluster resources. Available only when Inference module is installed.
    • Department Administrator—Can manage Departments, descendent Projects and Workloads.

    For more information, see Roles and permissions.

  6. (Optional) Select Cluster(s). This determines what Clusters are accessible to this User.

  7. Press "Save".

You will get the new user credentials and have the option to send the credentials by email.

Roles and permissions

Roles provide a way to group permissions and assign them to either users or user groups. The role identifies the collection of permissions that administrators assign to users or user groups. Permissions define the actions that users can perform on the managed entities. The following table shows the default roles and permissions.

Managed Entity / Roles | Admin | Dep. Admin | Editor | Research Manager | Researcher | ML Eng. | Viewer
Assign (Settings) Users/Groups/Apps to Roles | CRUD (all roles) | CRUD (Proj. Researchers and ML Engineers only) | N/A | N/A | N/A | N/A | N/A
Assign Users/Groups/Apps to Organizations | R (Projects, Departments) | CRUD (Projects only) | CRUD (Projects, Departments) | N/A | N/A | N/A | N/A
Departments | R | R | CRUD | N/A | N/A | R | R
Projects | R | CRUD | CRUD | R | R | R | R
Jobs | R | R | R | R | CRUD | N/A | R
Deployments | R | R | R | N/A | N/A | CRUD | R
Workspaces | R | R | R | R | CRUD | N/A | N/A
Environments | CRUD | CRUD | CRUD | CRUD | CRUD | N/A | N/A
Data Sources | CRUD | CRUD | CRUD | CRUD | CRUD | N/A | N/A
Compute Resources | CRUD | CRUD | CRUD | CRUD | CRUD | N/A | N/A
Templates | CRUD | CRUD | CRUD | CRUD | CRUD | N/A | N/A
Clusters | CRUD | N/A | R | N/A | N/A | R | R
Node Pools | CRUD | N/A | R | N/A | N/A | R | R
Nodes | R | N/A | R | N/A | N/A | R | R
Settings (General, Credentials) | CRUD | N/A | N/A | N/A | N/A | N/A | N/A
Events History | R | N/A | N/A | N/A | N/A | N/A | N/A
Dashboard.Overview | R | R | R | R | R | R | R
Dashboards.Analytics | R | R | R | R | R | R | R
Dashboards.Consumption | R | N/A | N/A | N/A | N/A | N/A | N/A

Permissions: C = Create, R = Read, U = Update, D = Delete


Last update: 2023-05-23
Created: 2020-07-16

Prerequisites

Below are the prerequisites of a cluster installed with Run:ai.

Prerequisites in a Nutshell

The following is a checklist of the Run:ai prerequisites:

Prerequisite | Details
Kubernetes | Verify certified vendor and correct version.
NVIDIA GPU Operator | Different Kubernetes flavors have slightly different setup instructions. Verify correct version.
Ingress Controller | Install and configure NGINX (some Kubernetes flavors have NGINX pre-installed).
Prometheus | Install Prometheus.
Trusted domain name | You must provide a trusted domain name. Accessible only inside the organization.
(Optional) Distributed Training | Install Kubeflow Training Operator if required.
(Optional) Inference | Some third party software needs to be installed to use the Run:ai inference module.

There are also specific hardware, operating system and network access requirements. A pre-install script is available to test if the prerequisites are met before installation.

Software Requirements

Operating System

  • Run:ai will work on any Linux operating system that is supported by both Kubernetes and NVIDIA.
  • An important highlight is that GKE (Google Kubernetes Engine) will only work with Ubuntu, as NVIDIA does not support the default Container-Optimized OS with Containerd image.
  • Run:ai performs its internal tests on Ubuntu 20.04 and CoreOS for OpenShift.

Kubernetes

Run:ai requires Kubernetes. Run:ai has been certified with the following Kubernetes distributions:

Kubernetes Distribution | Description | Installation Notes
Vanilla Kubernetes | Using no specific distribution but rather k8s native installation | See instructions for a simple (non-production-ready) Kubernetes Installation script.
OCP | OpenShift Container Platform | The Run:ai operator is certified for OpenShift by Red Hat.
EKS | Amazon Elastic Kubernetes Service |
AKS | Azure Kubernetes Services |
GKE | Google Kubernetes Engine |
RKE | Rancher Kubernetes Engine | When installing Run:ai, select On Premise. RKE2 has a defect which requires a specific installation flow. Please contact Run:ai customer support for additional details.
Bright | NVIDIA Bright Cluster Manager | In addition, NVIDIA DGX comes bundled with Run:ai

Run:ai has been tested with the following Kubernetes distributions. Please contact Run:ai Customer Support for up to date certification details:

Kubernetes Distribution | Description | Installation Notes
Ezmeral | HPE Ezmeral Container Platform | See Run:ai at Ezmeral marketplace
Tanzu | VMWare Kubernetes | Tanzu supports containerd rather than docker. See the NVIDIA prerequisites below as well as cluster customization for changes required for containerd

Following is a Kubernetes support matrix for the latest Run:ai releases:

Run:ai version | Supported Kubernetes versions | Supported OpenShift versions
Run:ai 2.9 | 1.21 through 1.26 | 4.8 through 4.11
Run:ai 2.10 | 1.21 through 1.26 (see note below) | 4.8 through 4.11
Run:ai 2.12 | 1.23 through 1.27 (see note below) | 4.10 through 4.12
Run:ai 2.13 | 1.23 through 1.27 (see note below) | 4.10 through 4.12

Note

Run:ai allows scheduling of Jobs with PVCs. See for example the command-line interface flag --pvc-new. A Job scheduled with a PVC based on a specific type of storage class (a storage class with the property volumeBindingMode equals to WaitForFirstConsumer) will not work on Kubernetes 1.23 or lower.

For an up-to-date end-of-life statement of Kubernetes see Kubernetes Release History.

Run:ai does not support Pod Security Admission. Support for Pod Security Policy has been removed with Run:ai 2.9.

NVIDIA

Run:ai has been certified on NVIDIA GPU Operator 22.9 to 23.3. Older versions (1.10 and 1.11) have a documented NVIDIA issue.

Follow the Getting Started guide to install the NVIDIA GPU Operator, or see the distribution-specific instructions below:

  • When setting up EKS, do not install the NVIDIA device plug-in (as we want the NVIDIA GPU Operator to install it instead). When using the eksctl tool to create an AWS EKS cluster, use the flag --install-nvidia-plugin=false to disable this install.
  • Follow the Getting Started guide to install the NVIDIA GPU Operator. For GPU nodes, EKS uses an AMI which already contains the NVIDIA drivers. As such, you must use the GPU Operator flags: --set driver.enabled=false.
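
As a hedged sketch of the install referenced above (the Helm repository URL and chart name are taken from NVIDIA's Getting Started guide; verify them against the current guide), an EKS installation that keeps the AMI's pre-installed drivers could look like:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install gpu-operator nvidia/gpu-operator \
    -n gpu-operator --create-namespace \
    --set driver.enabled=false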

Create the gpu-operator namespace by running

kubectl create ns gpu-operator
 

Before installing the GPU Operator you must create the following file:

resourcequota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  ...
      values:
      - system-node-critical
      - system-cluster-critical

Then run: kubectl apply -f resourcequota.yaml

Important

  • Run:ai on GKE has only been tested with GPU Operator version 22.9 and up.
  • The above only works for Run:ai 2.7.16 and above.

Install the NVIDIA GPU Operator as discussed here.

Notes

  • Use the default namespace gpu-operator. Otherwise, you must specify the target namespace using the flag runai-operator.config.nvidiaDcgmExporter.namespace as described in customized cluster installation.
  • NVIDIA drivers may already be installed on the nodes. In such cases, use the NVIDIA GPU Operator flags --set driver.enabled=false. DGX OS is one such example as it comes bundled with NVIDIA Drivers.
  • To use Dynamic MIG, the GPU Operator must be installed with the flag mig.strategy=mixed. If the GPU Operator is already installed, edit the clusterPolicy by running kubectl patch clusterPolicy cluster-policy -n gpu-operator --type=merge -p '{"spec":{"mig":{"strategy": "mixed"}}}'

Ingress Controller

Run:ai requires an ingress controller as a prerequisite. The Run:ai cluster installation configures one or more ingress objects on top of the controller.

There are many ways to install and configure an ingress controller and configuration is environment-dependent. A simple solution is to install & configure NGINX:

helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm upgrade -i nginx-ingress ingress-nginx/ingress-nginx \
    --namespace nginx-ingress --create-namespace

kubectl create secret tls runai-cluster-domain-tls-secret -n runai \
    --cert /path/to/fullchain.pem \ # (1)
    --key /path/to/private.pem # (2)
  1. The domain's cert (public key).
  2. The domain's private key.

For more information on how to create a TLS secret see: https://kubernetes.io/docs/concepts/configuration/secret/#tls-secrets.

Note

In a self-hosted installation, the typical scenario is to install the first Run:ai cluster on the same Kubernetes cluster as the control plane. In this case, the cluster URL need not be provided as it will be the same as the control-plane URL.

Prometheus

If not already installed on your cluster, install the full kube-prometheus-stack through the Prometheus community Operator.

Note

  • If Prometheus has been installed on the cluster in the past, even if it was uninstalled (such as when upgrading from Run:ai 2.8 or lower), you will need to update Prometheus CRDs as described here. For more information on the Prometheus bug see here.
  • If you are running Kubernetes 1.21, you must install a Prometheus stack version of 45.23.0 or lower. Use the --version flag below. Alternatively, use Helm version 3.12 or later. For more information on the related Prometheus bug see here

Then install the Prometheus stack by running:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
 helm repo update
 helm install prometheus prometheus-community/kube-prometheus-stack \
     -n monitoring --create-namespace --set grafana.enabled=false # (1)
  1. The Grafana component is not required for Run:ai.
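
If you are on Kubernetes 1.21 and pinning the chart version as described in the note above, a hedged variant of the same install command would be:

helm install prometheus prometheus-community/kube-prometheus-stack \
    --version 45.23.0 -n monitoring --create-namespace --set grafana.enabled=false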

Optional Software Requirements

The following software enables specific features of Run:ai

Distributed Training

Run:ai supports three different methods of running distributed training jobs across multiple nodes:

  • MPI
  • TensorFlow
  • PyTorch

To install all three, run the following:

kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.5.0"
 

Inference

To use the Run:ai inference module you must pre-install Knative Serving. Follow the instructions here to install. Run:ai is certified on Knative 1.4 to 1.8 with Kubernetes 1.22 or later.

Post-install, you must configure Knative to use the Run:ai scheduler and allow pod affinity, by running:

kubectl patch configmap/config-features \
   --namespace knative-serving \
   --type merge \
   --patch '{"data":{"kubernetes.podspec-schedulername":"enabled","kubernetes.podspec-affinity":"enabled"}}'
 

Inference Autoscaling

Run:ai allows you to autoscale a deployment according to various metrics:

  1. GPU Utilization (%)
  2. CPU Utilization (%)
  3. Latency (milliseconds)
  4. Throughput (requests/second)
  5. Concurrency
  6. Any custom metric

Additional installation may be needed for some of the metrics as follows:

  • Using Throughput or Concurrency does not require any additional installation.
  • Any other metric will require installing the HPA Autoscaler.
  • Using GPU Utilization, Latency or Custom metric will also require the Prometheus adapter. The Prometheus adapter is part of the Run:ai installer and can be added by setting the prometheus-adapter.enabled flag to true. See Customizing the Run:ai installation for further information.
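
A hedged sketch of passing this flag when installing or upgrading the Run:ai cluster with Helm (the release name, chart reference, and namespace below are placeholders; use the values from your own installation command, or set the flag in your values file instead):

helm upgrade -i runai-cluster <runai-cluster-chart> -n runai \
    --set prometheus-adapter.enabled=true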

If you wish to use an existing Prometheus adapter installation, you will need to configure it manually with the Run:ai Prometheus rules, specified in the Run:ai chart values under the prometheus-adapter.rules field. For further information, please contact Run:ai customer support.

Accessing Inference from outside the Cluster

Inference workloads will typically be accessed by consumers residing outside the cluster. You will hence want to provide consumers with a URL to access the workload. The URL can be found in the Run:ai user interface under the deployment screen (alternatively, run kubectl get ksvc -n <project-namespace>).

However, for the URL to be accessible outside the cluster you must configure your DNS as described here.

Alternative Configuration

When the above DNS configuration is not possible, you can manually add the Host header to the REST request as follows:

  • Get an <external-ip> by running kubectl get service -n kourier-system kourier. If you have been using istio during Run:ai installation, run: kubectl -n istio-system get service istio-ingressgateway instead.
  • Send a request to your workload by using the external ip, and place the workload url as a Host header. For example
curl http://<external-ip>/<container-specific-path>
     -H 'Host: <host-name>'

Hardware Requirements

(see picture below)

  • (Production only) Run:ai System Nodes: To reduce downtime and save CPU cycles on expensive GPU Machines, we recommend that production deployments contain two or more worker machines designated for Run:ai software. The nodes do not have to be dedicated to Run:ai, but for Run:ai purposes we would need:

    • 8 CPUs
    • 16GB of RAM
    • 50GB of Disk space
  • Shared data volume: Run:ai uses Kubernetes to abstract away the machine on which a container is running:

    • Researcher containers: The Researcher's containers need to be able to access data from any machine in a uniform way, to access training data and code as well as save checkpoints, weights, and other machine-learning-related artifacts.
    • The Run:ai system needs to save data on a storage device that is not dependent on a specific node.

    Typically, this is achieved via Kubernetes Storage class based on Network File Storage (NFS) or Network-attached storage (NAS).

  • Docker Registry: With Run:ai, Workloads are based on Docker images. For container images to run on any machine, these images must be downloaded from a docker registry rather than reside on the local machine (though this is also possible). You can use a public registry such as Docker Hub or set up a local registry on-prem (preferably on a dedicated machine). Run:ai can assist with setting up the repository.

  • Kubernetes: Production Kubernetes installation requires separate nodes for the Kubernetes master. For more details see your specific Kubernetes distribution documentation.

img/prerequisites.png

User requirements

Usage of containers and images: The individual Researcher's work must be based on container images.

Network Access Requirements

Internal networking: Kubernetes networking is an add-on rather than a core part of Kubernetes. Different add-ons have different network requirements. You should consult the documentation of the specific add-on on which ports to open. It is however important to note that unless special provisions are made, Kubernetes assumes all cluster nodes can interconnect using all ports.

Outbound network: Run:ai user interface runs from the cloud. All container nodes must be able to connect to the Run:ai cloud. Inbound connectivity (connecting from the cloud into nodes) is not required. If outbound connectivity is limited, the following exceptions should be applied:

During Installation

Run:ai requires an installation over the Kubernetes cluster. The installation accesses the web to download various images from registries. Some organizations place limitations on what you can pull from the internet. The following list shows the various solution components and their origin:

Name | Description | URLs | Ports
Run:ai Repository | Run:ai Helm Package Repository | runai-charts.storage.googleapis.com | 443
Docker Images Repository | Run:ai images | gcr.io/run-ai-prod | 443
Docker Images Repository | Third party Images | hub.docker.com and quay.io | 443
Run:ai | Run:ai Cloud instance | app.run.ai | 443, 53

Post Installation

In addition, once running, Run:ai requires an outbound network connection to the following targets:

Name | Description | URLs | Ports
Grafana | Grafana Metrics Server | prometheus-us-central1.grafana.net and runailabs.com | 443
Run:ai | Run:ai Cloud instance | app.run.ai | 443, 53

Network Proxy

If you are using a proxy for outbound communication, please contact Run:ai customer support.

Pre-install Script

Once you believe that the Run:ai prerequisites are met, we highly recommend installing and running the Run:ai pre-install diagnostics script. The tool:

  • Tests the below requirements as well as additional failure points related to Kubernetes, NVIDIA, storage, and networking.
  • Looks at additional components installed and analyzes their relevance to a successful Run:ai installation.

To use the script, download the latest version and run:

chmod +x preinstall-diagnostics-<platform>
 ./preinstall-diagnostics-<platform>

If the script fails, or if the script succeeds but the Kubernetes system contains components other than Run:ai, locate the file runai-preinstall-diagnostics.txt in the current directory and send it to Run:ai technical support.

For more information on the script including additional command-line flags, see here.


Last update: 2023-07-25
Created: 2020-07-19


Run:ai version 2.13

Version 2.13.7

Release date

July 2023

Release content

  • Added filters to the historic quota ratio widget on the Quota management dashboard.

Fixed issues

Internal ID Description
RUN-11080 Fixed an issue in OpenShift environments where logging in via SSO with the kubeadmin user gets blank pages for every page.
RUN-11119 Fixed an issue where values that should be in the Order of priority column are in the wrong column.
RUN-11120 Fixed an issue where the Projects table does not show correct metrics when Run:ai version 2.13 is paired with a Run:ai 2.8 cluster.
RUN-11121 Fixed an issue where the wrong over quota memory alert is shown in the Quota management pane in project edit form.
RUN-11272 Fixed an issue in OpenShift environments where the selection in the cluster drop down in the main UI does not match the cluster selected on the login page.
Version 2.13.4

Release date

July 2023

Fixed issues

Internal ID Description
RUN-11089 Fixed an issue where, when creating an environment, commands in the Runtime settings pane are not persistent and cannot be found in other assets (for example, in a new Training).

Version 2.13.1

Release date

July 2023

Release content

  • Made an improvement so that occurrences of labels that are no longer in use are deleted.

Fixed issues

N/A

Version 2.13.0

Release content

This version contains features and fixes from previous versions starting with 2.9. For information about features, functionality, and fixed issues in previous versions, see the release notes for those versions.

Projects

  • Improved the Projects UI for ease of use. Projects now follows UI upgrades and changes designed to make setting up components and assets easier for administrators and researchers. To configure a project, see Projects.

Dashboards

  • Added a new dashboard for Quota management, which provides an efficient means to monitor and manage resource utilization within the AI cluster. The dashboard filters the display of resource quotas based on Departments, Projects, and Node pools. For more information, see Quota management dashboard.

  • Added to the Overview dashboard, the ability to filter the cluster by one or more node pools. For more information, see Node pools.

Nodes and Node pools

  • The Run:ai scheduler supports two scheduling strategies: Bin Packing (default) and Spread. For more information, see Scheduling strategies. You can configure the scheduling strategy at the node pool level to improve support for clusters with mixed types of resources and workloads. For configuration information, see Creating new node pools.

  • GPU device-level DCGM metrics are collected per GPU and presented by Run:ai in the Nodes table. Each node contains a list of its embedded GPUs with their respective DCGM metrics. See DCGM Metrics for the list of metrics provided by NVIDIA DCGM and collected by Run:ai. Contact your Run:ai customer representative to enable this feature.

  • Added per node pool over-quota priority. Over-quota priority sets the relative amount of additional unused resources that an asset can get above its current quota. For more information, see Over-quota priority.
  • Added support for associating workspaces with node pools. The association between workspaces and node pools is done using the Compute resources section. To associate a compute resource with a node pool, in the Compute resource section, press More settings. Press Add new to add more node pools to the configuration. Drag and drop the node pools to set their priority.
  • Added Node pool selection to the workload submission form. This allows researchers to quickly see the available node pools and set their priority by dragging and dropping them in the desired order. In addition, when the node pool priority list is locked by a policy, the list is not editable by the Researcher, even if the workspace is created from a template or copied from another workspace.

Time limit duration

  • Improved the behavior of any workload time limit (for example, Idle time limit) so that the time limit now affects existing workloads that were created before the time limit was configured. This optional feature helps handle situations where researchers leave sessions open even when they no longer need the resources. For more information, see Limit duration of interactive training jobs.

  • Improved workspace time limits. Workspaces that reach a time limit now transition to a Stopped state so that they can be reactivated later.

  • Added time limits for training jobs per project. Administrators (Department Admin, Editor) can limit the duration of Run:ai training jobs per project using a specified time limit value. This capability helps administrators limit the duration and resources consumed over time by training jobs in specific projects. Each training job that reaches this duration is terminated.

Workload assets

  • Extended the collaboration functionality for workload assets such as Environment, Compute resource, and some Data source types. These assets can now be shared with Departments in the organization, in addition to being shared with specific projects or the entire cluster.
  • Added a search box for card galleries in any asset-based workload creation form to provide an easy way to search for assets and resources. To filter, use the asset name or one of the field values of the card.

PVC data sources

  • Added support for PVC block storage in the New data source form. In the New data source form for a new PVC data source, in the Volume mode field, select either Filesystem or Block. For more information, see Create a PVC data source.
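In Kubernetes terms, a block-mode PVC is simply a claim whose volumeMode is set to Block instead of the default Filesystem. The generic sketch below is for orientation only; the claim name, namespace, storage class, and size are illustrative placeholders, not values tied to a specific Run:ai data source.

    # Generic Kubernetes PVC requesting raw block storage (illustrative values only).
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: block-data-pvc          # hypothetical claim name
      namespace: runai-my-project   # hypothetical project namespace
    spec:
      accessModes:
        - ReadWriteOnce
      volumeMode: Block             # "Block" instead of the default "Filesystem"
      storageClassName: standard    # assumed storage class
      resources:
        requests:
          storage: 50Gi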

Credentials

  • Added Docker registry to the Credentials menu. Users can create Docker registry credentials for use in specific projects for image pulling. To configure credentials, see Configuring credentials.
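Registry credentials of this kind are commonly backed by a Kubernetes docker-registry secret. The generic kubectl sketch below is for illustration only and is not the Run:ai UI flow described above; the secret name, namespace, and registry details are placeholders.

    # Generic Kubernetes docker-registry secret (placeholder values).
    kubectl create secret docker-registry my-registry-cred \
        --namespace runai-my-project \
        --docker-server=registry.example.com \
        --docker-username=<username> \
        --docker-password=<password>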

Policies

  • Improved policy support by adding a DEFAULTS section to the items section of the policy. The DEFAULTS section sets the default behavior for items declared in this section. For example, it can be used to limit the submission of workloads to existing PVCs only (see the sketch after this list). For more information and an example, see Policies, Complex values.
  • Added support for making a PVC data source available to all projects. In the New data source form, when creating a new PVC data source, select All from the Project pane.
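The authoritative policy schema is documented in Policies, Complex values; the fragment below is only a hypothetical sketch of how a DEFAULTS entry for PVC items might look, and every key name and value in it is an assumption rather than the documented format.

    # Hypothetical policy fragment -- key names are assumptions; see Policies, Complex values.
    pvcs:
      items:
        DEFAULTS:
          existingPvc:
            value: true      # default each PVC item to an existing PVC
            canEdit: false   # researchers cannot override, so only existing PVCs can be submitted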

Researcher API

Integrations

  • Added support for Ray jobs. Ray is an open-source unified framework for scaling AI and Python applications. For more information, see Integrate Run:ai with Ray.

  • Added integration with Weights & Biases Sweep to allow data scientists to submit hyperparameter optimization workloads directly from the Run:ai UI (a generic sweep configuration sketch appears after this list). To configure a sweep, see Sweep configuration.

  • Added support for XGBoost. XGBoost, which stands for Extreme Gradient Boosting, is a scalable, distributed gradient-boosted decision tree (GBDT) machine learning library. It provides parallel tree boosting and is the leading machine learning library for regression, classification, and ranking problems. For more information, see runai submit-dist xgboost.
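For orientation, a generic Weights & Biases sweep configuration is shown below; the program name, metric, and parameter ranges are illustrative, and the Run:ai-specific submission steps are those described in Sweep configuration.

    # Generic W&B sweep configuration (illustrative program, metric, and parameters).
    program: train.py
    method: bayes
    metric:
      name: val_loss
      goal: minimize
    parameters:
      learning_rate:
        min: 0.0001
        max: 0.1
      batch_size:
        values: [32, 64, 128]

Following the pattern of the other runai submit-dist commands, a distributed XGBoost submission might look like the sketch below; the job name, worker count, and image are placeholders, and runai submit-dist xgboost remains the authoritative reference for the supported flags.

    # Illustrative submission of a distributed XGBoost job (placeholder values).
    runai submit-dist xgboost --name xgboost-dist-job --workers=2 -g 1 \
        -i <image_name>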

Compatibility

Installation

  • The manual process of upgrading Kubernetes CRDs is no longer needed when upgrading to the most recent version (2.13) of Run:ai.
  • From Run:ai 2.12 and above, the control-plane installation has been simplified and no longer requires the creation of a backend values file. Instead, install directly using helm as described in Install the Run:ai Control Plane (an illustrative sketch follows this list).
  • From Run:ai 2.12 and above, the air-gapped, control-plane installation now generates a custom-env.yaml values file during the preparation stage. This is used when installing the control-plane.
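For orientation only, a minimal sketch of what a helm-based control-plane installation can look like. The repository URL, chart name, release name, and values below are placeholders and assumptions, not the documented procedure; follow Install the Run:ai Control Plane for the actual commands.

    # Placeholder repository URL and chart name -- not the documented values.
    helm repo add runai-backend <runai-backend-helm-repo-url>
    helm repo update

    # Connected installation: no backend values file is needed.
    helm upgrade -i runai-backend runai-backend/<control-plane-chart> \
        -n runai-backend --create-namespace \
        --set global.domain=<control-plane-fqdn>

    # Air-gapped installation: pass the custom-env.yaml generated during preparation.
    helm upgrade -i runai-backend runai-backend/<control-plane-chart> \
        -n runai-backend --create-namespace \
        -f custom-env.yaml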

Known issues

Internal ID Description
RUN-11005 Incorrect error messages when trying to run runai CLI commands in an OpenShift environment.
RUN-11009 Incorrect error message when a user without permissions tries to delete another user.

Fixed issues

Internal ID Description
RUN-9039 Fixed an issue in the New job screen where, after toggling off the preemptible flag and submitting the job, the job was still shown as preemptible.
RUN-9323 Fixed an issue where a non-scalable error message was shown when scheduling hundreds of nodes was unsuccessful.
RUN-9324 Fixed an issue where the scheduler did not take the amount of storage into consideration, so there was no explanation that the PVC was not ready.
RUN-9902 Fixed an issue in OpenShift environments where there were no metrics in the dashboard because Prometheus did not have permissions to monitor the runai namespace after an installation or upgrade to version 2.9.
RUN-9920 Fixed an issue where the canEdit key in a policy was not validated properly for itemized fields when configuring an interactive policy.
RUN-10052 Fixed an issue where loading a new job from a template produced an error until changes were made on the form.
RUN-10053 Fixed an issue where the Node pool column was unsearchable in the job list.
RUN-10422 Fixed an issue where node details showed running workloads that had actually finished (successfully, failed, and so on).
RUN-10500 Fixed an issue where jobs were shown as running even though they no longer existed in the cluster.
RUN-10813 Fixed an issue when adding a data source where the path was case sensitive and did not allow uppercase characters.
