diff --git a/docs/Researcher/cli-reference/runai-submit-dist-TF.md b/docs/Researcher/cli-reference/runai-submit-dist-TF.md index e71ba4d931..824a926b4e 100644 --- a/docs/Researcher/cli-reference/runai-submit-dist-TF.md +++ b/docs/Researcher/cli-reference/runai-submit-dist-TF.md @@ -5,7 +5,7 @@ Submit a distributed TensorFlow training run:ai job to run. !!! Note - To use distributed training you need to have installed the < insert TensorFlow operator here > as specified < insert pre-requisites link here >. + To use distributed training you need to have installed the TensorFlow operator as specified in [Distributed training](../../admin/runai-setup/cluster-setup/cluster-prerequisites.md#distributed-training). Syntax notes: diff --git a/docs/Researcher/cli-reference/runai-submit-dist-mpi.md b/docs/Researcher/cli-reference/runai-submit-dist-mpi.md index 83de206581..04e591b360 100644 --- a/docs/Researcher/cli-reference/runai-submit-dist-mpi.md +++ b/docs/Researcher/cli-reference/runai-submit-dist-mpi.md @@ -3,7 +3,7 @@ Submit a Distributed Training (MPI) Run:ai Job to run. !!! Note - To use distributed training you need to have installed the Kubeflow MPI Operator as specified [here](../../../admin/runai-setup/cluster-setup/cluster-prerequisites/#distributed-training-via-kubeflow-mpi) + To use distributed training you need to have installed the Kubeflow MPI Operator as specified in [Distributed training](../../admin/runai-setup/cluster-setup/cluster-prerequisites.md#distributed-training). Syntax notes: diff --git a/docs/Researcher/cli-reference/runai-submit-dist-pytorch.md b/docs/Researcher/cli-reference/runai-submit-dist-pytorch.md index 9ed82d7a9a..96365ce10e 100644 --- a/docs/Researcher/cli-reference/runai-submit-dist-pytorch.md +++ b/docs/Researcher/cli-reference/runai-submit-dist-pytorch.md @@ -5,7 +5,7 @@ Submit a distributed PyTorch training run:ai job to run. !!! Note - To use distributed training you need to have installed the < insert pytorch operator here > as specified < insert pre-requisites link here >. + To use distributed training you need to have installed the PyTorch operator as specified in [Distributed training](../../admin/runai-setup/cluster-setup/cluster-prerequisites.md#distributed-training). Syntax notes: diff --git a/docs/admin/admin-ui-setup/admin-ui-users.md b/docs/admin/admin-ui-setup/admin-ui-users.md index 01701d7e3e..87e65dbbbb 100644 --- a/docs/admin/admin-ui-setup/admin-ui-users.md +++ b/docs/admin/admin-ui-setup/admin-ui-users.md @@ -2,10 +2,10 @@ ## Introduction -The Run:ai User Interface allows the creation of Run:ai Users. Run:ai Users can receive varying levels of access to the Administration UI and submit Jobs on the Cluster. +The Run:ai UI allows the creation of Run:ai Users. Users are assigned levels of access to all aspects of the UI, including submitting jobs on the cluster. !!! Tip - It is possible to connect the Run:ai user interface to the organization's directory and use single sign-on. This allows you to set Run:ai roles for users and groups from the organizational directory. For further information see [single sign-on configuration](../runai-setup/authentication/sso.md). + It is possible to connect the Run:ai UI to the organization's directory and use single sign-on (SSO). This allows you to set Run:ai roles for users and groups from the organizational directory. For further information see [single sign-on configuration](../runai-setup/authentication/sso.md).
## Working with Users @@ -14,54 +14,29 @@ You can create users, as well as update and delete users. ### Create a User !!! Note - To be able to review, add, update and delete users, you must have an *Administrator* access. If you do not have such access, please contact an Administrator. + To be able to review, add, update and delete users, you must have *System Administrator* access. To upgrade your access, contact a system administrator. -:octicons-versions-24: Department Admin is available in version 2.10 and later. +To create a new user: -1. Login to the Users area of the Run:ai User interface at `company-name.run.ai`. -2. Select the `Users` tab for local users, or the `SSO Users` tab for SSO users. -3. On the top right, select "NEW USER". -4. Enter the user's email. -5. Select Roles. More than one role can be selected. Available roles are: - * **Administrator**—Can manage Users and install Clusters. - * **Editor**—Can manage Projects and Departments. - * **Viewer**—View-only access to the Run:ai User Interface. - * **Researcher**—Can submit ML workloads. Setting a user as a *Researcher* also requires [assigning the user to projects](../project-setup/#create-a-new-project.md). - * **Research Manager**—Can act as *Researcher* in all projects, including new ones to be created in the future. - * **ML Engineer**—Can view and manage deployments and cluster resources. Available only when [Inference module is installed](../workloads/inference-overview.md). - * **Department Administrator**—Can manage Departments, descendent Projects and Workloads. +1. Log in to the Run:ai UI at `company-name.run.ai`. +2. Press the ![Tools and Settings](img/tools-and-settings.svg) icon, then select *Users*. +3. Press *New user* and enter the user's email address, then press *Create*. +4. Review the new user information and note the temporary password that has been assigned. To send the user an introductory email, select the checkbox. +5. Press *Done* when complete. - For more information, [Roles and permissions](#roles-and-permissions). +## Assigning access rules to users -6. (Optional) Select Cluster(s). This determines what Clusters are accessible to this User. -7. Press "Save". +Once you have created users, you can assign them *Access rules*, which provide the authorization needed to access system assets and resources. -You will get the new user credentials and have the option to send the credentials by email. +To add an *Access rule* to a user: -### Roles and permissions - -Roles provide a way to group permissions and assign them to either users or user groups. The role identifies the collection of permissions that administrators assign to users or user groups. Permissions define the actions that users can perform on the managed entities. The following table shows the default roles and permissions. +1. Select the user, then press *Access rules*, then press *+Access rule*. +2. Select a *Role* from the dropdown. +3. Press the ![Scope](../../images/scope-icon.svg) icon, then select a scope for the user. You can select multiple scopes. +4. After selecting all the required scopes, press *Save rule*. +5. To add another rule, press *+Access rule* again. +6. Press *Done* when all the rules are configured. -| Managed Entity / Roles | Admin | Dep. Admin | Editor | Research Manager | Researcher | ML Eng. | Viewer | -|:--|:--|:--|:--|:--|:--|:--|:--| -| Assign (Settings) Users/Groups/Apps to Roles | CRUD (all roles) | CRUD (Proj.
Researchers and ML Engineers only) | N/A | N/A | N/A | N/A | N/A | -| Assign Users/Groups/Apps to Organizations | R (Projects, Departments) | CRUD (Projects only) | CRUD (Projects, Departments) | N/A | N/A | N/A | N/A | -| Departments | R | R | CRUD | N/A | N/A | R | R | -| Projects | R | CRUD | CRUD | R | R | R | R | -| Jobs | R | R | R | R | CRUD | N/A | R | -| Deployments | R | R | R | N/A | N/A | CRUD | R | -| Workspaces | R | R | R | R | CRUD | N/A | N/A | -| Environments | CRUD | CRUD | CRUD | CRUD | CRUD | N/A | N/A | -| Data Sources | CRUD | CRUD | CRUD | CRUD | CRUD | N/A | N/A | -| Compute Resources | CRUD | CRUD | CRUD | CRUD | CRUD | N/A | N/A | -| Templates | CRUD | CRUD | CRUD | CRUD | CRUD | N/A | N/A | -| Clusters | CRUD | N/A | R | N/A | N/A | R | R | -| Node Pools | CRUD | N/A | R | N/A | N/A | R | R | -| Nodes | R | N/A | R | N/A | N/A | R | R | -| Settings (General, Credentials) | CRUD | N/A | N/A | N/A | N/A | N/A | N/A | -| Events History | R | N/A | N/A | N/A | N/A | N/A | N/A | -| Dashboard.Overview | R | R | R | R | R | R | R | -| Dashboards.Analytics | R | R | R | R | R | R | R | -| Dashboards.Consumption | R | N/A | N/A | N/A | N/A | N/A | N/A | +### Roles and permissions -Permissions: **C** = Create, **R** = Read, **U** = Update, **D** = Delete +Roles provide a way for administrators to group and identify collections of permissions that they assign to [subjects](../runai-setup/access-control/rbac.md#subjects). Permissions define the actions that can be performed on managed entities. The [Roles](../runai-setup/access-control/rbac.md#roles) table shows the default roles and permissions that come with the system. See [Role based access control](../runai-setup/access-control/rbac.md) for more information. diff --git a/docs/admin/admin-ui-setup/department-setup.md b/docs/admin/admin-ui-setup/department-setup.md index 20abbb479b..fc00c9e089 100644 --- a/docs/admin/admin-ui-setup/department-setup.md +++ b/docs/admin/admin-ui-setup/department-setup.md @@ -56,21 +56,39 @@ To add a new department: 1. In the **Departments** grid, press **New Department**. 2. Enter a name. -3. In *Quota management* configure the number GPUs, CPUs, and CPU memory. -4. In *Access control* select a user or application to be department administrator. If there are no users assigned the role of department administrator, see [Assigning Department Administrator role](#assigning-department-administrator-role). +3. In *Quota management*, configure the number of GPUs, CPUs, and CPU memory, then press *Save*. + + ### Assigning Department Administrator role +There are two ways to add *Department Administrator* roles to a department. + +The first is through the *Users* UI, and the second is through the *Access rules* that you can assign to a department. + +#### Users UI + You can create a new user with the *Department Administrator* role, or add the role to existing users. To create a new user with this role, see [Create a user](admin-ui-users.md#create-a-user). To add this role to an existing user: -1. Go to `Settings | Users`. -2. Select a user from the list and then press `Edit User`. -3. Select the `Department Admin` role from the list. (Deselect to remove the role from the user). -4. Press save when complete. +1. Press the ![Tools and Settings](img/tools-and-settings.svg) icon, then select *Users*. +2. Select a user, then press *Access rules*, then press *+Access rule*. +3. Select the `Department Administrator` role from the list. +4.
Press the ![Scope](../../images/scope-icon.svg) icon and select one or more departments. +5. Press *Save rule* and then *Close*. -After you have created the user with the Department Administrator role, you will need to assign the user to the correct department. +#### Assigning the access rule to the department + +To assign the *Access rule* to the department: + +1. Select a department from the list, then press *Access rules*, then press *+Access rule*. +2. From the *Subject type* dropdown choose *User* or *Application*, then enter the user name or the application name. +3. From the *Role* dropdown, select *Department administrator*, then press *Save rule*. +4. To add another rule, press *+Access rule* again. +5. When all the rules are configured, press *Close*. + + ### Assigning Projects to Departments diff --git a/docs/admin/admin-ui-setup/img/settings-icon.png b/docs/admin/admin-ui-setup/img/settings-icon.png new file mode 100644 index 0000000000..01602ffc34 Binary files /dev/null and b/docs/admin/admin-ui-setup/img/settings-icon.png differ diff --git a/docs/admin/admin-ui-setup/img/tools-and-settings.svg b/docs/admin/admin-ui-setup/img/tools-and-settings.svg new file mode 100644 index 0000000000..bb471df02e --- /dev/null +++ b/docs/admin/admin-ui-setup/img/tools-and-settings.svg @@ -0,0 +1,3 @@ + + + diff --git a/docs/admin/admin-ui-setup/project-setup.md b/docs/admin/admin-ui-setup/project-setup.md index f449e6456d..d823322c23 100644 --- a/docs/admin/admin-ui-setup/project-setup.md +++ b/docs/admin/admin-ui-setup/project-setup.md @@ -46,12 +46,10 @@ As an administrator, you may want to disconnect the two parameters. So, for exam !!! Note To be able to create or edit Projects, you must have *Editor* access. See the [Users](admin-ui-users.md) documentation. -1. In the left-menu, press **Projects**. -1.5 On the top right, select "Add New Project" +1. In the left-menu, press **Projects**, then press *+Add New Project*. 2. Choose a *Department* from the drop-down. The default is `default`. 3. Enter a *Project name*. Press *Namespace* to set the namespace associated with the project. You can either create the namespace from the project name (default) or enter an existing namespace. -4. In *Access control*, add one or more applications or users. If your user or application isn't in the list, see [Roles and permissions](admin-ui-users.md#roles-and-permissions), and verify that the users have the correct permissions. To change user permissions, see [Working with users](admin-ui-users.md#working-with-users). -5. In *Quota management*, configure the node pool priority (if editable), the GPUs, CPUs, CPU memory, and Over-quota priority settings. Configure the following: +4. In *Quota management*, configure the node pool priority (if editable), the GPUs, CPUs, CPU memory, and Over-quota priority settings. Configure the following: * *Order of priority*—the priority the node pool will receive when trying to schedule workloads. For more information, see [Node pool priority](../../Researcher/scheduling/using-node-pools.md#multiple-node-pools-selection). * *GPUs*—the number of GPUs in the node pool. Press *GPUs* and enter the number of GPUs, then press *Apply* to save. @@ -59,16 +57,27 @@ As an administrator, you may want to disconnect the two parameters. So, for exam * *CPU Memory*—the amount of memory the CPUs will be allocated. Press *CPU Memory*, enter an amount of memory, then press *Apply* to save.
* Over-quota priority—the priority for the specific node pool to receive over-quota allocations. -6. (Optional) In the *Scheduling rules* pane, use the dropdown arrow to open the pane. Press on the *+ Rule* button to add a new rule to the project. Add one (or more) of the following rule types: +5. (Optional) In the *Scheduling rules* pane, use the dropdown arrow to open the pane. Press the *+ Rule* button to add a new rule to the project. Add one (or more) of the following rule types: * *Idle GPU timeout*—controls the amount of time that specific workload GPUs which are idle will be remain assigned to the project before getting reassigned. * *Workspace duration*—limit the length of time a workspace will before being terminated. * *Training duration*—limit the length of time training workloads will run. * *Node type (Affinity)*—limits specific workloads to run on specific node types. + + + +## Assign users to a Project + + -## Assign Users to Project -When [Researcher Authentication](../runai-setup/authentication/researcher-authentication.md) is enabled, the Project form will contain an additional *Access Control* tab. The tab will allow you to assign Researchers to their Projects. +To assign *Access rules* to the project: + +1. Select a project from the list, then press *Access rules*, then press *+Access rule*. +2. From the *Subject type* dropdown choose *User* or *Application*, then enter the user name or the application name. +3. From the *Role* dropdown, select the desired role, then press *Save rule*. +4. To add another rule, press *+Access rule* again. +5. When all the rules are configured, press *Close*. If you are using Single-sign-on, you can also assign Groups @@ -175,9 +184,9 @@ To set a duration limit for interactive Jobs: * Create a Project or edit an existing Project. * Go to the *Time Limit* tab * You can limit interactive Jobs using two criteria: - * Set a hard time limit (day, hour, minute) to an Interactive Job, regardless of the activity of this Job, e.g. stop the Job after 1 day of work. - * Set a time limit for Idle Interactive Jobs, i.e. an Interactive Job idle for X time is stopped. Idle means no GPU activity. - * You can set if this idle time limit is effective for Interactive Jobs that are Preemptible, non-Preemptible, or both. + * Set a hard time limit (day, hour, minute) to an Interactive Job, regardless of the activity of this Job, e.g. stop the Job after 1 day of work. + * Set a time limit for Idle Interactive Jobs, i.e. an Interactive Job idle for X time is stopped. Idle means no GPU activity. + * You can set if this idle time limit is effective for Interactive Jobs that are Preemptible, non-Preemptible, or both. The setting only takes effect for Jobs that have started after the duration has been changed. @@ -187,7 +196,7 @@ To set a duration limit for Training Jobs: * Create a Project or edit an existing Project. * Go to the *Time Limit* tab: - * Set a time limit for Idle Training Jobs, i.e. a Training Job idle for X time is stopped. Idle means no GPU activity. + * Set a time limit for Idle Training Jobs, i.e. a Training Job idle for X time is stopped. Idle means no GPU activity. The setting only takes effect for Jobs that have started after the duration has been changed.
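Projects created in the UI are backed by Run:ai `Project` custom resources on the cluster. As a rough sketch of how an administrator might double-check what a project currently enforces — assuming the cluster exposes Projects through the `projects.run.ai` CRD in the `runai` namespace, as recent Run:ai versions do, and using `team-a` as a hypothetical project name:

```bash
# List the cluster-side Project objects that back the UI's Projects grid
# (assumes the projects.run.ai CRD; adjust to what your cluster exposes).
kubectl get projects.run.ai -n runai

# Dump one project's spec; quota and time-limit settings made in the UI
# surface here ("team-a" is a hypothetical project name).
kubectl get projects.run.ai team-a -n runai -o yaml
```

Because limits only apply to Jobs started after the change, inspecting the live object is a quick way to confirm that edited values have taken hold.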
diff --git a/docs/admin/integration/ray.md b/docs/admin/integration/ray.md index fb4ab52634..78ca42c443 100644 --- a/docs/admin/integration/ray.md +++ b/docs/admin/integration/ray.md @@ -2,7 +2,7 @@ Ray is an open-source unified framework for scaling AI and Python applications like machine learning. It provides the compute layer for parallel processing so that you don’t need to be a distributed systems expert. -## Sumitting Ray jobs +## Install the Ray operator You must install KubeRay version 0.5.0 or greater in order to work with the different types of Ray workloads. @@ -16,7 +16,8 @@ helm install kuberay-operator kuberay/kuberay-operator -n kuberay-operator --ver For more information, see [Deploying RayKube operator](https://docs.ray.io/en/latest/cluster/kubernetes/getting-started.html#deploying-the-kuberay-operator){target=_blank}. -## Submit Ray jobs +## Submit a Ray job + Run:AI integrates with ray by interacting with the kuberay CRDs (RayJob, RayServe and RayCluster). The following is an example of RayJob scheduled by Run:AI. Use the following command to submit your Ray job: diff --git a/docs/admin/runai-setup/access-control/rbac.md b/docs/admin/runai-setup/access-control/rbac.md new file mode 100644 index 0000000000..bddfdb73d2 --- /dev/null +++ b/docs/admin/runai-setup/access-control/rbac.md @@ -0,0 +1,109 @@ +# Role based access control + +User authorization to system resources and assets is managed using [Role-based access control (RBAC)](https://en.wikipedia.org/wiki/Role-based_access_control){target=_blank}. RBAC is a policy-neutral access control mechanism defined around roles and privileges. The components of RBAC make it simple to manage access to system resources and assets. + +## RBAC components + +Run:ai uses the following components for RBAC: + +### Subjects + +A *Subject* is an entity to which an access rule is applied. *Subjects* are: + +* Users +* Applications +* Groups (SSO only) + +### Roles + +A role is a collection of permitted actions on managed entities. Run:ai supports the following roles and actions within the user's granted scope: + +| Managed Entity | System Admin (1) | Department Admin (4) | Editor (5) | Research Manager | Researcher | ML Eng.
| Viewer | Researcher L1 | Researcher L2 | Environments Admin | Data Sources Admin | Compute Resources Admin | Templates Admin | Department Viewer | |-----------------------------------------------------------------------|------------------|----------------------|------------|------------------|------------|---------|--------|---------------|---------------|--------------------|--------------------|-------------------------|-----------------|-------------------| | Create local users and applications | CRUD | CRUD | | | | | | | | | | | | | | Assign Users/Groups/Apps to Roles with scopes (Departments, Projects) | CRUD | CRUD | CRUD | | | | | | | | | | | | | Roles | CRUD | R | R | | | | | | | | | | | | | Departments | CRUD | R (6) | CRUD | | | R | R | | | R | R | R | R | R | | Projects | CRUD | CRUD | CRUD | R (2) (3) | R | R | R | R | CRUD | R | R | R | R | R | | Jobs | CRUD | CRUD | CRUD | R | CRUD | | R | CRUD | CRUD | R | R | R | R | R | | Deployments | CRUD | CRUD | R | | | CRUD | R | | | | | | | R | | Workspaces | CRUD | CRUD | CRUD | R | CRUD | | R | CRUD | CRUD | R | R | R | R | R | | Trainings | CRUD | CRUD | CRUD | R | CRUD | | R | CRUD | | R | R | R | R | R | | Environments | CRUD | CRUD | CRUD | CRUD | CRUD | | R | R | R | CRUD | R | R | R | R | | Data Sources | CRUD | CRUD | CRUD | CRUD | CRUD | | R | R | R | R | CRUD | R | R | R | | Compute Resources | CRUD | CRUD | CRUD | CRUD | CRUD | | R | R | R | R | R | CRUD | R | R | | Templates | CRUD | CRUD | CRUD | CRUD | CRUD | | R | R | R | R | R | R | CRUD | R | | Policies (7) | CRUD | CRUD | R | R | R | R | R | R | | R | R | R | R | R | | Clusters | CRUD | R | R | R | R | R | R | R | | R | R | R | R | R | | Node Pools | CRUD | R | R | | | R | R | | | | | | | | | Nodes | R | R | R | | | R | R | | | | | | | | | Settings.General | CRUD | | | | | | | | | | | | | | | Credentials (Settings.Cre...) | CRUD | R | R | R | R | R | R | R | | | R | | | | | Events History | R | | | | | | | | | | | | | | | Dashboard.Overview | R | R | R | R | R | R | R | R | R | R | R | R | R | R | | Dashboards.Analytics | R | R | R | R | R | R | R | R | R | R | R | R | R | R | | Dashboards.Consumption | R | R | | | | | | R | R | | | | | | + +Permissions: **C** = Create, **R** = Read, **U** = Update, **D** = Delete + +!!!Note + + 1. *Admin* becomes *System Admin* with full access to all managed objects and scopes. + 2. *Research Manager* is **not** automatically assigned to all projects but to Projects set by the relevant *Admin* when assigning this role to a user, group, or app. + 3. To preserve backward compatibility, users with the role of *Research Manager* are assigned to all current projects, but not to new projects. + 4. To allow the *Department Admin* to assign a *Researcher* role to a user, group, or app, the *Department Admin* must have **CRUD** permissions for **Jobs** and **Workspaces**. This creates a broader span of managed objects. + 5. To preserve backward compatibility, users with the role *Editor* are assigned to the same scope they had before the upgrade. However, with new user assignments, the *Admin* can limit the scope to only part of the organizational scope. + 6. *Department Admin* permissions for **Departments** remain **Read** as long as there is no hierarchy. Once a hierarchy is introduced, permissions need to change to **CRUD** to allow the *Department Admin* to create new Departments under their own department. + 7. Policies are accessible through **Clusters** using YAML files.
There is no UI for policies, although these policies affect UI elements (for example, Job Forms, Workspaces, Trainings). +### Scope + +A *Scope* is an organizational component to which access is granted based on assigned roles. *Scopes* include: + +* Projects +* Departments +* Clusters +* Tenant (all clusters) + +### Asset + +RBAC uses [rules](#access-rules) to ensure that only authorized users or applications can gain access to system assets. Assets that can have RBAC rules applied are: + +* Departments +* Projects +* Deployments +* Workspaces +* Environments +* Quota management dashboard +* Training + +### RBAC enforcement + +RBAC ensures that users have access to system assets based on the rules applied to those assets. If an asset is part of a larger scope of assets to which the user does not have access, the scope shown to the user will appear incomplete, because the user is able to access **only** the assets to which they are authorized. + +## Access rules + +An *Access rule* is the assignment of a *Role* to a *Subject* in a *Scope*. *Access rules* are expressed as follows: + +`<Subject>` is a `<Role>` in a `<Scope>`. + +**For example**: +User **user@domain.com** is a **department admin** in **Department A**. + +### Create or delete rules + +To create a new access rule: + +1. Press the ![Tools and Settings](../../admin-ui-setup/img/tools-and-settings.svg) icon, then *Roles and Access rules*. +2. Choose *Access rules*, then *New access rule*. +3. Select a user type from the dropdown. +4. Select a role from the dropdown. +5. Press the ![Scope](../../../images/scope-icon.svg) icon and select a scope, and press *Save rule* when done. + +!!! Note + You cannot edit *Access rules*. To change an *Access rule*, you need to delete the rule and create a new rule to replace it. You can also add multiple rules for the same user. + +To delete a rule: + +1. Press the ![Tools and Settings](../../admin-ui-setup/img/tools-and-settings.svg) icon, then *Roles and Access rules*. +2. Choose *Access rules*, then select a rule and press *Delete*. diff --git a/docs/admin/runai-setup/authentication/researcher-authentication.md b/docs/admin/runai-setup/authentication/researcher-authentication.md index ee4e7f078d..698dd604f7 100644 --- a/docs/admin/runai-setup/authentication/researcher-authentication.md +++ b/docs/admin/runai-setup/authentication/researcher-authentication.md @@ -77,8 +77,6 @@ Modifying the API Server configuration differs between Kubernetes distributions: If working via Rancher UI, need to add the flag as part of the cluster provisioning. Under `Cluster Management | Create`, turn on RKE2 and select a platform. Under `Cluster Configuration | Advanced | Additional API Server Args`. Add the Run:ai flags as `<key>=<value>` (e.g. `oidc-username-prefix=-`). - - At the time of writing, the flags cannot be changed after the cluster has been provisioned due to a Rancher bug. === "GKE" Install [Anthos identity service](https://cloud.google.com/kubernetes-engine/docs/how-to/oidc#enable-oidc){target=_blank} by running: diff --git a/docs/admin/runai-setup/cluster-setup/cluster-install.md b/docs/admin/runai-setup/cluster-setup/cluster-install.md index 6645aefe54..42bbdcd2f0 100644 --- a/docs/admin/runai-setup/cluster-setup/cluster-install.md +++ b/docs/admin/runai-setup/cluster-setup/cluster-install.md @@ -31,11 +31,7 @@ Using the Wizard: * Go to `<company-name>.run.ai/dashboards/now`. * Verify that the number of GPUs on the top right reflects your GPU resources on your cluster and the list of machines with GPU resources appears on the bottom line.
- - -:octicons-versions-24: Version 2.9 and up - -Run: `kubectl get cm runai-public -n runai -o jsonpath='{.data}' | yq -P` +* Run: `kubectl get cm runai-public -n runai -o jsonpath='{.data}' | yq -P` (assumes the [yq](https://mikefarah.gitbook.io/yq/v/v3.x/){target=_blank} tool is installed) diff --git a/docs/admin/runai-setup/cluster-setup/cluster-prerequisites.md b/docs/admin/runai-setup/cluster-setup/cluster-prerequisites.md index 015767ea2d..11c881bc85 100644 --- a/docs/admin/runai-setup/cluster-setup/cluster-prerequisites.md +++ b/docs/admin/runai-setup/cluster-setup/cluster-prerequisites.md @@ -1,4 +1,4 @@ -Below are the prerequisites of a cluster installed with Run:ai. +Below are the prerequisites of a cluster installed with Run:ai. ## Prerequisites in a Nutshell The following is a checklist of the Run:ai prerequisites: | Prerequisite | Details | |--------------|---------| -| [Kubernetes](#kubernetes) | Verify certified vendor and correct version. | +| [Kubernetes](#kubernetes) | Verify certified vendor and correct version. | | [NVIDIA GPU Operator](#nvidia) | Different Kubernetes flavors have slightly different setup instructions.
Verify correct version. | -| [Ingress Controller](#ingress-controller) | Install and configure NGINX (some Kubernetes flavors have NGINX pre-installed). | -| [Prometheus](#prometheus) | Install Prometheus. | -| [Trusted domain name](#cluster-url) | You must provide a trusted domain name. Accessible only inside the organization | -| (Optional) [Distributed Training](#distributed-training) | Install Kubeflow Training Operator if required. | -| (Optional) [Inference](#inference) | Some third party software needs to be installed to use the Run:ai inference module. | +| [Ingress Controller](#ingress-controller) | Install and configure NGINX (some Kubernetes flavors have NGINX pre-installed). | +| [Prometheus](#prometheus) | Install Prometheus. | +| [Trusted domain name](#cluster-url) | You must provide a trusted domain name that is accessible only inside the organization. | +| (Optional) [Distributed Training](#distributed-training) | Install Kubeflow Training Operator if required. | +| (Optional) [Inference](#inference) | Some third party software needs to be installed to use the Run:ai inference module. | -There are also specific [hardware](#hardware-requirements), [operating system](#operating-system) and [network access](#network-access-requirements) requirements. A [pre-install](#pre-install-script) script is available to test if the prerequisites are met before installation. +There are also specific [hardware](#hardware-requirements), [operating system](#operating-system) and [network access](#network-access-requirements) requirements. A [pre-install](#pre-install-script) script is available to test if the prerequisites are met before installation. ## Software Requirements ### Operating System -* Run:ai will work on any __Linux__ operating system that is supported by __both__ Kubernetes and [NVIDIA](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html){target=_blank}. +* Run:ai will work on any __Linux__ operating system that is supported by __both__ Kubernetes and [NVIDIA](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html){target=_blank}. * An important highlight is that GKE (Google Kubernetes Engine) will only work with Ubuntu, as NVIDIA [does not support](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/google-gke.html#about-using-the-operator-with-google-gke) the default _Container-Optimized OS with Containerd_ image. -* Run:ai performs its internal tests on Ubuntu 20.04 and CoreOS for OpenShift. +* Run:ai performs its internal tests on Ubuntu 20.04 and CoreOS for OpenShift. ### Kubernetes -Run:ai requires Kubernetes. Run:ai is been certified with the following Kubernetes distributions: +Run:ai requires Kubernetes. Run:ai has been certified with the following Kubernetes distributions: -| Kubernetes Distribution | Description | Installation Notes | +| Kubernetes Distribution | Description | Installation Notes | |-----------------------------------|-------------|--------------------| | Vanilla Kubernetes | Using no specific distribution but rather k8s native installation | See instructions for a simple (non-production-ready) [Kubernetes Installation](install-k8s.md) script. | -| OCP | OpenShift Container Platform | The Run:ai operator is [certified](https://catalog.redhat.com/software/operators/detail/60be3acc3308418324b5e9d8){target=_blank} for OpenShift by Red Hat.
| +| OCP | OpenShift Container Platform | The Run:ai operator is [certified](https://catalog.redhat.com/software/operators/detail/60be3acc3308418324b5e9d8){target=_blank} for OpenShift by Red Hat. | | EKS | Amazon Elastic Kubernetes Service | | | AKS | Azure Kubernetes Services | | -| GKE | Google Kubernetes Engine | | -| RKE | Rancher Kubernetes Engine | When installing Run:ai, select _On Premise_. RKE2 has a defect which requires a specific installation flow. Please contact Run:ai customer support for additional details. | +| GKE | Google Kubernetes Engine | | +| RKE | Rancher Kubernetes Engine | When installing Run:ai, select _On Premise_. | | Bright | [NVIDIA Bright Cluster Manager](https://www.nvidia.com/en-us/data-center/bright-cluster-manager/){target=_blank} | In addition, NVIDIA DGX comes [bundled](dgx-bundle.md) with Run:ai | -Run:ai has been tested with the following Kubernetes distributions. Please contact Run:ai Customer Support for up to date certification details: +Run:ai has been tested with the following Kubernetes distributions. Please contact Run:ai Customer Support for up-to-date certification details: -| Kubernetes Distribution | Description | Installation Notes | +| Kubernetes Distribution | Description | Installation Notes | |-----------------------------------|-------------|--------------------| | Ezmeral | HPE Ezmeral Container Platform | See Run:ai at [Ezmeral marketplace](https://www.hpe.com/us/en/software/marketplace/runai.html){target=_blank} | | Tanzu | VMWare Kubernetes | Tanzu supports _containerd_ rather than _docker_. See the NVIDIA prerequisites below as well as [cluster customization](customize-cluster-install.md) for changes required for containerd | @@ -63,15 +63,16 @@ For an up-to-date end-of-life statement of Kubernetes see [Kubernetes Release Hi Run:ai does not support [Pod Security Admission](https://kubernetes.io/docs/concepts/security/pod-security-admission/){target=_blank}. Support for [Pod Security Policy](https://kubernetes.io/docs/concepts/policy/pod-security-policy/){target=_blank} has been removed with Run:ai 2.9. -### NVIDIA +### NVIDIA -Run:ai has been certified on __NVIDIA GPU Operator__ 22.9 to 23.3. Older versions (1.10 and 1.11) have a documented [NVIDIA issue](https://github.com/NVIDIA/gpu-feature-discovery/issues/26){target=_blank}. +Run:ai has been certified on __NVIDIA GPU Operator__ 22.9 to 23.3. Older versions (1.10 and 1.11) have a documented [NVIDIA issue](https://github.com/NVIDIA/gpu-feature-discovery/issues/26){target=_blank}. Follow the [Getting Started guide](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#install-nvidia-gpu-operator){target=blank} to install the NVIDIA GPU Operator, or see the distribution-specific instructions below: === "EKS" + * When setting up EKS, do not install the NVIDIA device plug-in (as we want the NVIDIA GPU Operator to install it instead). When using the [eksctl](https://eksctl.io/){target=_blank} tool to create an AWS EKS cluster, use the flag `--install-nvidia-plugin=false` to disable this install. - * Follow the [Getting Started guide](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#install-nvidia-gpu-operator){target=blank} to install the NVIDIA GPU Operator. For GPU nodes, EKS uses an AMI which already contains the NVIDIA drivers. As such, you must use the GPU Operator flags: `--set driver.enabled=false`.
+ * Follow the [Getting Started guide](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#install-nvidia-gpu-operator){target=_blank} to install the NVIDIA GPU Operator. For GPU nodes, EKS uses an AMI which already contains the NVIDIA drivers. As such, you must use the GPU Operator flag `--set driver.enabled=false`. === "GKE" @@ -105,24 +106,25 @@ Follow the [Getting Started guide](https://docs.nvidia.com/datacenter/cloud-nati !!! Important * Run:ai on GKE has only been tested with GPU Operator version 22.9 and up. * The above only works for Run:ai 2.7.16 and above. - + !!! Notes * Use the default namespace `gpu-operator`. Otherwise, you must specify the target namespace using the flag `runai-operator.config.nvidiaDcgmExporter.namespace` as described in [customized cluster installation](customize-cluster-install.md). - * NVIDIA drivers may already be installed on the nodes. In such cases, use the NVIDIA GPU Operator flags `--set driver.enabled=false`. [DGX OS](https://docs.nvidia.com/dgx/index.html){target=_blank} is one such example as it comes bundled with NVIDIA Drivers. + * NVIDIA drivers may already be installed on the nodes. In such cases, use the NVIDIA GPU Operator flag `--set driver.enabled=false`. [DGX OS](https://docs.nvidia.com/dgx/index.html){target=_blank} is one such example as it comes bundled with NVIDIA Drivers. * To use [Dynamic MIG](../../../Researcher/scheduling/fractions.md#dynamic-mig), the GPU Operator must be installed with the flag `mig.strategy=mixed`. If the GPU Operator is already installed, edit the clusterPolicy by running ```kubectl patch clusterPolicy cluster-policy -n gpu-operator --type=merge -p '{"spec":{"mig":{"strategy": "mixed"}}}``` ### Ingress Controller -Run:ai requires an ingress controller as a prerequisite. The Run:ai cluster installation configures one or more ingress objects on top of the controller. +Run:ai requires an ingress controller as a prerequisite. The Run:ai cluster installation configures one or more ingress objects on top of the controller. There are many ways to install and configure an ingress controller and configuration is environment-dependent. A simple solution is to install & configure _NGINX_: -=== "On Prem" +=== "On Prem" ``` bash helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx @@ -136,11 +138,11 @@ There are many ways to install and configure an ingress controller and configura 1. External and internal IP of one of the nodes === "RKE" - RKE and RKE2 come pre-installed with NGINX. No further action needs to be taken. + RKE and RKE2 come pre-installed with NGINX. No further action needs to be taken. === "Managed Kubernetes" - For managed Kubernetes such as EKS: + For managed Kubernetes such as EKS: ``` bash helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx --namespace nginx-ingress --create-namespace ``` -For support of ingress controllers different than NGINX please contact Run:ai customer support. +For support of ingress controllers different than NGINX please contact Run:ai customer support. !!! Note In a self-hosted installation, the typical scenario is to install the first Run:ai cluster on the same Kubernetes cluster as the control plane. In this case, there is no need to install an ingress controller as it is pre-installed by the control plane. @@ -172,17 +174,17 @@ kubectl create secret tls runai-cluster-domain-tls-secret -n runai \ ``` 1.
The domain's cert (public key). -2. The domain's private key. +2. The domain's private key. For more information on how to create a TLS secret see: [https://kubernetes.io/docs/concepts/configuration/secret/#tls-secrets](https://kubernetes.io/docs/concepts/configuration/secret/#tls-secrets){target=_blank}. !!! Note - In a self-hosted installation, the typical scenario is to install the first Run:ai cluster on the same Kubernetes cluster as the control plane. In this case, the cluster URL need not be provided as it will be the same as the control-plane URL. + In a self-hosted installation, the typical scenario is to install the first Run:ai cluster on the same Kubernetes cluster as the control plane. In this case, the cluster URL need not be provided as it will be the same as the control-plane URL. -### Prometheus +### Prometheus -If not already installed on your cluster, install the full `kube-prometheus-stack` through the [Prometheus community Operator](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack){target=_blank}. +If not already installed on your cluster, install the full `kube-prometheus-stack` through the [Prometheus community Operator](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack){target=_blank}. !!! Note * If Prometheus has been installed on the cluster in the past, even if it was uninstalled (such as when upgrading from Run:ai 2.8 or lower), you will need to update Prometheus CRDs as described [here](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack#upgrading-chart){target=_blank}. For more information on the Prometheus bug see [here](https://github.com/prometheus-community/helm-charts/issues/2753){target=_blank}. @@ -197,7 +199,7 @@ helm install prometheus prometheus-community/kube-prometheus-stack \ -n monitoring --create-namespace --set grafana.enabled=false # (1) ``` -1. The Grafana component is not required for Run:ai. +1. The Grafana component is not required for Run:ai. ## Optional Software Requirements @@ -220,7 +222,7 @@ kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/stand To use the Run:ai inference module you must pre-install [Knative Serving](https://knative.dev/docs/install/yaml-install/serving/install-serving-with-yaml/){target=_blank}. Follow the instructions [here](https://knative.dev/docs/install/){target=_blank} to install. Run:ai is certified on Knative 1.4 to 1.8 with Kubernetes 1.22 or later. -Post-install, you must configure Knative to use the Run:ai scheduler and allow pod affinity, by running: +Post-install, you must configure Knative to use the Run:ai scheduler and allow pod affinity, by running: ``` kubectl patch configmap/config-features \ @@ -236,7 +238,7 @@ Run:ai allows to autoscale a deployment according to various metrics: 2. CPU Utilization (%) 3. Latency (milliseconds) 4. Throughput (requests/second) -5. Concurrency +5. Concurrency 6. Any custom metric Additional installation may be needed for some of the metrics as follows: @@ -245,12 +247,12 @@ Additional installation may be needed for some of the metrics as follows: * Any other metric will require installing the [HPA Autoscaler](https://knative.dev/docs/install/yaml-install/serving/install-serving-with-yaml/#install-optional-serving-extensions){target=_blank}. * Using _GPU Utilization_, _Latency_ or _Custom metric_ will __also__ require the Prometheus adapter. 
The Prometheus adapter is part of the Run:ai installer and can be added by setting the `prometheus-adapter.enabled` flag to `true`. See [Customizing the Run:ai installation](./customize-cluster-install.md) for further information. -If you wish to use an _existing_ Prometheus adapter installation, you will need to configure it manually with the Run:ai Prometheus rules, specified in the Run:ai chart values under `prometheus-adapter.rules` field. For further information please contact Run:ai customer support. +If you wish to use an _existing_ Prometheus adapter installation, you will need to configure it manually with the Run:ai Prometheus rules, specified in the Run:ai chart values under `prometheus-adapter.rules` field. For further information please contact Run:ai customer support. #### Accessing Inference from outside the Cluster -Inference workloads will typically be accessed by consumers residing outside the cluster. You will hence want to provide consumers with a URL to access the workload. The URL can be found in the Run:ai user interface under the deployment screen (alternatively, run `kubectl get ksvc -n `). +Inference workloads will typically be accessed by consumers residing outside the cluster. You will hence want to provide consumers with a URL to access the workload. The URL can be found in the Run:ai user interface under the deployment screen (alternatively, run `kubectl get ksvc -n `). However, for the URL to be accessible outside the cluster you must configure your DNS as described [here](https://knative.dev/docs/install/yaml-install/serving/install-serving-with-yaml/#configure-dns){target=_blank}. @@ -270,43 +272,43 @@ However, for the URL to be accessible outside the cluster you must configure you (see picture below) * (Production only) __Run:ai System Nodes__: To reduce downtime and save CPU cycles on expensive GPU Machines, we recommend that production deployments will contain __two or more__ worker machines, designated for Run:ai Software. The nodes do not have to be dedicated to Run:ai, but for Run:ai purposes we would need: - + * 8 CPUs * 16GB of RAM * 50GB of Disk space - + * __Shared data volume:__ Run:ai uses Kubernetes to abstract away the machine on which a container is running: - * Researcher containers: The Researcher's containers need to be able to access data from any machine in a uniform way, to access training data and code as well as save checkpoints, weights, and other machine-learning-related artifacts. + * Researcher containers: The Researcher's containers need to be able to access data from any machine in a uniform way, to access training data and code as well as save checkpoints, weights, and other machine-learning-related artifacts. * The Run:ai system needs to save data on a storage device that is not dependent on a specific node. - Typically, this is achieved via Kubernetes Storage class based on Network File Storage (NFS) or Network-attached storage (NAS). + Typically, this is achieved via Kubernetes Storage class based on Network File Storage (NFS) or Network-attached storage (NAS). * __Docker Registry:__ With Run:ai, Workloads are based on Docker images. For container images to run on any machine, these images must be downloaded from a docker registry rather than reside on the local machine (though this also is [possible](../../../researcher-setup/docker-to-runai/#image-repository)). You can use a public registry such as [docker hub](https://hub.docker.com/){target=_blank} or set up a local registry on-prem (preferably on a dedicated machine). 
Run:ai can assist with setting up the repository. -* __Kubernetes:__ Production Kubernetes installation requires separate nodes for the Kubernetes master. For more details see your specific Kubernetes distribution documentation. +* __Kubernetes:__ Production Kubernetes installation requires separate nodes for the Kubernetes master. For more details see your specific Kubernetes distribution documentation. ![img/prerequisites.png](img/prerequisites.jpg) ## User requirements -__Usage of containers and images:__ The individual Researcher's work must be based on [container](https://www.docker.com/resources/what-container){target=_blank} images. +__Usage of containers and images:__ The individual Researcher's work must be based on [container](https://www.docker.com/resources/what-container){target=_blank} images. ## Network Access Requirements -__Internal networking:__ Kubernetes networking is an add-on rather than a core part of Kubernetes. Different add-ons have different network requirements. You should consult the documentation of the specific add-on on which ports to open. It is however important to note that unless special provisions are made, Kubernetes assumes __all__ cluster nodes can interconnect using __all__ ports. +__Internal networking:__ Kubernetes networking is an add-on rather than a core part of Kubernetes. Different add-ons have different network requirements. You should consult the documentation of the specific add-on on which ports to open. It is however important to note that unless special provisions are made, Kubernetes assumes __all__ cluster nodes can interconnect using __all__ ports. -__Outbound network:__ Run:ai user interface runs from the cloud. All container nodes must be able to connect to the Run:ai cloud. Inbound connectivity (connecting from the cloud into nodes) is not required. If outbound connectivity is limited, the following exceptions should be applied: +__Outbound network:__ Run:ai user interface runs from the cloud. All container nodes must be able to connect to the Run:ai cloud. Inbound connectivity (connecting from the cloud into nodes) is not required. If outbound connectivity is limited, the following exceptions should be applied: ### During Installation -Run:ai requires an installation over the Kubernetes cluster. The installation access the web to download various images and registries. Some organizations place limitations on what you can pull from the internet. The following list shows the various solution components and their origin: +Run:ai requires an installation over the Kubernetes cluster. The installation access the web to download various images and registries. Some organizations place limitations on what you can pull from the internet. The following list shows the various solution components and their origin: | Name | Description | URLs | Ports | |------|-------------|------|-------| -|Run:ai Repository| Run:ai Helm Package Repository| runai-charts.storage.googleapis.com |443 | -| Docker Images Repository | Run:ai images | gcr.io/run-ai-prod |443 | +|Run:ai Repository| Run:ai Helm Package Repository| runai-charts.storage.googleapis.com |443 | +| Docker Images Repository | Run:ai images | gcr.io/run-ai-prod |443 | | Docker Images Repository | Third party Images |hub.docker.com and quay.io | 443 | | Run:ai | Run:ai Cloud instance | app.run.ai | |443, 53 | @@ -317,10 +319,10 @@ Run:ai requires an installation over the Kubernetes cluster. 
The installation accesses the web to download images from various registries. Some organizations place limitations on what you can pull from the internet. The following list shows the various solution components and their origin: | Name | Description | URLs | Ports | |------|-------------|------|-------| -|Run:ai Repository| Run:ai Helm Package Repository| runai-charts.storage.googleapis.com |443 | -| Docker Images Repository | Run:ai images | gcr.io/run-ai-prod |443 | +|Run:ai Repository| Run:ai Helm Package Repository| runai-charts.storage.googleapis.com |443 | +| Docker Images Repository | Run:ai images | gcr.io/run-ai-prod |443 | | Docker Images Repository | Third party Images |hub.docker.com and quay.io | 443 | | Run:ai | Run:ai Cloud instance | app.run.ai | |443, 53 | @@ -317,10 +319,10 @@ Run:ai requires an installation over the Kubernetes cluster. In addition, once running, Run:ai requires an outbound network connection to the following targets: -| Name | Description | URLs | Ports | +| Name | Description | URLs | Ports | |------|-------------|------|-------| | Grafana |Grafana Metrics Server | prometheus-us-central1.grafana.net and runailabs.com |443 | -| Run:ai | Run:ai Cloud instance | app.run.ai |443, 53 | +| Run:ai | Run:ai Cloud instance | app.run.ai |443, 53 | ### Network Proxy If you are using a Proxy for outbound communication please contact Run:ai custom @@ -332,7 +334,7 @@ If you are using a Proxy for outbound communication please contact Run:ai custom Once you believe that the Run:ai prerequisites are met, we highly recommend installing and running the Run:ai [pre-install diagnostics script](https://github.com/run-ai/preinstall-diagnostics){target=_blank}. The tool: * Tests the below requirements as well as additional failure points related to Kubernetes, NVIDIA, storage, and networking. -* Looks at additional components installed and analyze their relevance to a successful Run:ai installation. +* Looks at additional components installed and analyzes their relevance to a successful Run:ai installation. To use the script [download](https://github.com/run-ai/preinstall-diagnostics/releases){target=_blank} the latest version of the script and run: ``` chmod +x preinstall-diagnostics-<platform> ./preinstall-diagnostics-<platform> ``` -If the script shows warnings or errors, locate the file `runai-preinstall-diagnostics.txt` in the current directory and send it to Run:ai technical support. +If the script shows warnings or errors, locate the file `runai-preinstall-diagnostics.txt` in the current directory and send it to Run:ai technical support. For more information on the script including additional command-line flags, see [here](https://github.com/run-ai/preinstall-diagnostics){target=_blank}. diff --git a/docs/admin/runai-setup/cluster-setup/customize-cluster-install.md b/docs/admin/runai-setup/cluster-setup/customize-cluster-install.md index 70e3b82413..930c5799d7 100644 --- a/docs/admin/runai-setup/cluster-setup/customize-cluster-install.md +++ b/docs/admin/runai-setup/cluster-setup/customize-cluster-install.md @@ -13,30 +13,10 @@ The Run:ai cluster creation wizard requires the download of a _Helm values file_ | `runai-operator.config.global.runtime` | `docker` | Defines the container runtime of the cluster (supports `docker` and `containerd`). Set to `containerd` when using Tanzu | | `runai-operator.config.global.nvidiaDcgmExporter.namespace` | `gpu-operator` | The namespace where dcgm-exporter (or gpu-operator) was installed | | `runai-operator.config.global.nvidiaDcgmExporter.installedFromGpuOperator` | `true` | Indicated whether the dcgm-exporter was installed via gpu-operator or not | -| `kube-prometheus-stack.enabled` | `true` | (Version 2.8 or lower) Set to `false` when the cluster has an existing Prometheus installation that is __not based__ on the Prometheus __operator__. This setting requires Run:ai customer support | -| `kube-prometheus-stack.prometheusOperator.enabled` | `true` | (Version 2.8 or lower) Set to `false` when the cluster has an existing Prometheus installation __based__ on the Prometheus __operator__ and Run:ai should use the existing one rather than install a new one | -| `prometheus-adapter.enabled` | `false` | (Version 2.8 or lower) Install Prometheus Adapter. Used for Inference workloads using a custom metric for autoscaling.
Set to `true` if __Prometheus Adapter__ is not already installed in the cluster | -| `prometheus-adapter.prometheus` | The address of the default Prometheus Service | (Version 2.8 or lower) If you installed your own custom Prometheus Service, set this field accordingly with `url` and `port` | -### Prometheus - -=== "Version 2.9 or higher" - Not relevant - -=== "Version 2.8 or lower" - The Run:ai Cluster installation uses [Prometheus](https://prometheus.io/){target=_blank}. There are 3 alternative configurations: - - 1. Run:ai installs Prometheus (default). - 2. Run:ai uses an existing Prometheus installation based on the Prometheus operator. - 3. Run:ai uses an existing Prometheus installation based on a regular Prometheus installation. - - For option 2, disable the flag `kube-prometheus-stack.prometheusOperator.enabled`. For option 3, please contact Run:ai Customer support. - - For options 2 and 3, if you enabled `prometheus-adapter`, please configure it as described in the Prometheus Adapter [documentation](https://github.com/prometheus-community/helm-charts/blob/97f23f1ff7ca62f33ab4dd339cc62addec7eccde/charts/prometheus-adapter/values.yaml#L34) - ## Understanding Custom Access Roles diff --git a/docs/admin/runai-setup/config/allow-external-access-to-containers.md b/docs/admin/runai-setup/config/allow-external-access-to-containers.md index 6564d15769..18811c834c 100644 --- a/docs/admin/runai-setup/config/allow-external-access-to-containers.md +++ b/docs/admin/runai-setup/config/allow-external-access-to-containers.md @@ -37,37 +37,36 @@ To address this issue, Run:ai provides support for __host-based routing__. When To enable host-based routing you must perform the following steps: -1. Create a second DNS entry `*.`, pointing to the same IP as the original [Cluster URL](../cluster-setup/cluster-prerequisites.md#cluster-url) DNS. +1. Create a second DNS entry `*.`, pointing to the same IP as the original [Cluster URL](../cluster-setup/cluster-prerequisites.md#cluster-url) DNS. 2. Obtain a __star__ SSL certificate for this DNS. 3. Add the certificate as a secret: -=== "SaaS" - ``` - kubectl create secret tls runai-cluster-domain-star-tls-secret -n runai \ - --cert /path/to/fullchain.pem --key /path/to/private.pem - ``` - -=== "Self hosted" - ``` - kubectl create secret tls runai-cluster-domain-star-tls-secret -n runai-backend \ - --cert /path/to/fullchain.pem --key /path/to/private.pem - ``` - -4. Create an ingress rule to direct traffic: - -=== "SaaS" - ``` - kubectl patch ingress researcher-service-ingress -n runai --type json \ - --patch '[{ "op": "add", "path": "/spec/tls/-", "value": { "hosts": [ "*." ], "secretName": "runai-cluster-domain-star-tls-secret" } }]' - ``` - -=== "Self hosted" - ``` - kubectl patch ingress runai-backend-ingress -n runai-backend --type json \ - --patch '[{ "op": "add", "path": "/spec/tls/-", "value": { "hosts": [ "*." ], "secretName": "runai-cluster-domain-star-tls-secret" } }]' - ``` +``` +kubectl create secret tls runai-cluster-domain-star-tls-secret -n runai \ + --cert /path/to/fullchain.pem --key /path/to/private.pem +``` + +4. Create the following ingress rule: + +``` YAML +apiVersion: networking.k8s.io/v1 +kind: Ingress +metadata: + name: runai-cluster-domain-star-ingress + namespace: runai +spec: + ingressClassName: nginx + rules: + - host: '*.' + tls: + - hosts: + - '*.' + secretName: runai-cluster-domain-star-tls-secret +``` + +Replace `` as described above. 5. 
Edit Runaiconfig to generate the URLs correctly: diff --git a/docs/admin/runai-setup/self-hosted/k8s/additional-clusters.md b/docs/admin/runai-setup/self-hosted/k8s/additional-clusters.md index 36c11f465b..79b667cfb5 100644 --- a/docs/admin/runai-setup/self-hosted/k8s/additional-clusters.md +++ b/docs/admin/runai-setup/self-hosted/k8s/additional-clusters.md @@ -2,8 +2,6 @@ The first Run:ai cluster is typically installed on the same Kubernetes cluster as the Run:ai control plane. Run:ai supports multiple clusters per single control plane. This document is about installing additional clusters on __different Kubernetes clusters__. -The instructions are for Run:ai version 2.8 and up. - ## Installation diff --git a/docs/developer/cluster-api/submit-yaml.md b/docs/developer/cluster-api/submit-yaml.md index f88735a222..1b9c8664ff 100644 --- a/docs/developer/cluster-api/submit-yaml.md +++ b/docs/developer/cluster-api/submit-yaml.md @@ -106,12 +106,13 @@ spec: container: 8000 ``` -1. Possible metrics can be `cpu-utilization`, `latency`, `throughput`, `concurrency`, `gpu-utilization`, `custom`. Different metrics may require additional [installations](../../admin/runai-setup/cluster-setup/cluster-prerequisites.md#inference) at the cluster level. +1. Possible metrics can be `cpu-utilization`, `latency`, `throughput`, `concurrency`, `gpu-utilization`, `custom`. Different metrics may require additional [installations](../../admin/runai-setup/cluster-setup/cluster-prerequisites.md#inference) at the cluster level. 2. Inference requires a port to receive requests. ## Suspend/Resume Interactive/Training Workload -to suspend trainig +To suspend training: + ```YAML apiVersion: run.ai/v2alpha1 kind: TrainingWorkload # @@ -128,9 +129,8 @@ spec: name: value: job-1 # ``` -In order to suspend workload set `active` value to `false` -To reume it back either set `active` value to `true` or remove it entirly. - +In order to suspend the workload, set `active` to `false`. +To resume the workload, either set `active` to `true` or remove it entirely. ## See Also * To understand how to connect to the inference workload, see [Inference Quickstart](../../Researcher/Walkthroughs/quickstart-inference.md). diff --git a/docs/home/whats-new-2-13.md b/docs/home/whats-new-2-13.md index 1d89e9aea2..5b5731deb1 100644 --- a/docs/home/whats-new-2-13.md +++ b/docs/home/whats-new-2-13.md @@ -20,6 +20,7 @@ July 2023 | RUN-11120 | Fixed an issue where the *Projects* table does not show correct metrics when Run:ai version 2.13 is paired with a Run:ai 2.8 cluster. | | RUN-11121 | Fixed an issue where the wrong over quota memory alert is shown in the *Quota management* pane in project edit form. | | RUN-11272 | Fixed an issue in OpenShift environments where the selection in the cluster drop down in the main UI does not match the cluster selected on the login page. | + ## Version 2.13.4 ### Release date diff --git a/docs/home/whats-new-2-14.md b/docs/home/whats-new-2-14.md index 39c22ddb78..f08edd6ea3 100644 --- a/docs/home/whats-new-2-14.md +++ b/docs/home/whats-new-2-14.md @@ -8,6 +8,27 @@ August 2023 #### Release content +This version contains features and fixes from previous versions starting with 2.9.
For information about features, functionality, and fixed issues in previous versions see: + +* [What's new 2.13](whats-new-2-13.md) +* [What's new 2.12](whats-new-2-12.md) +* [What's new 2.10](whats-new-2-10.md) + + +##### Role based access control + +Starting in this version, Run:ai has updated the authorization system to Role Based Access Control (RBAC). RBAC is a policy-neutral access control mechanism defined around roles and privileges. For more information, see [Role based access control](../admin/runai-setup/access-control/rbac.md#role-based-access-control). + +When upgrading the system, previous access and authorizations that were configured will be migrated to the new RBAC roles. See the table below for role conversions: + +| Previous user type | RBAC role | +| -- | -- | +| Admin | [System admin](../admin/runai-setup/access-control/rbac.md#roles) | +| next one | [next one](../admin/runai-setup/access-control/rbac.md#roles) | + + + + #### Fixed issues | Internal ID | Description | diff --git a/docs/images/scope-icon.svg b/docs/images/scope-icon.svg new file mode 100644 index 0000000000..9a4c79019a --- /dev/null +++ b/docs/images/scope-icon.svg @@ -0,0 +1,3 @@ + + + diff --git a/mkdocs.yml b/mkdocs.yml index 2135477dbd..7be770db30 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -2,7 +2,7 @@ site_name: Run:ai Documentation Library site_url: https://docs.run.ai/ copyright: Copyright © 2020 - 2023 Run:ai repo_url: https://github.com/run-ai/docs/ -edit_uri: edit/v2.13/docs/ +edit_uri: edit/v2.14/docs/ docs_dir: docs theme: name: material @@ -12,11 +12,15 @@ theme: logo: images/RUNAI-LOGO-DIGITAL-2C_WP.svg features: - navigation.tabs + - navigation.tabs.sticky - search.highlight - content.code.annotate - content.tabs.link - search.suggest - content.action.edit + - navigation.top + - toc.follow + # - toc.integrate icon: edit: material/pencil view: material/eye @@ -26,13 +30,18 @@ extra_css: # strict: true markdown_extensions: + - footnotes # - markdown_include.include - pymdownx.highlight: anchor_linenums: true - pymdownx.inlinehilite - pymdownx.snippets: base_path: docs/docs - - pymdownx.superfences + - pymdownx.superfences: + custom_fences: + - name: mermaid + class: mermaid + format: !!python/name:pymdownx.superfences.fence_code_format - pymdownx.tabbed: alternate_style: true - pymdownx.details @@ -140,6 +149,7 @@ nav: - 'Overview': 'index.md' - 'System Components' : 'home/components.md' - 'Whats New' : + - 'Version 2.14' : 'home/whats-new-2-14.md' - 'Version 2.13' : 'home/whats-new-2-13.md' - 'Version 2.12' : 'home/whats-new-2-12.md' - 'Version 2.10' : 'home/whats-new-2-10.md' @@ -195,8 +205,9 @@ nav: - 'Install Administrator CLI' : 'admin/runai-setup/config/cli-admin-install.md' - 'Disaster Recovery' : 'admin/runai-setup/config/dr.md' - 'Node Affinity with Cloud Node Pools' : 'admin/runai-setup/config/node-affinity-with-cloud-node-pools.md' - - 'Authentication' : + - 'Authentication and Access Control' : - 'Overview' : 'admin/runai-setup/authentication/authentication-overview.md' + - 'Access control' : 'admin/runai-setup/access-control/rbac.md' - 'Researcher Authentication' : 'admin/runai-setup/authentication/researcher-authentication.md' - 'Single Sign-On' : 'admin/runai-setup/authentication/sso.md' - 'Maintenance' :
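The `custom_fences` entry added to `pymdownx.superfences` above registers `mermaid` as a fenced block type, so documentation pages can embed rendered diagrams directly in Markdown. A minimal usage sketch (the diagram content is illustrative only, not part of the configuration):

```` markdown
``` mermaid
graph LR
  subject[User or Application] -- Access rule --> role[Role]
  role --> scope[Scope]
```
````

With the Material theme used by this `mkdocs.yml`, such blocks are rendered client-side by Mermaid.js; no additional plugin is required beyond the superfences configuration shown in the diff.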