Merge pull request #1118 from run-ai/node-maint
node-maint-fixes
yarongol committed Sep 22, 2024
2 parents 8df082a + f2fc088 commit f2785d2
Showing 2 changed files with 13 additions and 17 deletions.
28 changes: 12 additions & 16 deletions docs/admin/maintenance/node-downtime.md
@@ -4,39 +4,35 @@ This article provides detailed instructions on how to manage both planned and un
## Prerequisites

* __Access to Kubernetes cluster__
You must have administrative access to the Kubernetes cluster, including permissions to run `kubectl` commands
Administrative access to the Kubernetes cluster, including permissions to run `kubectl` commands
* __Basic knowledge of Kubernetes__
Familiarity with Kubernetes concepts such as nodes, taints, and workloads
* __Run:ai installation__
The Run:ai software installed and configured within your Kubernetes cluster
* __Disaster recovery plan__
For Self-hosted installations, ensure that a disaster recovery plan is in place, particularly for preserving the Run:ai data. For more details, see [backup & restore](https://portal.document360.io/saas/docs/backup-restore-1).
The Run:ai software installed and configured within your Kubernetes cluster
* __Node naming conventions__
Know the names of the nodes within your cluster, as these are required when executing the commands (see the quick check below)
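
A quick way to confirm the first and last prerequisites is sketched below (assuming a standard `kubectl` setup; this check is illustrative and not part of the maintenance procedure itself):

```bash
# Verify that your credentials can read node objects (a rough proxy for the required admin access)
kubectl auth can-i get nodes

# List the node names you will need when running the maintenance commands
kubectl get nodes -o name
```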

## Node types

This article distinguishes between two types of nodes within a Run:ai installation:

* Worker nodes
Nodes are where workloads are executed
* Run:ai system nodes
Nodes on which the Run:ai software runs, managing the cluster's operations
* Worker nodes. Nodes on which AI practitioners can submit and run workloads
* Run:ai system nodes. Nodes on which the Run:ai software runs, managing the cluster's operations

### Worker nodes

Worker nodes are responsible for running workloads. When a worker node goes down, whether due to planned maintenance or unexpected failure, workloads ideally migrate to other available nodes or wait in the queue until they can be executed.
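
For illustration, a minimal sketch of how to inspect an affected node and the workloads scheduled on it (assuming standard `kubectl` access; `<node-name>` is a placeholder):

```bash
# Inspect the node's conditions (Ready, MemoryPressure, and so on) to confirm its state
kubectl describe node <node-name>

# List the pods that were scheduled on the affected node
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node-name>
```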

#### Training vs. Interactive workloads

Run:ai distinguishes between two types of workloads:
The following workload types can run on worker nodes:

* __Training workloads__
These are long-running processes that, in case of node downtime, can automatically move to another node.
* __Interactive workloads__
These are short-lived, interactive processes that require manual intervention to be relocated to another node.
* __Training workloads__. These are long-running processes that, in case of node downtime, can automatically move to another node.

While training workloads can be automatically migrated, it is recommended to manually manage this process for faster response, as it may take time for Kubernetes to detect a node failure.
* __Interactive workloads__. These are short-lived, interactive processes that require manual intervention to be relocated to another node.

!!! Note
    While training workloads can be automatically migrated, it is recommended to plan maintenance and manually manage this process for a faster response, as it may take time for Kubernetes to detect a node failure.
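
For example, manually moving workloads off a node ahead of maintenance typically means cordoning and draining it. The following is a hedged sketch, not a substitute for the procedure described in this article; `<node-name>` is a placeholder and the exact drain flags may vary by cluster:

```bash
# Mark the node as unschedulable so no new workloads are placed on it
kubectl cordon <node-name>

# Evict the workloads currently running on the node
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
```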

#### Planned maintenance

@@ -117,9 +113,9 @@ In the event of unplanned downtime:

### Run:ai System nodes

In a production environment, the Run:ai software operates on one or more Run:ai System Nodes. It is recommended to have more than one system node to ensure high availability. For more information, see [high availability](../config/ha.md). If one system node goes down, another can take over, maintaining continuity. If a second System Node does not exist, you must designate another node in the cluster as a temporary Run:ai System Node to maintain operations.
In a production environment, the services responsible for scheduling, submitting and managing Run:ai workloads operate on one or more Run:ai system nodes. It is recommended to have more than one system node to ensure [high availability](../config/ha.md). If one system node goes down, another can take over, maintaining continuity. If a second system node does not exist, you must designate another node in the cluster as a temporary Run:ai system node to maintain operations.
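
As an illustrative sketch only: designating a temporary Run:ai system node is typically a matter of labeling the replacement node with the Run:ai system-node role. The label below is an assumption based on common Run:ai setups and is not confirmed by this article; verify the exact label against your installation's node-roles documentation:

```bash
# Assumed label; confirm against your Run:ai node-roles documentation before applying
kubectl label node <node-name> node-role.kubernetes.io/runai-system=true
```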

The protocols for handling planned maintenance and unplanned downtime are identical to those for Worker Nodes. Refer to the above section for detailed instructions.
The protocols for handling planned maintenance and unplanned downtime are identical to those for worker nodes. Refer to the above section for detailed instructions.

## Rejoining a node into the Kubernetes cluster
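
As a minimal, hedged sketch (assuming the node was previously cordoned or drained; the full procedure is described in this section of the documentation), returning a node to service usually ends with uncordoning it and confirming it is Ready:

```bash
# Allow the scheduler to place workloads on the node again
kubectl uncordon <node-name>

# Confirm the node reports a Ready status
kubectl get nodes <node-name>
```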

2 changes: 1 addition & 1 deletion mkdocs.yml
@@ -233,7 +233,7 @@ nav:
- 'Setup cluster wide PVC' : 'admin/config/cluster-wide-pvc.md'
- 'Group Nodes' : 'admin/config/limit-to-node-group.md'
- 'Maintenance' :
- 'Node Downtime' : 'admin/maintenance/node-downtime.md'
- 'Node Maintenance' : 'admin/maintenance/node-downtime.md'
- 'System Monitoring' : 'admin/maintenance/alert-monitoring.md'
- 'Audit Log' : 'admin/maintenance/audit-log.md'
- 'Authentication & Authorization' :
