feat: parametrize GPUS_PER_NODE and CPUS_PER_WORKER in ray.sub #410

Merged: 6 commits, May 23, 2025
109 changes: 100 additions & 9 deletions docs/cluster.md
@@ -2,7 +2,7 @@

This guide explains how to run NeMo RL with Ray on Slurm or Kubernetes.

## Slurm (Batched and Interactive)
## Use Slurm for Batched and Interactive Jobs

The following sections explain how to use Slurm for batched job submissions and for running jobs interactively.

@@ -25,14 +25,15 @@ sbatch \
ray.sub
```

Notes:
* Some clusters may or may not need `--gres=gpu:8` to be added to the `sbatch` command.
```{tip}
Depending on your Slurm cluster configuration, you may or may not need to include the `--gres=gpu:8` option in the `sbatch` command.
```

Which will print the `SLURM_JOB_ID`:
Upon successful submission, Slurm will print the `SLURM_JOB_ID`:
```text
Submitted batch job 1980204
```
Make note of the the job submission number. Once the job begins, you can track its process in the driver logs which you can `tail`:
Make a note of the job submission number. Once the job begins, you can track its progress in the driver logs, which you can `tail`:
```sh
tail -f 1980204-logs/ray-driver.log
```
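
The same job-specific log directory also collects the per-component logs that `ray.sub` writes for the Ray head and worker nodes (see the `ray-head.log` and `ray-worker-*.log` output paths in the script further below). A small sketch, assuming the log directory follows the same `<SLURM_JOB_ID>-logs` naming as the driver log above:
```sh
# Assumed paths, mirroring the ray-driver.log location shown above.
tail -f 1980204-logs/ray-head.log      # Ray head node log
tail -f 1980204-logs/ray-worker-0.log  # first Ray worker node log
```
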
@@ -59,12 +60,11 @@ sbatch \
--gres=gpu:8 \
ray.sub
```
Which will print the `SLURM_JOB_ID`:
Upon successful submission, Slurm will print the `SLURM_JOB_ID`:
```text
Submitted batch job 1980204
```
Once the Ray cluster is up, a script should be created to attach to the Ray head node,
which you can use to launch experiments.
Once the Ray cluster is up, a script will be created to attach to the Ray head node. Run this script to launch experiments:
```sh
bash 1980204-attach.sh
```
@@ -90,7 +90,98 @@ There are several choices for `UV_CACHE_DIR` when using `ray.sub`:
don't want to persist the cache, you can use (2), which is just as performant as (1) if the `uv.lock` is
covered by warmed cache.

### Slurm Environment Variables

All Slurm environment variables described below can be added to the `sbatch`
invocation of `ray.sub`. For example, `GPUS_PER_NODE=8` can be specified as follows:

```sh
GPUS_PER_NODE=8 \
... \
sbatch \
    ... \
    ray.sub
```
#### Common Environment Configuration
``````{list-table}
:header-rows: 1

* - Environment Variable
- Explanation
* - `CONTAINER`
- (Required) Specifies the container image to be used for the Ray cluster.
Use either a Docker image from a registry or a squashfs file (if using enroot/pyxis).
* - `MOUNTS`
- (Required) Defines paths to mount into the container. Examples:
```md
* `MOUNTS="$PWD:$PWD"` (mount in current working directory (CWD))
* `MOUNTS="$PWD:$PWD,/nfs:/nfs:ro"` (mounts the current working directory and `/nfs`, with `/nfs` mounted as read-only)
```
* - `COMMAND`
- Command to execute after the Ray cluster starts. If empty, the cluster idles and enters interactive mode (see the [Slurm interactive instructions](#interactive-launching)).
* - `HF_HOME`
- Sets the cache directory for huggingface-hub assets (e.g., models/tokenizers).
* - `WANDB_API_KEY`
- Setting this allows you to use the wandb logger without having to run `wandb login`.
* - `HF_TOKEN`
- Sets the token used by huggingface-hub. Avoids having to run `huggingface-cli login`.
* - `HF_DATASETS_CACHE`
- Sets the cache directory for downloaded Hugging Face datasets.
``````
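
Putting the common variables together, a batched submission could look like the sketch below. Every value is an illustrative placeholder (the container image, mount paths, cache location, token values, and the `COMMAND` to run are not taken from this repository), so substitute ones that match your cluster:

```sh
# Sketch only: all values below are placeholders.
CONTAINER=nvcr.io/your-org/your-image:tag \
MOUNTS="$PWD:$PWD,/nfs:/nfs:ro" \
HF_HOME=/nfs/hf_cache \
HF_TOKEN=your_hf_token \
WANDB_API_KEY=your_wandb_key \
COMMAND="uv run python your_training_script.py" \
sbatch \
    --nodes=2 \
    --account=YOUR_ACCOUNT \
    --partition=YOUR_PARTITION \
    --gres=gpu:8 \
    ray.sub
```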

:::{tip}
When `HF_TOKEN`, `WANDB_API_KEY`, `HF_HOME`, and `HF_DATASETS_CACHE` are set in your shell environment using `export`, they are automatically passed to `ray.sub`. For instance, if you set:

```sh
export HF_TOKEN=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
```
this token will be available to your NeMo RL run. Consider adding these exports to your shell configuration file, such as `~/.bashrc`.
:::
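
For example, the corresponding entries in `~/.bashrc` might look like this (all values are placeholders):

```sh
# Placeholder values; replace with your own credentials and paths.
export HF_TOKEN=your_hf_token
export WANDB_API_KEY=your_wandb_key
export HF_HOME=/path/to/hf_cache
export HF_DATASETS_CACHE=/path/to/hf_datasets_cache
```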

#### Advanced Environment Configuration
``````{list-table}
:header-rows: 1

* - Environment Variable
(and default)
- Explanation
* - `CPUS_PER_WORKER=128`
- Number of CPUs each Ray worker node claims. Defaults to `16 * GPUS_PER_NODE` (128 when `GPUS_PER_NODE=8`).
* - `GPUS_PER_NODE=8`
- Number of GPUs each Ray worker node claims. To determine this, run `nvidia-smi` on a worker node.
* - `BASE_LOG_DIR=$SLURM_SUBMIT_DIR`
- Base directory for storing Ray logs. Defaults to the Slurm submission directory ([SLURM_SUBMIT_DIR](https://slurm.schedmd.com/sbatch.html#OPT_SLURM_SUBMIT_DIR)).
* - `NODE_MANAGER_PORT=53001`
- Port for the Ray node manager on worker nodes.
* - `OBJECT_MANAGER_PORT=53003`
- Port for the Ray object manager on worker nodes.
* - `RUNTIME_ENV_AGENT_PORT=53005`
- Port for the Ray runtime environment agent on worker nodes.
* - `DASHBOARD_AGENT_GRPC_PORT=53007`
- gRPC port for the Ray dashboard agent on worker nodes.
* - `METRICS_EXPORT_PORT=53009`
- Port for exporting metrics from worker nodes.
* - `PORT=6379`
- Main port for the Ray head node.
* - `RAY_CLIENT_SERVER_PORT=10001`
- Port for the Ray client server on the head node.
* - `DASHBOARD_GRPC_PORT=52367`
- gRPC port for the Ray dashboard on the head node.
* - `DASHBOARD_PORT=8265`
- Port for the Ray dashboard UI on the head node. This is also the port
used by the Ray distributed debugger.
* - `DASHBOARD_AGENT_LISTEN_PORT=52365`
- Listening port for the dashboard agent on the head node.
* - `MIN_WORKER_PORT=54001`
- Minimum port in the range for Ray worker processes.
* - `MAX_WORKER_PORT=54257`
- Maximum port in the range for Ray worker processes.
``````

:::{note}
In most cases, you will not need to change these ports unless they are already in use by another service running on your cluster.
:::
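
For instance, if the default dashboard port is already occupied on your head node, or if you want each worker to claim fewer CPUs, the overrides can be combined in the same prefix style as above. This is a sketch; the values are illustrative:

```sh
# Illustrative overrides; combine with CONTAINER/MOUNTS and your usual sbatch options.
GPUS_PER_NODE=8 \
CPUS_PER_WORKER=64 \
DASHBOARD_PORT=8270 \
... \
sbatch \
    ... \
    ray.sub
```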

## Kubernetes

TBD
12 changes: 7 additions & 5 deletions ray.sub
@@ -61,8 +61,10 @@ COMMON_SRUN_ARGS+=" -p $SLURM_JOB_PARTITION"
COMMON_SRUN_ARGS+=" -A $SLURM_JOB_ACCOUNT"
COMMON_SRUN_ARGS+=" --gres=gpu:8"

# Number of GPUs per node
gpus_per_node=8
# Number of GPUs per worker node
GPUS_PER_NODE=${GPUS_PER_NODE:-8}
# Number of CPUs per worker node
CPUS_PER_WORKER=${CPUS_PER_WORKER:-$((GPUS_PER_NODE * 16))}

num_retries=3

@@ -148,7 +150,7 @@ EOF
)
srun $COMMON_SRUN_ARGS --container-name=ray-head --nodes=1 --ntasks=1 -w "$head_node" -o $LOG_DIR/ray-head.log bash -x -c "$head_cmd" &

NUM_ACTORS=$((gpus_per_node * SLURM_JOB_NUM_NODES))
NUM_ACTORS=$((GPUS_PER_NODE * SLURM_JOB_NUM_NODES))

# Start Ray worker nodes
# We want 1 Ray worker node per physical node
@@ -183,7 +185,7 @@ monitor-sidecar &
cat <<EOFINNER | tee /launch-worker.sh
ray start --address "$ip_head" \
--disable-usage-stats \
--resources="{\"worker_units\": $gpus_per_node, \"slurm_managed_ray_cluster\": 1}" \
--resources="{\"worker_units\": $GPUS_PER_NODE, \"slurm_managed_ray_cluster\": 1}" \
--min-worker-port=${MIN_WORKER_PORT} \
--max-worker-port=${MAX_WORKER_PORT} \
\
@@ -211,7 +213,7 @@ EOF
if [[ $i -eq 0 ]]; then
OVERLAP_HEAD_AND_WORKER_ARG="--overlap"
fi
srun $COMMON_SRUN_ARGS ${OVERLAP_HEAD_AND_WORKER_ARG:-} --container-name=ray-worker-$i --exact --nodes=1 --ntasks=1 --cpus-per-task=$((16 * gpus_per_node)) -w "$node_i" -o $LOG_DIR/ray-worker-$i.log bash -x -c "$worker_cmd" &
srun $COMMON_SRUN_ARGS ${OVERLAP_HEAD_AND_WORKER_ARG:-} --container-name=ray-worker-$i --exact --nodes=1 --ntasks=1 --cpus-per-task=$CPUS_PER_WORKER -w "$node_i" -o $LOG_DIR/ray-worker-$i.log bash -x -c "$worker_cmd" &
sleep 3
done
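
For reference, the `${VAR:-default}` expansions introduced above mean the new variables fall back to their defaults when nothing is set at submission time. A minimal sketch of that behavior (not part of `ray.sub`):

```sh
# Default expansion: with nothing set, GPUS_PER_NODE -> 8 and CPUS_PER_WORKER -> 128.
unset GPUS_PER_NODE CPUS_PER_WORKER
GPUS_PER_NODE=${GPUS_PER_NODE:-8}
CPUS_PER_WORKER=${CPUS_PER_WORKER:-$((GPUS_PER_NODE * 16))}
echo "$GPUS_PER_NODE $CPUS_PER_WORKER"   # prints: 8 128

# Values supplied in the environment take precedence, e.g. on a 4-GPU node:
GPUS_PER_NODE=4 CPUS_PER_WORKER=32 bash -c \
  'echo "${GPUS_PER_NODE:-8} ${CPUS_PER_WORKER:-$((GPUS_PER_NODE * 16))}"'   # prints: 4 32
```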
