feat: parametrize GPUS_PER_NODE and CPUS_PER_WORKER in ray.sub #410

Merged: 6 commits, May 23, 2025
109 changes: 100 additions & 9 deletions docs/cluster.md
@@ -2,7 +2,7 @@

This guide explains how to run NeMo RL with Ray on Slurm or Kubernetes.

## Slurm (Batched and Interactive)
## Use Slurm for Batched and Interactive Jobs

The following sections explain how to use Slurm for batched job submissions and for running jobs interactively.

@@ -25,14 +25,15 @@ sbatch \
ray.sub
```

Notes:
* Some clusters may or may not need `--gres=gpu:8` to be added to the `sbatch` command.
```{tip}
Depending on your Slurm cluster configuration, you may or may not need to include the `--gres=gpu:8` option in the `sbatch` command.
```

Which will print the `SLURM_JOB_ID`:
Upon successful submission, Slurm will print the `SLURM_JOB_ID`:
```text
Submitted batch job 1980204
```
Make note of the the job submission number. Once the job begins, you can track its process in the driver logs which you can `tail`:
Make a note of the job submission number. Once the job begins, you can track its progress in the driver logs, which you can `tail`:
```sh
tail -f 1980204-logs/ray-driver.log
```
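
The same job-specific log directory also collects the per-component logs that `ray.sub` writes for the Ray head and worker nodes (see the `ray-head.log` and `ray-worker-*.log` output paths in the script further below). A small sketch, assuming the log directory follows the same `<SLURM_JOB_ID>-logs` naming as the driver log above:
```sh
# Assumed paths, mirroring the ray-driver.log location shown above.
tail -f 1980204-logs/ray-head.log      # Ray head node log
tail -f 1980204-logs/ray-worker-0.log  # first Ray worker node log
```
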
@@ -59,12 +60,11 @@ sbatch \
--gres=gpu:8 \
ray.sub
```
Which will print the `SLURM_JOB_ID`:
Upon successful submission, Slurm will print the `SLURM_JOB_ID`:
```text
Submitted batch job 1980204
```
Once the Ray cluster is up, a script should be created to attach to the Ray head node,
which you can use to launch experiments.
Once the Ray cluster is up, a script will be created to attach to the Ray head node. Run this script to launch experiments:
```sh
bash 1980204-attach.sh
```
@@ -90,7 +90,98 @@ There are several choices for `UV_CACHE_DIR` when using `ray.sub`:
don't want to persist the cache, you can use (2), which is just as performant as (1) if the `uv.lock` is
covered by warmed cache.

### Slurm Environment Variables

All Slurm environment variables described below can be added to the `sbatch`
invocation of `ray.sub`. For example, `GPUS_PER_NODE=8` can be specified as follows:

```sh
GPUS_PER_NODE=8 \
... \
sbatch \
    ... \
    ray.sub
```
#### Common Environment Configuration
``````{list-table}
:header-rows: 1

* - Environment Variable
- Explanation
* - `CONTAINER`
- (Required) Specifies the container image to be used for the Ray cluster.
Use either a Docker image from a registry or a squashfs file (if using enroot/pyxis).
* - `MOUNTS`
- (Required) Defines paths to mount into the container. Examples:
```md
* `MOUNTS="$PWD:$PWD"` (mount in current working directory (CWD))
* `MOUNTS="$PWD:$PWD,/nfs:/nfs:ro"` (mounts the current working directory and `/nfs`, with `/nfs` mounted as read-only)
```
* - `COMMAND`
- Command to execute after the Ray cluster starts. If empty, the cluster idles and enters interactive mode (see the [Slurm interactive instructions](#interactive-launching)).
* - `HF_HOME`
- Sets the cache directory for huggingface-hub assets (e.g., models/tokenizers).
* - `WANDB_API_KEY`
- Setting this allows you to use the wandb logger without having to run `wandb login`.
* - `HF_TOKEN`
- Sets the token used by huggingface-hub. Avoids having to run `huggingface-cli login`.
* - `HF_DATASETS_CACHE`
- Sets the cache directory for downloaded Hugging Face datasets.
``````
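
Putting the common variables together, a batched submission could look like the sketch below. Every value is an illustrative placeholder (the container image, mount paths, cache location, token values, and the `COMMAND` to run are not taken from this repository), so substitute ones that match your cluster:

```sh
# Sketch only: all values below are placeholders.
CONTAINER=nvcr.io/your-org/your-image:tag \
MOUNTS="$PWD:$PWD,/nfs:/nfs:ro" \
HF_HOME=/nfs/hf_cache \
HF_TOKEN=your_hf_token \
WANDB_API_KEY=your_wandb_key \
COMMAND="uv run python your_training_script.py" \
sbatch \
    --nodes=2 \
    --account=YOUR_ACCOUNT \
    --partition=YOUR_PARTITION \
    --gres=gpu:8 \
    ray.sub
```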

:::{tip}
When `HF_TOKEN`, `WANDB_API_KEY`, `HF_HOME`, and `HF_DATASETS_CACHE` are set in your shell environment using `export`, they are automatically passed to `ray.sub`. For instance, if you set:

```sh
export HF_TOKEN=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
```
this token will be available to your NeMo RL run. Consider adding these exports to your shell configuration file, such as `~/.bashrc`.
:::
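
For example, the corresponding entries in `~/.bashrc` might look like this (all values are placeholders):

```sh
# Placeholder values; replace with your own credentials and paths.
export HF_TOKEN=your_hf_token
export WANDB_API_KEY=your_wandb_key
export HF_HOME=/path/to/hf_cache
export HF_DATASETS_CACHE=/path/to/hf_datasets_cache
```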

#### Advanced Environment Configuration
``````{list-table}
:header-rows: 1

* - Environment Variable
(and default)
- Explanation
* - `CPUS_PER_WORKER=128`
- Number of CPUs each Ray worker node claims. Defaults to `16 * GPUS_PER_NODE` (128 when `GPUS_PER_NODE=8`).
* - `GPUS_PER_NODE=8`
- Number of GPUs each Ray worker node claims. To determine this, run `nvidia-smi` on a worker node.
* - `BASE_LOG_DIR=$SLURM_SUBMIT_DIR`
- Base directory for storing Ray logs. Defaults to the Slurm submission directory ([SLURM_SUBMIT_DIR](https://slurm.schedmd.com/sbatch.html#OPT_SLURM_SUBMIT_DIR)).
* - `NODE_MANAGER_PORT=53001`
- Port for the Ray node manager on worker nodes.
* - `OBJECT_MANAGER_PORT=53003`
- Port for the Ray object manager on worker nodes.
* - `RUNTIME_ENV_AGENT_PORT=53005`
- Port for the Ray runtime environment agent on worker nodes.
* - `DASHBOARD_AGENT_GRPC_PORT=53007`
- gRPC port for the Ray dashboard agent on worker nodes.
* - `METRICS_EXPORT_PORT=53009`
- Port for exporting metrics from worker nodes.
* - `PORT=6379`
- Main port for the Ray head node.
* - `RAY_CLIENT_SERVER_PORT=10001`
- Port for the Ray client server on the head node.
* - `DASHBOARD_GRPC_PORT=52367`
- gRPC port for the Ray dashboard on the head node.
* - `DASHBOARD_PORT=8265`
- Port for the Ray dashboard UI on the head node. This is also the port
used by the Ray distributed debugger.
* - `DASHBOARD_AGENT_LISTEN_PORT=52365`
- Listening port for the dashboard agent on the head node.
* - `MIN_WORKER_PORT=54001`
- Minimum port in the range for Ray worker processes.
* - `MAX_WORKER_PORT=54257`
- Maximum port in the range for Ray worker processes.
``````

:::{note}
In most cases, you will not need to change these ports unless they are already in use by another service running on your cluster.
:::
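
For instance, if the default dashboard port is already occupied on your head node, or if you want each worker to claim fewer CPUs, the overrides can be combined in the same prefix style as above. This is a sketch; the values are illustrative:

```sh
# Illustrative overrides; combine with CONTAINER/MOUNTS and your usual sbatch options.
GPUS_PER_NODE=8 \
CPUS_PER_WORKER=64 \
DASHBOARD_PORT=8270 \
... \
sbatch \
    ... \
    ray.sub
```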

## Kubernetes

TBD
12 changes: 7 additions & 5 deletions ray.sub
@@ -61,8 +61,10 @@ COMMON_SRUN_ARGS+=" -p $SLURM_JOB_PARTITION"
COMMON_SRUN_ARGS+=" -A $SLURM_JOB_ACCOUNT"
COMMON_SRUN_ARGS+=" --gres=gpu:8"

# Number of GPUs per node
gpus_per_node=8
# Number of GPUs per worker node
GPUS_PER_NODE=${GPUS_PER_NODE:-8}
# Number of CPUs per worker node
CPUS_PER_WORKER=${CPUS_PER_WORKER:-$((GPUS_PER_NODE * 16))}

num_retries=3

@@ -148,7 +150,7 @@ EOF
)
srun $COMMON_SRUN_ARGS --container-name=ray-head --nodes=1 --ntasks=1 -w "$head_node" -o $LOG_DIR/ray-head.log bash -x -c "$head_cmd" &

NUM_ACTORS=$((gpus_per_node * SLURM_JOB_NUM_NODES))
NUM_ACTORS=$((GPUS_PER_NODE * SLURM_JOB_NUM_NODES))

# Start Ray worker nodes
# We want 1 Ray worker node per physical node
@@ -183,7 +185,7 @@ monitor-sidecar &
cat <<EOFINNER | tee /launch-worker.sh
ray start --address "$ip_head" \
--disable-usage-stats \
--resources="{\"worker_units\": $gpus_per_node, \"slurm_managed_ray_cluster\": 1}" \
--resources="{\"worker_units\": $GPUS_PER_NODE, \"slurm_managed_ray_cluster\": 1}" \
--min-worker-port=${MIN_WORKER_PORT} \
--max-worker-port=${MAX_WORKER_PORT} \
\
@@ -211,7 +213,7 @@ EOF
if [[ $i -eq 0 ]]; then
OVERLAP_HEAD_AND_WORKER_ARG="--overlap"
fi
srun $COMMON_SRUN_ARGS ${OVERLAP_HEAD_AND_WORKER_ARG:-} --container-name=ray-worker-$i --exact --nodes=1 --ntasks=1 --cpus-per-task=$((16 * gpus_per_node)) -w "$node_i" -o $LOG_DIR/ray-worker-$i.log bash -x -c "$worker_cmd" &
srun $COMMON_SRUN_ARGS ${OVERLAP_HEAD_AND_WORKER_ARG:-} --container-name=ray-worker-$i --exact --nodes=1 --ntasks=1 --cpus-per-task=$CPUS_PER_WORKER -w "$node_i" -o $LOG_DIR/ray-worker-$i.log bash -x -c "$worker_cmd" &
sleep 3
done
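
For reference, the `${VAR:-default}` expansions introduced above mean the new variables fall back to their defaults when nothing is set at submission time. A minimal sketch of that behavior (not part of `ray.sub`):

```sh
# Default expansion: with nothing set, GPUS_PER_NODE -> 8 and CPUS_PER_WORKER -> 128.
unset GPUS_PER_NODE CPUS_PER_WORKER
GPUS_PER_NODE=${GPUS_PER_NODE:-8}
CPUS_PER_WORKER=${CPUS_PER_WORKER:-$((GPUS_PER_NODE * 16))}
echo "$GPUS_PER_NODE $CPUS_PER_WORKER"   # prints: 8 128

# Values supplied in the environment take precedence, e.g. on a 4-GPU node:
GPUS_PER_NODE=4 CPUS_PER_WORKER=32 bash -c \
  'echo "${GPUS_PER_NODE:-8} ${CPUS_PER_WORKER:-$((GPUS_PER_NODE * 16))}"'   # prints: 4 32
```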
