`docs/cluster.md`: 99 additions & 14 deletions
```sh
sbatch \
  ...
  ray.sub
```

```{tip}
Some Slurm clusters require `--gres=gpu:8` to be added to the `sbatch` command, while others do not.
```
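On clusters that do require it, the flag can be passed in the same `sbatch` invocation shown above; this is only a sketch, with all other arguments elided:

```sh
# Only needed on clusters that require an explicit GPU request; other arguments elided.
sbatch \
  --gres=gpu:8 \
  ... \
  ray.sub
```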
The `sbatch` command prints the `SLURM_JOB_ID`:

```text
...
```
Now that you are on the head node, you can launch the command as follows:

```sh
uv run ./examples/run_grpo_math.py
```

### Slurm Environment Variables

All Slurm environment variables described below can be added to the `sbatch`
invocation of `ray.sub`. For example, `GPUS_PER_NODE=8` can be specified as follows:

```sh
GPUS_PER_NODE=8 \
... \
sbatch ray.sub \
...
```

#### Common Environment Configuration
``````{list-table}
:header-rows: 1

* - Environment Variable
  - Explanation
* - `CONTAINER`
  - (Required) Specifies the container image to be used for the Ray cluster.
    Use either a docker image from a registry or a squashfs (if using enroot/pyxis).
* - `MOUNTS`
  - (Required) Defines paths to mount into the container. Examples:
    ```md
    * `MOUNTS="$PWD:$PWD"` (mounts the current working directory (CWD))
    * `MOUNTS="$PWD:$PWD,/nfs:/nfs:ro"` (mounts the current working directory and `/nfs`, with `/nfs` mounted as read-only)
    ```
* - `COMMAND`
  - Command to execute after the Ray cluster starts. If empty, the cluster idles and enters interactive mode (see the [Slurm interactive instructions](#interactive-launching)).
* - `HF_HOME`
  - Sets the cache directory for huggingface-hub assets (e.g., models and tokenizers).
* - `WANDB_API_KEY`
  - Setting this allows you to use the wandb logger without having to run `wandb login`.
* - `HF_TOKEN`
  - Sets the token used by huggingface-hub and avoids having to run `huggingface-cli login`.
* - `HF_DATASETS_CACHE`
  - Sets the cache directory for downloaded Hugging Face datasets.
``````
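Putting the common variables together, a single submission might look like the following sketch. The image path, mount list, cache path, and `sbatch` flags are placeholders to replace for your cluster; the `COMMAND` reuses the `run_grpo_math.py` example from above:

```sh
# Illustrative values only; substitute your own image, mounts, cache path, and Slurm flags.
CONTAINER=/path/to/your_image.sqsh \
MOUNTS="$PWD:$PWD" \
COMMAND="uv run ./examples/run_grpo_math.py" \
HF_HOME=/path/to/shared/hf_home \
sbatch \
  --nodes=2 \
  --account=your_account \
  --partition=your_partition \
  --job-name=nemo-rl-grpo \
  ray.sub
```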
:::{tip}
When `HF_TOKEN`, `WANDB_API_KEY`, `HF_HOME`, and `HF_DATASETS_CACHE` are set in your shell environment using `export`, they are automatically passed to `ray.sub`. For instance, if you set:

```sh
export HF_TOKEN=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
```

this token will be available to your NeMo RL run. Consider adding these exports to your shell configuration file, such as `~/.bashrc`.
:::
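As a sketch, the corresponding block in `~/.bashrc` might look like the following; every value is a placeholder:

```sh
# Placeholder values; substitute your own token, API key, and shared cache paths.
export HF_TOKEN=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
export WANDB_API_KEY=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
export HF_HOME=/path/to/shared/hf_home
export HF_DATASETS_CACHE=/path/to/shared/hf_datasets_cache
```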
#### Advanced Environment Configuration

``````{list-table}
:header-rows: 1

* - Environment Variable
    (and default)
  - Explanation
* - `UV_CACHE_DIR_OVERRIDE`
  - By default, this variable does not need to be set. If unset, `ray.sub` uses the
    `UV_CACHE_DIR` defined within the container (defaulting to `/root/.cache/uv`).
    `ray.sub` intentionally avoids using the `UV_CACHE_DIR` from the user's host
    environment to prevent the host's cache from interfering with the container's cache.
    Set `UV_CACHE_DIR_OVERRIDE` if you have a customized `uv` environment (e.g.,
    with pre-downloaded packages or specific configurations) that you want to persist
    and reuse across container runs. This variable should point to a path on a shared
    filesystem accessible by all nodes (head and workers). This path will be mounted
    into the container and will override the container's default `UV_CACHE_DIR`.
    See the sketch after this table for an example.
* - `CPUS_PER_WORKER=128`
  - Number of CPUs each Ray worker node claims. Defaults to `16 * GPUS_PER_NODE` (128 with the default `GPUS_PER_NODE=8`).
* - `GPUS_PER_NODE=8`
  - Number of GPUs each Ray worker node claims. To determine this, run `nvidia-smi` on a worker node.
* - `BASE_LOG_DIR=$SLURM_SUBMIT_DIR`
  - Base directory for storing Ray logs. Defaults to the Slurm submission directory ([SLURM_SUBMIT_DIR](https://slurm.schedmd.com/sbatch.html#OPT_SLURM_SUBMIT_DIR)).
* - `NODE_MANAGER_PORT=53001`
  - Port for the Ray node manager on worker nodes.
* - `OBJECT_MANAGER_PORT=53003`
  - Port for the Ray object manager on worker nodes.
* - `RUNTIME_ENV_AGENT_PORT=53005`
  - Port for the Ray runtime environment agent on worker nodes.
* - `DASHBOARD_AGENT_GRPC_PORT=53007`
  - gRPC port for the Ray dashboard agent on worker nodes.
* - `METRICS_EXPORT_PORT=53009`
  - Port for exporting metrics from worker nodes.
* - `PORT=6379`
  - Main port for the Ray head node.
* - `RAY_CLIENT_SERVER_PORT=10001`
  - Port for the Ray client server on the head node.
* - `DASHBOARD_GRPC_PORT=52367`
  - gRPC port for the Ray dashboard on the head node.
* - `DASHBOARD_PORT=8265`
  - Port for the Ray dashboard UI on the head node. This is also the port
    used by the Ray distributed debugger.
* - `DASHBOARD_AGENT_LISTEN_PORT=52365`
  - Listening port for the dashboard agent on the head node.
* - `MIN_WORKER_PORT=54001`
  - Minimum port in the range for Ray worker processes.
* - `MAX_WORKER_PORT=54257`
  - Maximum port in the range for Ray worker processes.
``````
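As referenced in the `UV_CACHE_DIR_OVERRIDE` entry above, a submission that persists the uv cache on a shared filesystem and overrides a few other defaults might look like the following sketch. The paths are placeholders, and the remaining variables and `sbatch` arguments (elided as `...`) are the same as in the earlier examples:

```sh
# Placeholder paths; elided arguments match the earlier submission examples.
# CPUS_PER_WORKER is kept consistent with the 16 * GPUS_PER_NODE default.
UV_CACHE_DIR_OVERRIDE=/path/to/shared/uv_cache \
GPUS_PER_NODE=4 \
CPUS_PER_WORKER=64 \
BASE_LOG_DIR=/path/to/shared/ray_logs \
... \
sbatch ray.sub \
...
```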
:::{note}
For the most part, you will not need to change the ports unless they are
already taken by another service running on your cluster.
:::
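If one of the defaults does collide with another service, the port variables can be overridden in the same way as any other variable. The values below are purely illustrative:

```sh
# Illustrative values; pick free ports on your cluster. Other arguments elided.
PORT=6380 \
DASHBOARD_PORT=8266 \
MIN_WORKER_PORT=55001 \
MAX_WORKER_PORT=55257 \
... \
sbatch ray.sub \
...
```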