
Commit 4415e55

wip
Signed-off-by: Terry Kong <[email protected]>
fix it
Signed-off-by: Terry Kong <[email protected]>
make it match
Signed-off-by: Terry Kong <[email protected]>
fix it all
Signed-off-by: Terry Kong <[email protected]>
1 parent 383ed0b commit 4415e55

File tree

9 files changed: +111 −31 lines changed


docs/cluster.md

Lines changed: 99 additions & 14 deletions
@@ -25,8 +25,9 @@ sbatch \
 ray.sub
 ```
 
-Notes:
-* Some clusters may or may not need `--gres=gpu:8` to be added to the `sbatch` command.
+```{tip}
+Some Slurm clusters may require `--gres=gpu:8` to be added to the `sbatch` command.
+```
 
 Which will print the `SLURM_JOB_ID`:
 ```text
@@ -73,23 +74,107 @@ Now that you are on the head node, you can launch the command as follows:
 uv run ./examples/run_grpo_math.py
 ```
 
-### Slurm UV_CACHE_DIR
+### Slurm Environment Variables
 
-There several choices for `UV_CACHE_DIR` when using `ray.sub`:
+All Slurm environment variables described below can be added to the `sbatch`
+invocation of `ray.sub`. For example, `GPUS_PER_NODE=8` can be specified as follows:
 
-1. (default) `UV_CACHE_DIR` defaults to `$SLURM_SUBMIT_DIR/uv_cache` when not specified the shell environment, and is mounted to head and worker nodes to serve as a persistent cache between runs.
-2. Use the warm uv cache from our docker images:
-   ```sh
-   ...
-   UV_CACHE_DIR=/home/ray/.cache/uv \
-   sbatch ... \
-   ray.sub
+```sh
+GPUS_PER_NODE=8 \
+... \
+sbatch ray.sub \
+...
+```
+#### Common Environment Configuration
+``````{list-table}
+:header-rows: 1
+
+* - Environment Variable
+  - Explanation
+* - `CONTAINER`
+  - (Required) Specifies the container image to be used for the Ray cluster.
+    Use either a docker image from a registry or a squashfs (if using enroot/pyxis).
+* - `MOUNTS`
+  - (Required) Defines paths to mount into the container. Examples:
+    ```md
+    * `MOUNTS="$PWD:$PWD"` (mounts the current working directory (CWD))
+    * `MOUNTS="$PWD:$PWD,/nfs:/nfs:ro"` (mounts the current working directory and `/nfs`, with `/nfs` mounted as read-only)
     ```
+* - `COMMAND`
+  - Command to execute after the Ray cluster starts. If empty, the cluster idles and enters interactive mode (see the [Slurm interactive instructions](#interactive-launching)).
+* - `HF_HOME`
+  - Sets the cache directory for huggingface-hub assets (e.g., models/tokenizers).
+* - `WANDB_API_KEY`
+  - Setting this allows you to use the wandb logger without having to run `wandb login`.
+* - `HF_TOKEN`
+  - Sets the token used by huggingface-hub. Avoids having to run `huggingface-cli login`.
+* - `HF_DATASETS_CACHE`
+  - Sets the cache directory for downloaded Huggingface datasets.
+``````
 
-(1) is more efficient in general since the cache is not ephemeral and is persisted run to run; but for users that
-don't want to persist the cache, you can use (2), which is just as performant as (1) if the `uv.lock` is
-covered by warmed cache.
+:::{tip}
+When `HF_TOKEN`, `WANDB_API_KEY`, `HF_HOME`, and `HF_DATASETS_CACHE` are set in your shell environment using `export`, they are automatically passed to `ray.sub`. For instance, if you set:
 
+```sh
+export HF_TOKEN=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
+```
+this token will be available to your NeMo RL run. Consider adding these exports to your shell configuration file, such as `~/.bashrc`.
+:::
+
+#### Advanced Environment Configuration
+``````{list-table}
+:header-rows: 1
+
+* - Environment Variable
+    (and default)
+  - Explanation
+* - `UV_CACHE_DIR_OVERRIDE`
+  - By default, this variable does not need to be set. If unset, `ray.sub` uses the
+    `UV_CACHE_DIR` defined within the container (defaulting to `/root/.cache/uv`).
+    `ray.sub` intentionally avoids using the `UV_CACHE_DIR` from the user's host
+    environment to prevent the host's cache from interfering with the container's cache.
+    Set `UV_CACHE_DIR_OVERRIDE` if you have a customized `uv` environment (e.g.,
+    with pre-downloaded packages or specific configurations) that you want to persist
+    and reuse across container runs. This variable should point to a path on a shared
+    filesystem accessible by all nodes (head and workers). This path will be mounted
+    into the container and will override the container's default `UV_CACHE_DIR`.
+* - `CPUS_PER_WORKER=128`
+  - CPUs each Ray worker node claims. Default is `16 * GPUS_PER_NODE`.
+* - `GPUS_PER_NODE=8`
+  - Number of GPUs each Ray worker node claims. To determine this, run `nvidia-smi` on a worker node.
+* - `BASE_LOG_DIR=$SLURM_SUBMIT_DIR`
+  - Base directory for storing Ray logs. Defaults to the Slurm submission directory ([SLURM_SUBMIT_DIR](https://slurm.schedmd.com/sbatch.html#OPT_SLURM_SUBMIT_DIR)).
+* - `NODE_MANAGER_PORT=53001`
+  - Port for the Ray node manager on worker nodes.
+* - `OBJECT_MANAGER_PORT=53003`
+  - Port for the Ray object manager on worker nodes.
+* - `RUNTIME_ENV_AGENT_PORT=53005`
+  - Port for the Ray runtime environment agent on worker nodes.
+* - `DASHBOARD_AGENT_GRPC_PORT=53007`
+  - gRPC port for the Ray dashboard agent on worker nodes.
+* - `METRICS_EXPORT_PORT=53009`
+  - Port for exporting metrics from worker nodes.
+* - `PORT=6379`
+  - Main port for the Ray head node.
+* - `RAY_CLIENT_SERVER_PORT=10001`
+  - Port for the Ray client server on the head node.
+* - `DASHBOARD_GRPC_PORT=52367`
+  - gRPC port for the Ray dashboard on the head node.
+* - `DASHBOARD_PORT=8265`
+  - Port for the Ray dashboard UI on the head node. This is also the port
+    used by the Ray distributed debugger.
+* - `DASHBOARD_AGENT_LISTEN_PORT=52365`
+  - Listening port for the dashboard agent on the head node.
+* - `MIN_WORKER_PORT=54001`
+  - Minimum port in the range for Ray worker processes.
+* - `MAX_WORKER_PORT=54257`
+  - Maximum port in the range for Ray worker processes.
+``````
+
+:::{note}
+For the most part, you will not need to change ports unless these
+are already taken by another service running on your cluster.
+:::
 
 ## Kubernetes
 
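Taken together, the Slurm variables documented in the two tables above compose a single `sbatch` launch of `ray.sub`. The sketch below is illustrative only: the container image, account and partition names, and all paths are placeholder assumptions, and cluster-specific `sbatch` flags should follow the example at the top of the documented page.

```sh
# Illustrative launch of ray.sub; every value below is a placeholder.
CONTAINER=your_registry/your_image:tag \
MOUNTS="$PWD:$PWD,/nfs:/nfs:ro" \
HF_HOME=/nfs/$USER/hf_home \
HF_DATASETS_CACHE=/nfs/$USER/hf_datasets_cache \
GPUS_PER_NODE=8 \
UV_CACHE_DIR_OVERRIDE=/nfs/$USER/uv_cache \
COMMAND="uv run ./examples/run_grpo_math.py" \
sbatch \
    --nodes=2 \
    --account=YOUR_ACCOUNT \
    --partition=YOUR_PARTITION \
    --gres=gpu:8 \
    ray.sub
```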

nemo_rl/distributed/virtual_cluster.py

Lines changed: 0 additions & 3 deletions
@@ -67,9 +67,6 @@ def init_ray(log_dir: Optional[str] = None):
     If that cluster uses the same CUDA_VISIBLE_DEVICES or Slurm managed tag we will reuse it.
     Otherwise, we will detach and start a fresh local cluster.
     """
-    if "UV_CACHE_DIR" not in os.environ:
-        logging.warning("UV_CACHE_DIR is not set, using default cache dir")
-
     # Set up runtime environment
     runtime_env = {
         "env_vars": dict(os.environ),  # Pass thru all user environment variables

ray.sub

Lines changed: 12 additions & 5 deletions
@@ -54,10 +54,17 @@ MIN_WORKER_PORT=${MIN_WORKER_PORT:-54001}
 MAX_WORKER_PORT=${MAX_WORKER_PORT:-54257}
 ########################################################
 
-# Defaults to placing uv cache inside the SLURM_SUBMIT_DIR
-# This directory is mounted into the container at /home/ray/.cache/uv so it is shared between the head and worker nodes
-UV_CACHE_DIR="${UV_CACHE_DIR:-$SLURM_SUBMIT_DIR/uv_cache}"
-mkdir -p $UV_CACHE_DIR
+# Unset UV_CACHE_DIR to avoid the local cache directory interfering with the container cache
+unset UV_CACHE_DIR
+
+if [[ -n "${UV_CACHE_DIR_OVERRIDE:-}" ]]; then
+    mkdir -p "$UV_CACHE_DIR_OVERRIDE"
+    if [[ -n $MOUNTS ]]; then
+        MOUNTS+=",$UV_CACHE_DIR_OVERRIDE:/root/.cache/uv"
+    else
+        MOUNTS="$UV_CACHE_DIR_OVERRIDE:/root/.cache/uv"
+    fi
+fi
 
 # Create logs directory
 BASE_LOG_DIR=${BASE_LOG_DIR:-$SLURM_SUBMIT_DIR}
@@ -67,7 +74,7 @@ mkdir -p $LOG_DIR
 COMMON_SRUN_ARGS=""
 COMMON_SRUN_ARGS+=" --no-container-mount-home"
 COMMON_SRUN_ARGS+=" --mpi=pmix"
-COMMON_SRUN_ARGS+=" --container-mounts=$MOUNTS,$UV_CACHE_DIR:/home/ray/.cache/uv"
+COMMON_SRUN_ARGS+=" --container-mounts=$MOUNTS"
 COMMON_SRUN_ARGS+=" --container-image=$CONTAINER"
 COMMON_SRUN_ARGS+=" --container-workdir=$SLURM_SUBMIT_DIR"
 # TODO: delete these (just for debugging)
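For clarity, the following standalone sketch reproduces the mount logic added above with assumed placeholder values, to show how the final `--container-mounts` argument resolves when `UV_CACHE_DIR_OVERRIDE` is set:

```sh
# Standalone sketch of the MOUNTS resolution above; the paths are placeholders.
MOUNTS="$PWD:$PWD"
UV_CACHE_DIR_OVERRIDE=/nfs/$USER/uv_cache   # assumed shared-filesystem path

if [[ -n "${UV_CACHE_DIR_OVERRIDE:-}" ]]; then
    mkdir -p "$UV_CACHE_DIR_OVERRIDE"
    if [[ -n $MOUNTS ]]; then
        MOUNTS+=",$UV_CACHE_DIR_OVERRIDE:/root/.cache/uv"
    else
        MOUNTS="$UV_CACHE_DIR_OVERRIDE:/root/.cache/uv"
    fi
fi

echo "--container-mounts=$MOUNTS"
# Prints: --container-mounts=<cwd>:<cwd>,/nfs/<user>/uv_cache:/root/.cache/uv
# If UV_CACHE_DIR_OVERRIDE is unset, MOUNTS passes through unchanged and the
# container's built-in UV_CACHE_DIR (/root/.cache/uv) is used.
```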

tests/functional/dpo.sh

Lines changed: 0 additions & 1 deletion
@@ -12,7 +12,6 @@ EXP_DIR=$SCRIPT_DIR/$EXP_NAME
 LOG_DIR=$EXP_DIR/logs
 JSON_METRICS=$EXP_DIR/metrics.json
 RUN_LOG=$EXP_DIR/run.log
-export UV_CACHE_DIR=${UV_CACHE_DIR:-$PROJECT_ROOT/uv_cache}
 export PYTHONPATH=${PROJECT_ROOT}:${PYTHONPATH:-}
 
 rm -rf $EXP_DIR $LOG_DIR

tests/functional/eval.sh

Lines changed: 0 additions & 1 deletion
@@ -12,7 +12,6 @@ EXP_DIR=$SCRIPT_DIR/$EXP_NAME
 LOG_DIR=$EXP_DIR/logs
 JSON_METRICS=$EXP_DIR/metrics.json
 RUN_LOG=$EXP_DIR/run.log
-export UV_CACHE_DIR=${UV_CACHE_DIR:-$PROJECT_ROOT/uv_cache}
 export PYTHONPATH=${PROJECT_ROOT}:${PYTHONPATH:-}
 
 rm -rf $EXP_DIR $LOG_DIR

tests/functional/grpo.sh

Lines changed: 0 additions & 1 deletion
@@ -12,7 +12,6 @@ EXP_DIR=$SCRIPT_DIR/$EXP_NAME
 LOG_DIR=$EXP_DIR/logs
 JSON_METRICS=$EXP_DIR/metrics.json
 RUN_LOG=$EXP_DIR/run.log
-export UV_CACHE_DIR=${UV_CACHE_DIR:-$PROJECT_ROOT/uv_cache}
 export PYTHONPATH=${PROJECT_ROOT}:${PYTHONPATH:-}
 
 rm -rf $EXP_DIR $LOG_DIR

tests/functional/grpo_multiturn.sh

Lines changed: 0 additions & 1 deletion
@@ -12,7 +12,6 @@ EXP_DIR=$SCRIPT_DIR/$EXP_NAME
 LOG_DIR=$EXP_DIR/logs
 JSON_METRICS=$EXP_DIR/metrics.json
 RUN_LOG=$EXP_DIR/run.log
-export UV_CACHE_DIR=${UV_CACHE_DIR:-$PROJECT_ROOT/uv_cache}
 export PYTHONPATH=${PROJECT_ROOT}:${PYTHONPATH:-}
 
 rm -rf $EXP_DIR $LOG_DIR

tests/functional/sft.sh

Lines changed: 0 additions & 1 deletion
@@ -15,7 +15,6 @@ EXP_DIR=$SCRIPT_DIR/$EXP_NAME
 LOG_DIR=$EXP_DIR/logs
 JSON_METRICS=$EXP_DIR/metrics.json
 RUN_LOG=$EXP_DIR/run.log
-export UV_CACHE_DIR=${UV_CACHE_DIR:-$PROJECT_ROOT/uv_cache}
 export PYTHONPATH=${PROJECT_ROOT}:${PYTHONPATH:-}
 
 rm -rf $EXP_DIR $LOG_DIR

tests/run_functional_in_docker.sh

Lines changed: 0 additions & 4 deletions
@@ -28,10 +28,8 @@ CONTAINER=${CONTAINER}
 
 export HF_HOME=${HF_HOME:-$(realpath $SCRIPT_DIR/../hf_home)}
 export HF_DATASETS_CACHE=${HF_DATASETS_CACHE:-$(realpath $SCRIPT_DIR/../hf_datasets_cache)}
-export UV_CACHE_DIR=${UV_CACHE_DIR:-$(realpath $SCRIPT_DIR/../uv_cache)}
 mkdir -p $HF_HOME
 mkdir -p $HF_DATASETS_CACHE
-mkdir -p $UV_CACHE_DIR
 
 # Check if running in GitLab CI
 INTERACTIVE_FLAG=""
@@ -52,12 +50,10 @@ docker run -u root $INTERACTIVE_FLAG --ulimit memlock=-1 --ulimit stack=67108864
     -v "$PROJECT_ROOT:$PROJECT_ROOT" \
     -v $HF_HOME:/hf_home \
     -v $HF_DATASETS_CACHE:/hf_datasets_cache \
-    -v $UV_CACHE_DIR:/uv_cache \
     -e WANDB_API_KEY \
     -e HF_TOKEN \
     -e HF_HOME=/hf_home \
     -e HF_DATASETS_CACHE=/hf_datasets_cache \
-    -e UV_CACHE_DIR=/uv_cache \
     -e HOME=/tmp/ \
     -w $SCRIPT_DIR \
     "$CONTAINER" -- \
