
Commit 4415e55

wip
Signed-off-by: Terry Kong <[email protected]>
fix it
Signed-off-by: Terry Kong <[email protected]>
make it match
Signed-off-by: Terry Kong <[email protected]>
fix it all
Signed-off-by: Terry Kong <[email protected]>
1 parent 383ed0b commit 4415e55

File tree

9 files changed: +111 −31 lines changed


docs/cluster.md

Lines changed: 99 additions & 14 deletions
@@ -25,8 +25,9 @@ sbatch \
 ray.sub
 ```
 
-Notes:
-* Some clusters may or may not need `--gres=gpu:8` to be added to the `sbatch` command.
+```{tip}
+Some Slurm clusters may require `--gres=gpu:8` to be added to the `sbatch` command.
+```
 
 Which will print the `SLURM_JOB_ID`:
 ```text
@@ -73,23 +74,107 @@ Now that you are on the head node, you can launch the command as follows:
 uv run ./examples/run_grpo_math.py
 ```
 
-### Slurm UV_CACHE_DIR
+### Slurm Environment Variables
 
-There several choices for `UV_CACHE_DIR` when using `ray.sub`:
+All Slurm environment variables described below can be added to the `sbatch`
+invocation of `ray.sub`. For example, `GPUS_PER_NODE=8` can be specified as follows:
 
-1. (default) `UV_CACHE_DIR` defaults to `$SLURM_SUBMIT_DIR/uv_cache` when not specified the shell environment, and is mounted to head and worker nodes to serve as a persistent cache between runs.
-2. Use the warm uv cache from our docker images:
-   ```sh
-   ...
-   UV_CACHE_DIR=/home/ray/.cache/uv \
-   sbatch ... \
-   ray.sub
+```sh
+GPUS_PER_NODE=8 \
+... \
+sbatch ray.sub \
+...
+```
+#### Common Environment Configuration
+``````{list-table}
+:header-rows: 1
+
+* - Environment Variable
+  - Explanation
+* - `CONTAINER`
+  - (Required) Specifies the container image to be used for the Ray cluster.
+    Use either a docker image from a registry or a squashfs (if using enroot/pyxis).
+* - `MOUNTS`
+  - (Required) Defines paths to mount into the container. Examples:
+    ```md
+    * `MOUNTS="$PWD:$PWD"` (mounts the current working directory (CWD))
+    * `MOUNTS="$PWD:$PWD,/nfs:/nfs:ro"` (mounts the current working directory and `/nfs`, with `/nfs` mounted as read-only)
     ```
+* - `COMMAND`
+  - Command to execute after the Ray cluster starts. If empty, the cluster idles and enters interactive mode (see the [Slurm interactive instructions](#interactive-launching)).
+* - `HF_HOME`
+  - Sets the cache directory for huggingface-hub assets (e.g., models/tokenizers).
+* - `WANDB_API_KEY`
+  - Setting this allows you to use the wandb logger without having to run `wandb login`.
+* - `HF_TOKEN`
+  - Sets the token used by huggingface-hub. Avoids having to run `huggingface-cli login`.
+* - `HF_DATASETS_CACHE`
+  - Sets the cache directory for downloaded Huggingface datasets.
+``````
 
-(1) is more efficient in general since the cache is not ephemeral and is persisted run to run; but for users that
-don't want to persist the cache, you can use (2), which is just as performant as (1) if the `uv.lock` is
-covered by warmed cache.
+:::{tip}
+When `HF_TOKEN`, `WANDB_API_KEY`, `HF_HOME`, and `HF_DATASETS_CACHE` are set in your shell environment using `export`, they are automatically passed to `ray.sub`. For instance, if you set:
 
+```sh
+export HF_TOKEN=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
+```
+this token will be available to your NeMo RL run. Consider adding these exports to your shell configuration file, such as `~/.bashrc`.
+:::
+
+#### Advanced Environment Configuration
+``````{list-table}
+:header-rows: 1
+
+* - Environment Variable
+    (and default)
+  - Explanation
+* - `UV_CACHE_DIR_OVERRIDE`
+  - By default, this variable does not need to be set. If unset, `ray.sub` uses the
+    `UV_CACHE_DIR` defined within the container (defaulting to `/root/.cache/uv`).
+    `ray.sub` intentionally avoids using the `UV_CACHE_DIR` from the user's host
+    environment to prevent the host's cache from interfering with the container's cache.
+    Set `UV_CACHE_DIR_OVERRIDE` if you have a customized `uv` environment (e.g.,
+    with pre-downloaded packages or specific configurations) that you want to persist
+    and reuse across container runs. This variable should point to a path on a shared
+    filesystem accessible by all nodes (head and workers). This path will be mounted
+    into the container and will override the container's default `UV_CACHE_DIR`.
+* - `CPUS_PER_WORKER=128`
+  - CPUs each Ray worker node claims. Default is `16 * GPUS_PER_NODE`.
+* - `GPUS_PER_NODE=8`
+  - Number of GPUs each Ray worker node claims. To determine this, run `nvidia-smi` on a worker node.
+* - `BASE_LOG_DIR=$SLURM_SUBMIT_DIR`
+  - Base directory for storing Ray logs. Defaults to the Slurm submission directory ([SLURM_SUBMIT_DIR](https://slurm.schedmd.com/sbatch.html#OPT_SLURM_SUBMIT_DIR)).
+* - `NODE_MANAGER_PORT=53001`
+  - Port for the Ray node manager on worker nodes.
+* - `OBJECT_MANAGER_PORT=53003`
+  - Port for the Ray object manager on worker nodes.
+* - `RUNTIME_ENV_AGENT_PORT=53005`
+  - Port for the Ray runtime environment agent on worker nodes.
+* - `DASHBOARD_AGENT_GRPC_PORT=53007`
+  - gRPC port for the Ray dashboard agent on worker nodes.
+* - `METRICS_EXPORT_PORT=53009`
+  - Port for exporting metrics from worker nodes.
+* - `PORT=6379`
+  - Main port for the Ray head node.
+* - `RAY_CLIENT_SERVER_PORT=10001`
+  - Port for the Ray client server on the head node.
+* - `DASHBOARD_GRPC_PORT=52367`
+  - gRPC port for the Ray dashboard on the head node.
+* - `DASHBOARD_PORT=8265`
+  - Port for the Ray dashboard UI on the head node. This is also the port
+    used by the Ray distributed debugger.
+* - `DASHBOARD_AGENT_LISTEN_PORT=52365`
+  - Listening port for the dashboard agent on the head node.
+* - `MIN_WORKER_PORT=54001`
+  - Minimum port in the range for Ray worker processes.
+* - `MAX_WORKER_PORT=54257`
+  - Maximum port in the range for Ray worker processes.
+``````
+
+:::{note}
+For the most part, you will not need to change ports unless these
+are already taken by another service running on your cluster.
+:::
 
 ## Kubernetes
 
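Taken together, the Slurm variables documented in the two tables above compose a single `sbatch` launch of `ray.sub`. The sketch below is illustrative only: the container image, account and partition names, and all paths are placeholder assumptions, and cluster-specific `sbatch` flags should follow the example at the top of the documented page.

```sh
# Illustrative launch of ray.sub; every value below is a placeholder.
CONTAINER=your_registry/your_image:tag \
MOUNTS="$PWD:$PWD,/nfs:/nfs:ro" \
HF_HOME=/nfs/$USER/hf_home \
HF_DATASETS_CACHE=/nfs/$USER/hf_datasets_cache \
GPUS_PER_NODE=8 \
UV_CACHE_DIR_OVERRIDE=/nfs/$USER/uv_cache \
COMMAND="uv run ./examples/run_grpo_math.py" \
sbatch \
    --nodes=2 \
    --account=YOUR_ACCOUNT \
    --partition=YOUR_PARTITION \
    --gres=gpu:8 \
    ray.sub
```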

nemo_rl/distributed/virtual_cluster.py

Lines changed: 0 additions & 3 deletions
@@ -67,9 +67,6 @@ def init_ray(log_dir: Optional[str] = None):
     If that cluster uses the same CUDA_VISIBLE_DEVICES or Slurm managed tag we will reuse it.
     Otherwise, we will detach and start a fresh local cluster.
     """
-    if "UV_CACHE_DIR" not in os.environ:
-        logging.warning("UV_CACHE_DIR is not set, using default cache dir")
-
     # Set up runtime environment
     runtime_env = {
         "env_vars": dict(os.environ),  # Pass thru all user environment variables

ray.sub

Lines changed: 12 additions & 5 deletions
@@ -54,10 +54,17 @@ MIN_WORKER_PORT=${MIN_WORKER_PORT:-54001}
 MAX_WORKER_PORT=${MAX_WORKER_PORT:-54257}
 ########################################################
 
-# Defaults to placing uv cache inside the SLURM_SUBMIT_DIR
-# This directory is mounted into the container at /home/ray/.cache/uv so it is shared between the head and worker nodes
-UV_CACHE_DIR="${UV_CACHE_DIR:-$SLURM_SUBMIT_DIR/uv_cache}"
-mkdir -p $UV_CACHE_DIR
+# Unset UV_CACHE_DIR to avoid the local cache directory interfering with the container cache
+unset UV_CACHE_DIR
+
+if [[ -n "${UV_CACHE_DIR_OVERRIDE:-}" ]]; then
+    mkdir -p "$UV_CACHE_DIR_OVERRIDE"
+    if [[ -n $MOUNTS ]]; then
+        MOUNTS+=",$UV_CACHE_DIR_OVERRIDE:/root/.cache/uv"
+    else
+        MOUNTS="$UV_CACHE_DIR_OVERRIDE:/root/.cache/uv"
+    fi
+fi
 
 # Create logs directory
 BASE_LOG_DIR=${BASE_LOG_DIR:-$SLURM_SUBMIT_DIR}
@@ -67,7 +74,7 @@ mkdir -p $LOG_DIR
 COMMON_SRUN_ARGS=""
 COMMON_SRUN_ARGS+=" --no-container-mount-home"
 COMMON_SRUN_ARGS+=" --mpi=pmix"
-COMMON_SRUN_ARGS+=" --container-mounts=$MOUNTS,$UV_CACHE_DIR:/home/ray/.cache/uv"
+COMMON_SRUN_ARGS+=" --container-mounts=$MOUNTS"
 COMMON_SRUN_ARGS+=" --container-image=$CONTAINER"
 COMMON_SRUN_ARGS+=" --container-workdir=$SLURM_SUBMIT_DIR"
 # TODO: delete these (just for debugging)
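For clarity, the following standalone sketch reproduces the mount logic added above with assumed placeholder values, to show how the final `--container-mounts` argument resolves when `UV_CACHE_DIR_OVERRIDE` is set:

```sh
# Standalone sketch of the MOUNTS resolution above; the paths are placeholders.
MOUNTS="$PWD:$PWD"
UV_CACHE_DIR_OVERRIDE=/nfs/$USER/uv_cache   # assumed shared-filesystem path

if [[ -n "${UV_CACHE_DIR_OVERRIDE:-}" ]]; then
    mkdir -p "$UV_CACHE_DIR_OVERRIDE"
    if [[ -n $MOUNTS ]]; then
        MOUNTS+=",$UV_CACHE_DIR_OVERRIDE:/root/.cache/uv"
    else
        MOUNTS="$UV_CACHE_DIR_OVERRIDE:/root/.cache/uv"
    fi
fi

echo "--container-mounts=$MOUNTS"
# Prints: --container-mounts=<cwd>:<cwd>,/nfs/<user>/uv_cache:/root/.cache/uv
# If UV_CACHE_DIR_OVERRIDE is unset, MOUNTS passes through unchanged and the
# container's built-in UV_CACHE_DIR (/root/.cache/uv) is used.
```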

tests/functional/dpo.sh

Lines changed: 0 additions & 1 deletion
@@ -12,7 +12,6 @@ EXP_DIR=$SCRIPT_DIR/$EXP_NAME
 LOG_DIR=$EXP_DIR/logs
 JSON_METRICS=$EXP_DIR/metrics.json
 RUN_LOG=$EXP_DIR/run.log
-export UV_CACHE_DIR=${UV_CACHE_DIR:-$PROJECT_ROOT/uv_cache}
 export PYTHONPATH=${PROJECT_ROOT}:${PYTHONPATH:-}
 
 rm -rf $EXP_DIR $LOG_DIR

tests/functional/eval.sh

Lines changed: 0 additions & 1 deletion
@@ -12,7 +12,6 @@ EXP_DIR=$SCRIPT_DIR/$EXP_NAME
 LOG_DIR=$EXP_DIR/logs
 JSON_METRICS=$EXP_DIR/metrics.json
 RUN_LOG=$EXP_DIR/run.log
-export UV_CACHE_DIR=${UV_CACHE_DIR:-$PROJECT_ROOT/uv_cache}
 export PYTHONPATH=${PROJECT_ROOT}:${PYTHONPATH:-}
 
 rm -rf $EXP_DIR $LOG_DIR

tests/functional/grpo.sh

Lines changed: 0 additions & 1 deletion
@@ -12,7 +12,6 @@ EXP_DIR=$SCRIPT_DIR/$EXP_NAME
 LOG_DIR=$EXP_DIR/logs
 JSON_METRICS=$EXP_DIR/metrics.json
 RUN_LOG=$EXP_DIR/run.log
-export UV_CACHE_DIR=${UV_CACHE_DIR:-$PROJECT_ROOT/uv_cache}
 export PYTHONPATH=${PROJECT_ROOT}:${PYTHONPATH:-}
 
 rm -rf $EXP_DIR $LOG_DIR

tests/functional/grpo_multiturn.sh

Lines changed: 0 additions & 1 deletion
@@ -12,7 +12,6 @@ EXP_DIR=$SCRIPT_DIR/$EXP_NAME
 LOG_DIR=$EXP_DIR/logs
 JSON_METRICS=$EXP_DIR/metrics.json
 RUN_LOG=$EXP_DIR/run.log
-export UV_CACHE_DIR=${UV_CACHE_DIR:-$PROJECT_ROOT/uv_cache}
 export PYTHONPATH=${PROJECT_ROOT}:${PYTHONPATH:-}
 
 rm -rf $EXP_DIR $LOG_DIR

tests/functional/sft.sh

Lines changed: 0 additions & 1 deletion
@@ -15,7 +15,6 @@ EXP_DIR=$SCRIPT_DIR/$EXP_NAME
 LOG_DIR=$EXP_DIR/logs
 JSON_METRICS=$EXP_DIR/metrics.json
 RUN_LOG=$EXP_DIR/run.log
-export UV_CACHE_DIR=${UV_CACHE_DIR:-$PROJECT_ROOT/uv_cache}
 export PYTHONPATH=${PROJECT_ROOT}:${PYTHONPATH:-}
 
 rm -rf $EXP_DIR $LOG_DIR

tests/run_functional_in_docker.sh

Lines changed: 0 additions & 4 deletions
@@ -28,10 +28,8 @@ CONTAINER=${CONTAINER}
 
 export HF_HOME=${HF_HOME:-$(realpath $SCRIPT_DIR/../hf_home)}
 export HF_DATASETS_CACHE=${HF_DATASETS_CACHE:-$(realpath $SCRIPT_DIR/../hf_datasets_cache)}
-export UV_CACHE_DIR=${UV_CACHE_DIR:-$(realpath $SCRIPT_DIR/../uv_cache)}
 mkdir -p $HF_HOME
 mkdir -p $HF_DATASETS_CACHE
-mkdir -p $UV_CACHE_DIR
 
 # Check if running in GitLab CI
 INTERACTIVE_FLAG=""
@@ -52,12 +50,10 @@ docker run -u root $INTERACTIVE_FLAG --ulimit memlock=-1 --ulimit stack=67108864
     -v "$PROJECT_ROOT:$PROJECT_ROOT" \
     -v $HF_HOME:/hf_home \
     -v $HF_DATASETS_CACHE:/hf_datasets_cache \
-    -v $UV_CACHE_DIR:/uv_cache \
     -e WANDB_API_KEY \
     -e HF_TOKEN \
     -e HF_HOME=/hf_home \
     -e HF_DATASETS_CACHE=/hf_datasets_cache \
-    -e UV_CACHE_DIR=/uv_cache \
     -e HOME=/tmp/ \
     -w $SCRIPT_DIR \
     "$CONTAINER" -- \
