init_ray's runtime_env (with full os.environ) causes Ray runtime_env_agent to fail #309
Comments
Thanks for filing this bug @mrm-196. So it sounds like there's something in your environment that's conflicting. Are you able to narrow it down to which key? The default behavior of forwarding all env vars is so that the UX is familiar to people who were using Aligner or are working locally. Open to feedback, but I would like to know which env var causes this misconfiguration.
Thanks for your response @terrykong. To further narrow things down, I ran a few modified variants. It seems that passing basically any non-empty dictionary of `env_vars` triggers the failure.
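For concreteness, the behavior described above corresponds roughly to the following sketch (the values are hypothetical; the exact modifications tested are not preserved here):

import ray

# Reportedly fine: no runtime_env at all (see the workaround in the issue body)
# ray.init()

# Reportedly fails: any non-empty env_vars dict, even a single benign entry
ray.init(runtime_env={"env_vars": {"SOME_HARMLESS_VAR": "1"}})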
Would you be able to share how the Ray cluster is deployed in your setup? Is this a local one (where Ray spins it up), or is this using our Slurm setup? It seems others have observed this as well with Ray on Slurm. Could you try setting `num_cpus` to see if it resolves the issue?
Regarding the Ray cluster setup: I basically ran the … Also, I did try setting `num_cpus` …
Thank you for providing more info. So to summarize: …

Is the above correct? It's hard for us since we don't have an environment where we see this failure, but could you check whether the failure depends on the value you set for `num_cpus`, i.e., the frequency of success if: …
Thanks @terrykong! I confirm that the summary is aligned with my previous observations. To further test things out, I rebuilt the hermetic container from the latest changes today and tested the scenarios you asked about on 4 different nodes with the same config: …

In the scenarios above, "single run" basically indicates running the job a single time, while "second run" simply indicates the second attempt after that. Further investigating this, I realized that in my env variables … And I guess the reason that others also ran into this could be 1 being the default value for …
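(For reference, a quick illustrative sketch for inspecting CPU-related environment variables on a node, similar to the `env | grep CPU` check further down this thread.)

import os

# List every environment variable whose name mentions CPU
for key in sorted(os.environ):
    if "CPU" in key:
        print(f"{key}={os.environ[key]}")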
One possible remedy could be something like this:

import os

ray_num_cpus = os.cpu_count()
if 'SLURM_JOB_CPUS_PER_NODE' in os.environ:
    ray_num_cpus = min(ray_num_cpus, int(os.environ['SLURM_JOB_CPUS_PER_NODE']))
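One caveat: as the `env | grep CPU` output further down this thread shows, `SLURM_JOB_CPUS_PER_NODE` can be reported as `128(x2)` on multi-node allocations, which a plain `int()` cannot parse. A slightly more defensive sketch (illustrative only):

import os
import re

ray_num_cpus = os.cpu_count()
slurm_cpus = os.environ.get('SLURM_JOB_CPUS_PER_NODE')
if slurm_cpus:
    # SLURM_JOB_CPUS_PER_NODE may look like "128(x2)"; keep only the leading integer
    match = re.match(r'\d+', slurm_cpus)
    if match:
        ray_num_cpus = min(ray_num_cpus, int(match.group()))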
Thanks for leading the investigation @mrm-196. Let me do some testing on our clusters to validate, and I'll PR something after I confirm on our end.
@mrm-196 Here's what `ray status` looks like when I launch a 2-node run:
ray status
======== Autoscaler status: 2025-05-09 17:16:20.470517 ========
Node status
---------------------------------------------------------------
Active:
1 node_959f4eee0fa11962dd84dded250e8124871f8ad140bacd68e63f46c5
1 node_ed472538d56cbcfbfdbe822af5914d58089b154bde23acf377aee2f9
1 node_915732d4a62e993c0490ba643a0249b99263fef632f3d5cfa1cca3b2
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
0.0/256.0 CPU
0.0/16.0 GPU
0B/5.35TiB memory
0B/558.79GiB object_store_memory
0.0/16.0 worker_units
Demands:
(no resource demands)

and when I run:
# env | grep CPU
SLURM_CPU_BIND=quiet,mask_cpu:0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
SLURM_CPU_BIND_VERBOSE=quiet
SLURM_CPUS_ON_NODE=128
SLURM_JOB_CPUS_PER_NODE=128(x2)
SLURM_CPU_FREQ_REQ=Performance
SLURM_CPU_BIND_LIST=0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
SLURM_CPU_BIND_TYPE=mask_cpu:
NCCL_IGNORE_CPU_AFFINITY=0

If I muck with the …, the job's CPUs still stay at 256, which makes sense given what I see from … We currently assume the worker's `--cpus-per-task=$((16 * gpus_per_node))`, which we can parametrize since not everyone will have the same CPU as us, but I'm still at a loss why yours is 1 if all the workers are started up with …
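For cross-checking on other setups, a quick way to see the totals Ray actually registered, using the standard `ray.cluster_resources()` API (shown here only as an illustrative sketch):

import ray

# Attach to the already-running cluster (assumes the head node is up)
ray.init(address="auto")
print(ray.cluster_resources())  # e.g. {'CPU': 256.0, 'GPU': 16.0, ...}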
@terrykong It seems that I added the wrong env variable name in my previous comment. Apologies for that. In my case its value has always been much larger than 1, and this env variable's value is not controlled by … Correcting my previous comment: … Looking at your posted results, it seems that our observations are aligned.
@terrykong Should we document these findings as part of some "best practice guide / things to note"?
Thanks @terrykong, #410 looks good to me! In cases where the cluster is getting set up via …
Describe the bug
When running `examples/run_sft.py`, I observe a Ray failure during initialization. More specifically, the Ray `runtime_env_agent` does not start (its log file, `runtime_env_agent.log`, is missing), leading to a Raylet timeout (visible in `raylet.out`).

Terminal output: …

Content of `raylet.out`: …

The problem is triggered when `nemo_rl.distributed.virtual_cluster.init_ray` calls `ray.init()` with a `runtime_env` argument that includes `"env_vars": dict(os.environ)`, attempting to pass all inherited shell environment variables to the Ray runtime.

Steps/Code to reproduce bug
uv run python examples/run_sft.py
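For what it's worth, a minimal standalone sketch (hypothetical, not the repo's actual code) that exercises the same `runtime_env` path described above, outside of `nemo_rl`:

import os
import ray

# Mirrors the reported trigger: forward the entire shell environment via runtime_env
ray.init(runtime_env={"env_vars": dict(os.environ)})
print(ray.cluster_resources())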
Workaround
Modifying the `init_ray` function in `nemo_rl/distributed/virtual_cluster.py` to call `ray.init(..., runtime_env=None, ...)` instead of passing the constructed `runtime_env` dictionary (which includes the full `os.environ`) resolves this initial Ray startup problem and allows the script to proceed.

Environment overview (please complete the following information)