init_ray's runtime_env (with full os.environ) causes Ray runtime_env_agent to fail #309
Comments
Thanks for filing this bug @mrm-196. So it sounds like there's something in your environment that's conflicting. Are you able to narrow it down to which key? The default behavior of forwarding all env vars is so that the UX is familiar to people who were using Aligner or are working locally. Open to feedback, but I would like to know which env var causes this misconfiguration.
Thanks for your response @terrykong. To further narrow things down, I ran a few modified variants. It seems that passing basically any non-empty dictionary of `env_vars` triggers the failure.
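For concreteness, the behavior described above corresponds roughly to the following sketch (the values are hypothetical; the exact modifications tested are not preserved here):

import ray

# Reportedly fine: no runtime_env at all (see the workaround in the issue body)
# ray.init()

# Reportedly fails: any non-empty env_vars dict, even a single benign entry
ray.init(runtime_env={"env_vars": {"SOME_HARMLESS_VAR": "1"}})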
Would you be able to share how the Ray cluster is deployed in your setup? Is this a local one (where Ray spins it up), or is this using our Slurm setup? It seems others have observed this as well with Ray on Slurm. Could you try setting `num_cpus` to see if it resolves the issue?
Regarding the Ray cluster setup: I basically ran the … Also, I did try setting `num_cpus` …
Thank you for providing more info. So to summarize: …

Is the above correct? It's hard for us since we don't have an environment where we see this failure, but could you check whether the failure depends on the value you set for `num_cpus`, i.e., the frequency of success if: …
Thanks @terrykong! I confirm that the summary is aligned with my previous observations. To further test things out, I rebuilt the hermetic container from the latest changes today and tested the scenarios you asked about on 4 different nodes with the same config: …

In the scenarios above, "single run" basically indicates running the job a single time, while "second run" simply indicates the second attempt after that. Further investigating this, I realized that in my env variables … And I guess the reason that others also ran into this could be 1 being the default value for …
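(For reference, a quick illustrative sketch for inspecting CPU-related environment variables on a node, similar to the `env | grep CPU` check further down this thread.)

import os

# List every environment variable whose name mentions CPU
for key in sorted(os.environ):
    if "CPU" in key:
        print(f"{key}={os.environ[key]}")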
One possible remedy could be something like this:

import os

ray_num_cpus = os.cpu_count()
if 'SLURM_JOB_CPUS_PER_NODE' in os.environ:
    ray_num_cpus = min(ray_num_cpus, int(os.environ['SLURM_JOB_CPUS_PER_NODE']))
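One caveat: as the `env | grep CPU` output further down this thread shows, `SLURM_JOB_CPUS_PER_NODE` can be reported as `128(x2)` on multi-node allocations, which a plain `int()` cannot parse. A slightly more defensive sketch (illustrative only):

import os
import re

ray_num_cpus = os.cpu_count()
slurm_cpus = os.environ.get('SLURM_JOB_CPUS_PER_NODE')
if slurm_cpus:
    # SLURM_JOB_CPUS_PER_NODE may look like "128(x2)"; keep only the leading integer
    match = re.match(r'\d+', slurm_cpus)
    if match:
        ray_num_cpus = min(ray_num_cpus, int(match.group()))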
Thanks for leading the investigation @mrm-196. Let me do some testing on our clusters to validate, and I'll PR something after I confirm on our end.
@mrm-196 Here's what `ray status` looks like when I launch a 2-node run:
ray status
======== Autoscaler status: 2025-05-09 17:16:20.470517 ========
Node status
---------------------------------------------------------------
Active:
1 node_959f4eee0fa11962dd84dded250e8124871f8ad140bacd68e63f46c5
1 node_ed472538d56cbcfbfdbe822af5914d58089b154bde23acf377aee2f9
1 node_915732d4a62e993c0490ba643a0249b99263fef632f3d5cfa1cca3b2
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
0.0/256.0 CPU
0.0/16.0 GPU
0B/5.35TiB memory
0B/558.79GiB object_store_memory
0.0/16.0 worker_units
Demands:
(no resource demands)

and when I run:
# env | grep CPU
SLURM_CPU_BIND=quiet,mask_cpu:0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
SLURM_CPU_BIND_VERBOSE=quiet
SLURM_CPUS_ON_NODE=128
SLURM_JOB_CPUS_PER_NODE=128(x2)
SLURM_CPU_FREQ_REQ=Performance
SLURM_CPU_BIND_LIST=0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
SLURM_CPU_BIND_TYPE=mask_cpu:
NCCL_IGNORE_CPU_AFFINITY=0

If I muck with the …, the job's CPUs still stay at 256, which makes sense given what I see from … We currently assume the worker's `--cpus-per-task=$((16 * gpus_per_node))`, which we can parametrize since not everyone will have the same CPU as us, but I'm still at a loss why yours is 1 if all the workers are started up with …
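For cross-checking on other setups, a quick way to see the totals Ray actually registered, using the standard `ray.cluster_resources()` API (shown here only as an illustrative sketch):

import ray

# Attach to the already-running cluster (assumes the head node is up)
ray.init(address="auto")
print(ray.cluster_resources())  # e.g. {'CPU': 256.0, 'GPU': 16.0, ...}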
@terrykong It seems that I added the wrong env variable name in my previous comment. Apologies for that. In my case its value has always been much larger than 1, and this env variable's value is not controlled by … Correcting my previous comment: … Looking at your posted results, it seems that our observations are aligned.
@terrykong Should we document these findings as part of some "best practice guide / things to note"?
Thanks @terrykong, #410 looks good to me! In cases where the cluster is getting set up via …
Describe the bug
When running `examples/run_sft.py`, I observe a Ray failure during initialization. More specifically, the Ray `runtime_env_agent` does not start (its log file, `runtime_env_agent.log`, is missing), leading to a Raylet timeout (visible in `raylet.out`).

Terminal output: …

Content of `raylet.out`: …

The problem is triggered when `nemo_rl.distributed.virtual_cluster.init_ray` calls `ray.init()` with a `runtime_env` argument that includes `"env_vars": dict(os.environ)`, attempting to pass all inherited shell environment variables to the Ray runtime.

Steps/Code to reproduce bug
uv run python examples/run_sft.py
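For what it's worth, a minimal standalone sketch (hypothetical, not the repo's actual code) that exercises the same `runtime_env` path described above, outside of `nemo_rl`:

import os
import ray

# Mirrors the reported trigger: forward the entire shell environment via runtime_env
ray.init(runtime_env={"env_vars": dict(os.environ)})
print(ray.cluster_resources())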
Workaround
Modifying the `init_ray` function in `nemo_rl/distributed/virtual_cluster.py` to call `ray.init(..., runtime_env=None, ...)` instead of passing the constructed `runtime_env` dictionary (which includes the full `os.environ`) resolves this initial Ray startup problem and allows the script to proceed.

Environment overview (please complete the following information)