Setup Commands not running with Ray Cluster on GCP #46451
Comments
@arandhaw when you say the worker nodes are not set up correctly, what are the symptoms? |
I am experiencing the same issue.
My |
I think the setup commands are only run on the first worker that is being created. I'm saying that because using |
@jjyao let me clarify exactly what seems to occur. Since I install dependencies in the setup commands (e.g., "pip install torch"), my ray jobs fail since none of the required libraries have been installed. To be clear, the problem is not that the commands are failing and raising error messages. They are not being run at all. |
I may have solved my problem. It turns out the version of Ray installed on the head and worker nodes was 2.8.1, whereas the version on the cluster launcher was 2.32.0. I had assumed that Ray would install itself on the head/worker nodes, but I think it was using an older version that came with the VM image. Adding "pip install -U ray[all]" to the setup commands seems to have fixed the problem. It would be nice if the documentation were clearer (or if Ray gave a meaningful error message). |
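For reference, a minimal sketch of how that workaround could look in the cluster YAML; the "[all]" extras and the extra torch line are illustrative assumptions, not taken from the comment above:

setup_commands:
    # Upgrade Ray on every node so head/workers match the version used by the cluster launcher.
    - pip install -U "ray[all]"
    # Then install the job's own dependencies, e.g.:
    - pip install torch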
I'm having the same issue. Also using Ray Cluster on GCP VMs. |
did not work for me. |
I'm having the same issue here.
What I'm doing to handle that for now is to use the "startup-script" field from GCP in the node_config:

node_config:
    machineType: custom-24-61440
    metadata:
        items:
        - key: startup-script
          value: |
            #!/bin/bash
            sudo mkdir -p /fiftyone/datasets
            <more commands here>

It's working as expected this way. |
@Gabriellgpc Thank you for sharing! That's a good workaround, but I need the "file_mounts" option to sync files across the cluster and it seems hard to achieve that with startup scripts. |
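For context, `file_mounts` in the cluster YAML maps a path on the head and worker nodes to a path on the machine running `ray up`, and the autoscaler syncs it to every node. A minimal sketch with placeholder paths:

file_mounts:
    # <path on head and worker nodes>: <path on the machine running `ray up`>
    "/home/ubuntu/code": "./code"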
@astron8t-voyagerx I'm actually using the startup command to do something similar, because I want to share a large folder between nodes and file_mounts does not handle that. I'm using GCP buckets to store my large files and mounting them on all the nodes using the gcsfuse command. |
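A rough sketch of that approach, assuming gcsfuse is already available on the VM image; the bucket name and mount point below are placeholders:

node_config:
    metadata:
        items:
        - key: startup-script
          value: |
            #!/bin/bash
            # Mount a GCS bucket on every node so large files are shared across the cluster.
            mkdir -p /mnt/shared
            gcsfuse --implicit-dirs my-shared-bucket /mnt/shared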
I'm observing something very similar to this. My setup is the following:

`cluster-config.yml`:

[...]
provider:
type: gcp
region: us-east4
availability_zone: us-east4-a
project_id: [project-id]
[...]
cluster_synced_files: []
file_mounts_sync_continuously: false
rsync_exclude: []
rsync_filter:
- ".gitignore"
initialization_commands: []
setup_commands:
- cd ~/code && bazelisk run //:create_venv # Use `rules_uv` to create `.venv` with pip packages.
- # download ~200MB file
- echo 'source ~/code/.venv/bin/activate' >> ~/.bashrc
- python -m ensurepip && python -m pip install -U google-api-python-client
head_setup_commands: []
worker_setup_commands: []
head_start_ray_commands:
- ray stop
- >-
ray start
--head
--port=6379
--object-manager-port=8076
--autoscaling-config=~/ray_bootstrap_config.yaml
--disable-usage-stats
# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
- ray stop
- >-
ray start
--address=$RAY_HEAD_IP:6379
--object-manager-port=8076
      --disable-usage-stats

Snippet 1/3 of `/tmp/ray/session_latest/logs/monitor*`: Command 'ray' not found.
Snippet 2/3 of `/tmp/ray/session_latest/logs/monitor*`: it seems weird that there are "no initialization commands to run" and "no setup commands to run" messages.
Snippet 3/3 of `/tmp/ray/session_latest/logs/monitor*`: messy error log ending in an SSL error.
This makes me suspect that this is perhaps some sort of threading issue. Note that some of my workers run exactly as expected. The problem seems to happen more often the more nodes I have: about a quarter to half of my nodes set up fine when I run with 32 nodes, and it feels like it happens much less frequently with fewer nodes. |
@jjyao maybe we could do some investigation here? seems odd to have this reliability issue |
same issue here, any updates? |
I am currently trying to handle this issue. I'll put simpler reproduction steps here as a note for myself.

cluster_name: ray-46451
provider:
type: gcp
region: us-west1
availability_zone: us-west1-a
project_id: <project_id>
max_workers: 12
upscaling_speed: 100.0
idle_timeout_minutes: 60
auth:
ssh_user: ubuntu
available_node_types:
ray_head_default:
resources: {"CPU": 0}
node_config:
machineType: n1-standard-2
disks:
- boot: true
autoDelete: true
type: PERSISTENT
initializeParams:
diskSizeGb: 50
sourceImage: projects/deeplearning-platform-release/global/images/common-cpu-v20240922
ray_worker_small:
min_workers: 0
max_workers: 10
resources: {"CPU": 2}
node_config:
machineType: n1-standard-2
disks:
- boot: true
autoDelete: true
type: PERSISTENT
initializeParams:
diskSizeGb: 50
sourceImage: projects/deeplearning-platform-release/global/images/common-cpu-v20240922
scheduling:
- preemptible: true
head_node_type: ray_head_default
file_mounts: {}
cluster_synced_files: []
file_mounts_sync_continuously: false
rsync_exclude: []
rsync_filter:
- ".gitignore"
initialization_commands:
- 'echo "Run initialization_commands" >> /tmp/logs.txt 2>&1'
# List of shell commands to run to set up nodes.
setup_commands:
- 'echo "Run setup_commands" >> /tmp/logs.txt 2>&1'
- pip install emoji
# Custom commands that will be run on the head node after common setup.
head_setup_commands:
- 'echo "Run head_setup_commands" >> /tmp/logs.txt 2>&1'
- "pip install google-api-python-client==1.7.8"
# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands:
- 'echo "Run worker_setup_commands" >> /tmp/logs.txt 2>&1'
# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
- ray stop
- >-
ray start
--head
--port=6379
--object-manager-port=8076
--autoscaling-config=~/ray_bootstrap_config.yaml
# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
- ray stop
- >-
ray start
--address=$RAY_HEAD_IP:6379
      --object-manager-port=8076

Run:
import time
import ray
ray.init()
@ray.remote(num_cpus=2)
def f():
time.sleep(10)
try:
import emoji
return True
except ModuleNotFoundError:
return False
futures = [f.remote() for i in range(10)]
print(ray.get(futures))

Expected: All `True`. |
Any luck with this? It seems like the issue can now be pretty reliably reproduced. Does anyone happen to have any workaround for this? I wonder if adding some delays/sleeps somewhere could help the issue. |
@hartikainen Did you try Ray version 2.38.0? I followed the steps I mentioned above and experimented by adding |
No, unfortunately, neither |
What happened + What you expected to happen
I have been creating Ray clusters on cloud VMs in Google Cloud. I've been having issues with the setup_commands in the Ray cluster YAML file, which are supposed to run whenever new nodes are created.
The commands always run correctly on the head node. However, sometimes when new workers are created by the autoscaler, one or both of the worker nodes is not set up correctly. No errors appear in the logs; the setup simply doesn't happen. It appears to work or stop working at random.
The YAML file below is the configuration file I've been using. You'll need to change the placeholders in three places for your specific cloud project. I've been creating the clusters using ray up in the Google Cloud Shell, then SSHing into the head node to run scripts. The error first started appearing when the autoscaler added more than one worker.
Versions / Dependencies
Most recent version of Ray.
The cluster is created from the Google Cloud Shell.
Reproduction script
Issue Severity
High: It blocks me from completing my task.