
Setup Commands not running with Ray Cluster on GCP #46451

Closed
arandhaw opened this issue Jul 5, 2024 · 18 comments · Fixed by #49440
Labels
bug (Something that is supposed to be working; but isn't), core (Issues that should be addressed in Ray Core), core-clusters (For launching and managing Ray clusters/jobs/kubernetes), P2 (Important issue, but not time-critical)

Comments

@arandhaw

arandhaw commented Jul 5, 2024

What happened + What you expected to happen

I have been creating Ray clusters on cloud VMs in Google Cloud. I've been having issues with the setup_commands in the Ray cluster YAML file, which are supposed to run whenever new nodes are created.

The commands always run correctly on the head node. However, sometimes when new workers are created by the autoscaler, one or both of the worker nodes is not set up correctly. No errors appear in the logs, but the worker is missing its setup. It appears to work / stop working randomly.

The YAML file below is the configuration I've been using. You'll need to change the project ID in three places for your specific cloud project. I've been creating the clusters with ray up from the Google Cloud Shell, then SSHing into the head node to run scripts. The error first appeared when the autoscaler added more than one worker.

Versions / Dependencies

Ray: most recent version.
The cluster is created from the Google Cloud Shell.

Reproduction script

# An unique identifier for the head node and workers of this cluster.
cluster_name: gpu-cluster

# Cloud-provider specific configuration.
provider:
    type: gcp
    region: us-east1
    availability_zone: "us-east1-c"
    project_id: <the project ID>  # Globally unique project id
# The maximum number of worker nodes to launch in addition to the head
# node.
max_workers: 2

# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 2.0

# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 20
# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ubuntu

# Tell the autoscaler the allowed node types and the resources they provide.
available_node_types:
    ray_head:
        # The resources provided by this node type.
        resources: {"CPU": 16}
        # Provider-specific config for the head node, e.g. instance type.
        node_config:
            machineType: n1-standard-16
            serviceAccounts:
              - email: "ray-autoscaler-sa-v1@<project name>.iam.gserviceaccount.com"
                scopes:
                 - "https://www.googleapis.com/auth/cloud-platform"
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 100
                  # See https://cloud.google.com/compute/docs/images for more images
                  sourceImage: projects/ml-images/global/images/c0-deeplearning-common-cu121-v20231209-debian-11
            scheduling:
              - onHostMaintenance: TERMINATE

    ray_worker_gpu:
        # The minimum number of nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 0
        # The maximum number of worker nodes of this type to launch.
        # This takes precedence over min_workers.
        max_workers: 2
        # The resources provided by this node type.
        resources: {"CPU": 8, "GPU": 1}
        # Provider-specific config for worker nodes of this type, e.g. instance type. By default
        # Ray will auto-configure unspecified fields such as subnets and ssh-keys.
        # For more documentation on available fields, see:
        # https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert
        node_config:
            machineType: n1-standard-8
            serviceAccounts:
              - email: "ray-autoscaler-sa-v1@<project-id>.iam.gserviceaccount.com"
                scopes:
                 - "https://www.googleapis.com/auth/cloud-platform"
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 100
                  # See https://cloud.google.com/compute/docs/images for more images
                  sourceImage: projects/ml-images/global/images/c0-deeplearning-common-cu121-v20231209-debian-11
            # Make sure to set scheduling->onHostMaintenance to TERMINATE when GPUs are present
            # Run workers on preemptible instances by default.
            # Comment this out to use on-demand
            guestAccelerators:
              - acceleratorType: nvidia-tesla-t4
                acceleratorCount: 1
            metadata:
              items:
                - key: install-nvidia-driver
                  value: "True"
            scheduling:
              - preemptible: true
              - onHostMaintenance: TERMINATE
# Specify the node type of the head node (as configured above).
head_node_type: ray_head

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {}
#    "/path1/on/remote/machine": "/path1/on/local/machine",
#    "/path2/on/remote/machine": "/path2/on/local/machine",


# run before setup commands (also outside of any docker containers)
initialization_commands:
 - 'echo "Setup Commands Started" >> /home/ubuntu/logs.txt 2>&1'

# List of shell commands to run to set up nodes.
setup_commands:
  - 'echo "Setup Commands Started" >> /home/ubuntu/logs.txt 2>&1'
  - "pip3 install torch >> /home/ubuntu/logs.txt 2>&1"
  - "pip3 install torchvision >> /home/ubuntu/logs.txt 2>&1"
  - "pip3 install Pillow >> /home/ubuntu/logs.txt 2>&1"
  - "pip3 install requests >> /home/ubuntu/logs.txt 2>&1"
  - "pip3 install Flask >> /home/ubuntu/logs.txt 2>&1"

# Custom commands that will be run on the head node after common setup.
head_setup_commands:
  - 'echo "Head Commands Started" >> /home/ubuntu/logs.txt 2>&1'
  - "pip install google-api-python-client==1.7.8"

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands:
  - 'echo "Worker command Started" >> /home/ubuntu/logs.txt 2>&1'

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - >-
      ulimit -n 65536;
      ray start
      --head
      --port=6379
      --object-manager-port=8076
      --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - >-
      ulimit -n 65536;
      ray start
      --address=$RAY_HEAD_IP:6379
      --object-manager-port=8076

Issue Severity

High: It blocks me from completing my task.

@arandhaw added the bug and triage labels on Jul 5, 2024
@anyscalesam added the core and core-clusters labels on Jul 8, 2024
@jjyao
Collaborator

jjyao commented Jul 8, 2024

@arandhaw when you say the worker nodes are not set up correctly, what are the symptoms?

@jjyao added the P1 label and removed the triage label on Jul 8, 2024
@g-goessel

I am experiencing the same issue.
When I start one worker node at a time, it works fine.
If the autoscaler tries to start multiple nodes, it fails to run the setup commands. The result is that Ray is not installed.

2024-07-09 11:57:41,321 INFO updater.py:452 -- [5/7] Initializing command runner
2024-07-09 11:57:41,321 INFO updater.py:498 -- [6/7] No setup commands to run.
2024-07-09 11:57:41,322 INFO updater.py:503 -- [7/7] Starting the Ray runtime
2024-07-09 11:57:41,322 VINFO command_runner.py:371 -- Running export RAY_OVERRIDE_RESOURCES='{"CPU":1}';export RAY_HEAD_IP=10.128.0.62; ray stop
2024-07-09 11:57:41,322 VVINFO command_runner.py:373 -- Full command is ssh -tt -i ~/ray_bootstrap_key.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_24bf68e341/c022e6b155/%C -o ControlPersist=10s -o ConnectTimeout=120s [email protected] bash --login -c -i 'source ~/.bashrc; export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (export RAY_OVERRIDE_RESOURCES='"'"'{"CPU":1}'"'"';export RAY_HEAD_IP=10.128.0.62; ray stop)'

==> /tmp/ray/session_latest/logs/monitor.log <==
2024-07-09 11:57:41,488 INFO discovery.py:873 -- URL being requested: POST https://compute.googleapis.com/compute/v1/projects/animated-bit-413114/zones/us-central1-a/instances/ray-gcp-99498f66b357d8db-worker-5b1c305c-compute/setLabels?alt=json
2024-07-09 11:57:41,791 INFO node.py:348 -- wait_for_compute_zone_operation: Waiting for operation operation-1720526261534-61ccf3ca540ad-b590f7ab-89ab5f7d to finish...
2024-07-09 11:57:41,792 INFO discovery.py:873 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/animated-bit-413114/zones/us-central1-a/operations/operation-1720526261534-61ccf3ca540ad-b590f7ab-89ab5f7d?alt=json
2024-07-09 11:57:46,930 INFO discovery.py:873 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/animated-bit-413114/zones/us-central1-a/operations/operation-1720526261534-61ccf3ca540ad-b590f7ab-89ab5f7d?alt=json
2024-07-09 11:57:47,058 INFO node.py:367 -- wait_for_compute_zone_operation: Operation operation-1720526261534-61ccf3ca540ad-b590f7ab-89ab5f7d finished.

==> /tmp/ray/session_latest/logs/monitor.out <==
2024-07-09 11:57:47,058 ERR updater.py:171 -- New status: update-failed
2024-07-09 11:57:47,064 ERR updater.py:173 -- !!!
2024-07-09 11:57:47,064 VERR updater.py:183 -- Exception details: {'message': 'SSH command failed.'}
2024-07-09 11:57:47,065 ERR updater.py:185 -- Full traceback: Traceback (most recent call last):
File "/home/ubuntu/.local/share/pipx/venvs/ray/lib/python3.12/site-packages/ray/autoscaler/_private/updater.py", line 166, in run
self.do_update()
File "/home/ubuntu/.local/share/pipx/venvs/ray/lib/python3.12/site-packages/ray/autoscaler/_private/updater.py", line 531, in do_update
self.cmd_runner.run(
File "/home/ubuntu/.local/share/pipx/venvs/ray/lib/python3.12/site-packages/ray/autoscaler/_private/command_runner.py", line 379, in run
return self._run_helper(
^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.local/share/pipx/venvs/ray/lib/python3.12/site-packages/ray/autoscaler/_private/command_runner.py", line 298, in _run_helper
raise click.ClickException(fail_msg) from None
click.exceptions.ClickException: SSH command failed.

My setup_commands block is not empty, and it was successful in installing ray on the head node.

@g-goessel

I think the setup commands are only run on the first worker that is being created.

I'm saying that because ray monitor revealed that my setup command was run on exactly one host, and all the others failed. Eventually, all the worker nodes did get configured.

@arandhaw
Author

arandhaw commented Jul 10, 2024

@jjyao let me clarify exactly what seems to occur.
The problem occurs when the autoscaler creates new worker nodes.
What is supposed to happen is that during startup, the setup commands are run on the worker nodes.
Instead, what sometimes happens is that on some of the nodes, none of the setup commands are run.

Since I install dependencies in the setup commands (e.g., "pip install torch"), my ray jobs fail since none of the required libraries have been installed.

To be clear, the problem is not that the commands are failing and raising error messages. They are not being run at all.

@arandhaw
Author

I may have solved my problem. It turns out the version of Ray installed on the head and worker nodes was 2.8.1, whereas the version on the cluster launcher was 2.32.0. I had assumed that Ray would install itself on the head/worker nodes, but I think it was using an older version that came with the VM image. Adding "pip install -U ray[all]" to the setup commands seems to have fixed the problem.

It would be nice if the documentation were clearer (or if Ray gave a meaningful error message).
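For reference, this is roughly what that change looks like in the cluster YAML. The explicit version pin below is just an illustration (matching my launcher's 2.32.0); adjust it to whatever ray --version reports on the machine running ray up:

setup_commands:
  # Upgrade Ray first so the node matches the cluster launcher's version
  # instead of silently using the older Ray baked into the VM image.
  - 'pip3 install -U "ray[all]==2.32.0" >> /home/ubuntu/logs.txt 2>&1'
  # ...remaining pip3 install commands as above...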

@Gabriellgpc

I'm having the same issue: the new nodes are missing the setup commands.

Also using a Ray cluster on GCP VMs.

@Gabriellgpc

Gabriellgpc commented Jul 25, 2024

pip install -U ray[all]

did not work for me.

@astron8t-voyagerx
Contributor

I'm having the same issue here.
The last clue I found is that this doesn't happen up to version 2.6.3, and it does happen from 2.7.0 onwards.

@Gabriellgpc

Gabriellgpc commented Aug 21, 2024

What I'm doing to handle this for now is to use "startup-script", a metadata field from GCP:
https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert

In the node_config

node_config:
  machineType: custom-24-61440
  metadata:
    items:
      - key: startup-script
        value: |
          #!/bin/bash
          sudo mkdir -p /fiftyone/datasets
          <more commands here>

It's working as expected this way.

@astron8t-voyagerx
Contributor

@Gabriellgpc Thank you for sharing! That's a good workaround, but I need the "file_mounts" option to sync files across the cluster and it seems hard to achieve that with startup scripts.
I've been struggling with this issue since version 2.7.0, but no fix has appeared yet.

@Gabriellgpc

@astron8t-voyagerx I'm actually using the startup command to do something similar, because I want to share a large folder between nodes and file mounts don't handle that: I store my large files in GCP buckets and mount them on all the nodes using the gcsfuse command.
So my node_config looks like this:

            metadata:
              items:
                - key: startup-script
                  value: |
                    #!/bin/bash
                    sudo mkdir -p /fiftyone/datasets
                    sudo gcsfuse --implicit-dirs -o allow_other -file-mode=777 -dir-mode=777 --only-dir datasets fiftyone /fiftyone/datasets

@jjyao added the P2 label and removed the P1 label on Oct 30, 2024
@hartikainen
Contributor

hartikainen commented Nov 13, 2024

I'm observing something very similar to this. My setup is the following:

`cluster-config.yml`
[...]

provider:
  type: gcp
  region: us-east4
  availability_zone: us-east4-a
  project_id: [project-id]

[...]

cluster_synced_files: []
file_mounts_sync_continuously: false
rsync_exclude: []
rsync_filter:
  - ".gitignore"

initialization_commands: []

setup_commands:
  - cd ~/code && bazelisk run //:create_venv  # Use `rules_uv` to create `.venv` with pip packages.
  - # download ~200MB file
  - echo 'source ~/code/.venv/bin/activate' >> ~/.bashrc
  - python -m ensurepip && python -m pip install -U google-api-python-client

head_setup_commands: []
worker_setup_commands: []

head_start_ray_commands:
  - ray stop
  - >-
    ray start
    --head
    --port=6379
    --object-manager-port=8076
    --autoscaling-config=~/ray_bootstrap_config.yaml
    --disable-usage-stats

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
  - ray stop
  - >-
    ray start
    --address=$RAY_HEAD_IP:6379
    --object-manager-port=8076
    --disable-usage-stats
Snippet 1/3 of `/tmp/ray/session_latest/logs/monitor*`. Command 'ray' not found.
[...]
To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.

Command 'ray' not found, did you mean:
  command 'say' from deb gnustep-gui-runtime (0.30.0-3build3)
  command 'rar' from deb rar (2:6.23-1)
  command 'ra6' from deb ipv6toolkit (2.0+ds.1-2)
  command 'ra' from deb argus-client (1:3.0.8.2-6.2ubuntu1)
Try: sudo apt install <deb name>
2024-11-13 15:16:58,640 INFO log_timer.py:25 -- NodeUpdater: ray-cluster-tuomas-worker-9edc73c3-compute: Ray start commands failed [LogTimer=617ms]
2024-11-13 15:16:58,640 INFO log_timer.py:25 -- NodeUpdater: ray-cluster-tuomas-worker-9edc73c3-compute: Applied config 09ddcf9108642756fa08d76c833ac1f21e014382  [LogTimer=45322ms]
To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.
[...]
Snippet 2/3 of `/tmp/ray/session_latest/logs/monitor*`. Seems weird that there's "no initialization commands to run" and "no setup commands to run".
2024-11-13 15:14:49,324 VINFO command_runner.py:371 -- Running `uptime`
2024-11-13 15:14:49,324 VVINFO command_runner.py:373 -- Full command is `ssh -tt -i ~/ray_bootstrap_key.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_24bf68e341/aaa3e6b268/%C -o ControlPersist=10s -o ConnectTimeout=10s [email protected] bash --login -c -i 'source ~/.bashrc; export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (uptime)'`
To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.

 15:14:50 up 7 min,  1 user,  load average: 0.00, 0.00, 0.00
2024-11-13 15:14:50,402 SUCC updater.py:295 -- Success.
2024-11-13 15:14:50,402 INFO log_timer.py:25 -- NodeUpdater: ray-cluster-test-worker-60b525fc-compute: Got remote shell  [LogTimer=6672ms]
2024-11-13 15:14:54,984 INFO updater.py:339 -- New status: waiting-for-ssh
2024-11-13 15:14:54,984 INFO updater.py:277 -- [1/7] Waiting for SSH to become available
2024-11-13 15:14:54,984 INFO updater.py:281 -- Running `uptime` as a test.
2024-11-13 15:14:54,984 INFO updater.py:389 -- Updating cluster configuration. [hash=09ddcf9108642756fa08d76c833ac1f21e014382]
2024-11-13 15:14:54,985 INFO command_runner.py:204 -- Fetched IP: 10.150.0.83
2024-11-13 15:14:54,990 INFO log_timer.py:25 -- NodeUpdater: ray-cluster-test-worker-61eab44a-compute: Got IP  [LogTimer=6ms]
2024-11-13 15:14:54,990 VINFO command_runner.py:371 -- Running `uptime`
2024-11-13 15:14:54,991 VVINFO command_runner.py:373 -- Full command is `ssh -tt -i ~/ray_bootstrap_key.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_24bf68e341/aaa3e6b268/%C -o ControlPersist=10s -o ConnectTimeout=10s [email protected] bash --login -c -i 'source ~/.bashrc; export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (uptime)'`
To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.

 15:14:56 up 7 min,  1 user,  load average: 0.16, 0.07, 0.03
2024-11-13 15:14:56,050 SUCC updater.py:295 -- Success.
2024-11-13 15:14:56,050 INFO log_timer.py:25 -- NodeUpdater: ray-cluster-test-worker-61eab44a-compute: Got remote shell  [LogTimer=1066ms]
2024-11-13 15:15:00,523 INFO updater.py:396 -- New status: syncing-files
2024-11-13 15:15:00,523 INFO updater.py:389 -- Updating cluster configuration. [hash=09ddcf9108642756fa08d76c833ac1f21e014382]
2024-11-13 15:15:00,523 INFO updater.py:254 -- [2/7] Processing file mounts
2024-11-13 15:15:00,528 INFO updater.py:271 -- [3/7] No worker file mounts to sync
2024-11-13 15:15:06,116 INFO updater.py:396 -- New status: syncing-files
2024-11-13 15:15:06,121 INFO updater.py:254 -- [2/7] Processing file mounts
2024-11-13 15:15:06,122 INFO updater.py:271 -- [3/7] No worker file mounts to sync
2024-11-13 15:15:11,733 INFO updater.py:407 -- New status: setting-up
2024-11-13 15:15:11,740 INFO updater.py:448 -- [4/7] No initialization commands to run.
2024-11-13 15:15:11,740 INFO updater.py:452 -- [5/7] Initializing command runner
2024-11-13 15:15:11,740 INFO updater.py:498 -- [6/7] No setup commands to run.
2024-11-13 15:15:11,740 INFO updater.py:503 -- [7/7] Starting the Ray runtime
Snippet 3/3 of `/tmp/ray/session_latest/logs/monitor*`. Messy error log ending in an SSL error.
[... more logs ...]

Warning: Permanently added '10.150.0.48' (ED25519) to the list of known hosts.
Warning: Permanently added '10.150.0.65' (ED25519) to the list of known hosts.
Shared connection to 10.150.0.48 closed.
Shared connection to 10.150.0.65 closed.
Warning: Permanently added '10.150.0.65' (ED25519) to the list of known hosts.
Shared connection to 10.150.0.65 closed.
[The raw log here interleaves the tracebacks of two spawn_updater threads (Thread-204 and Thread-207), which raised the same error; reconstructed once below.]

Exception in thread Thread-207 (spawn_updater):
Traceback (most recent call last):
  File "/home/ubuntu/.cache/bazel/_bazel_ubuntu/da272e0d09e1d329650b5ce1f38bb920/external/rules_python~~python~python_3_12_x86_64-unknown-linux-gnu/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
    self.run()
  File "/home/ubuntu/.cache/bazel/_bazel_ubuntu/da272e0d09e1d329650b5ce1f38bb920/external/rules_python~~python~python_3_12_x86_64-unknown-linux-gnu/lib/python3.12/threading.py", line 1012, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/code/.venv/lib/python3.12/site-packages/ray/autoscaler/_private/autoscaler.py", line 1343, in spawn_updater
    updater = NodeUpdaterThread(
  File "/home/ubuntu/code/.venv/lib/python3.12/site-packages/ray/autoscaler/_private/updater.py", line 569, in __init__
    NodeUpdater.__init__(self, *args, **kwargs)
  File "/home/ubuntu/code/.venv/lib/python3.12/site-packages/ray/autoscaler/_private/updater.py", line 103, in __init__
    self.cmd_runner = provider.get_command_runner(
  File "/home/ubuntu/code/.venv/lib/python3.12/site-packages/ray/autoscaler/_private/gcp/node_provider.py", line 334, in get_command_runner
    instance = resource.get_instance(node_id)
  File "/home/ubuntu/code/.venv/lib/python3.12/site-packages/ray/autoscaler/_private/gcp/node.py", line 454, in get_instance
    .execute()
  File "/home/ubuntu/code/.venv/lib/python3.12/site-packages/googleapiclient/_helpers.py", line 130, in positional_wrapper
    return wrapped(*args, **kwargs)
  File "/home/ubuntu/code/.venv/lib/python3.12/site-packages/googleapiclient/http.py", line 923, in execute
    resp, content = _retry_request(
  File "/home/ubuntu/code/.venv/lib/python3.12/site-packages/googleapiclient/http.py", line 222, in _retry_request
    raise exception
  File "/home/ubuntu/code/.venv/lib/python3.12/site-packages/googleapiclient/http.py", line 191, in _retry_request
    resp, content = http.request(uri, method, *args, **kwargs)
  File "/home/ubuntu/code/.venv/lib/python3.12/site-packages/google_auth_httplib2.py", line 218, in request
    response, content = self.http.request(
  File "/home/ubuntu/code/.venv/lib/python3.12/site-packages/httplib2/__init__.py", line 1724, in request
    (response, content) = self._request(
  File "/home/ubuntu/code/.venv/lib/python3.12/site-packages/httplib2/__init__.py", line 1444, in _request
    (response, content) = self._conn_request(conn, request_uri, method, body, headers)
  File "/home/ubuntu/code/.venv/lib/python3.12/site-packages/httplib2/__init__.py", line 1396, in _conn_request
    response = conn.getresponse()
  File "/home/ubuntu/.cache/bazel/_bazel_ubuntu/da272e0d09e1d329650b5ce1f38bb920/external/rules_python~~python~python_3_12_x86_64-unknown-linux-gnu/lib/python3.12/http/client.py", line 1428, in getresponse
    response.begin()
  File "/home/ubuntu/.cache/bazel/_bazel_ubuntu/da272e0d09e1d329650b5ce1f38bb920/external/rules_python~~python~python_3_12_x86_64-unknown-linux-gnu/lib/python3.12/http/client.py", line 331, in begin
    version, status, reason = self._read_status()
  File "/home/ubuntu/.cache/bazel/_bazel_ubuntu/da272e0d09e1d329650b5ce1f38bb920/external/rules_python~~python~python_3_12_x86_64-unknown-linux-gnu/lib/python3.12/http/client.py", line 292, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/home/ubuntu/.cache/bazel/_bazel_ubuntu/da272e0d09e1d329650b5ce1f38bb920/external/rules_python~~python~python_3_12_x86_64-unknown-linux-gnu/lib/python3.12/socket.py", line 720, in readinto
    return self._sock.recv_into(b)
  File "/home/ubuntu/.cache/bazel/_bazel_ubuntu/da272e0d09e1d329650b5ce1f38bb920/external/rules_python~~python~python_3_12_x86_64-unknown-linux-gnu/lib/python3.12/ssl.py", line 1251, in recv_into
    return self.read(nbytes, buffer)
  File "/home/ubuntu/.cache/bazel/_bazel_ubuntu/da272e0d09e1d329650b5ce1f38bb920/external/rules_python~~python~python_3_12_x86_64-unknown-linux-gnu/lib/python3.12/ssl.py", line 1103, in read
    return self._sslobj.read(len, buffer)
ssl.SSLError: [SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:2570)
Warning: Permanently added '10.150.0.69' (ED25519) to the list of known hosts.
Shared connection to 10.150.0.69 closed.
Shared connection to 10.150.0.69 closed.
Shared connection to 10.150.0.69 closed.
Shared connection to 10.150.0.69 closed.
Shared connection to 10.150.0.69 closed.

[...more logs...]

Which makes me suspect that this is perhaps some sort of threading issue?

Note that some of my workers run exactly as expected. This seems to happen more often the more nodes I have: about a quarter to half of my nodes set up fine when I run with 32 nodes. It feels like this happens much less frequently when running with fewer nodes.

@richardliaw
Contributor

@jjyao maybe we could do some investigation here? seems odd to have this reliability issue

@mangohehe

same issue here, any updates?

@MortalHappiness
Member

I am currently trying to handle this issue. I'll put simpler reproduction steps here as a note for myself.

cluster_name: ray-46451

provider:
    type: gcp
    region: us-west1
    availability_zone: us-west1-a
    project_id: <project_id>

max_workers: 12

upscaling_speed: 100.0

idle_timeout_minutes: 60
auth:
    ssh_user: ubuntu

available_node_types:
    ray_head_default:
        resources: {"CPU": 0}
        node_config:
            machineType: n1-standard-2
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 50
                  sourceImage: projects/deeplearning-platform-release/global/images/common-cpu-v20240922

    ray_worker_small:
        min_workers: 0
        max_workers: 10
        resources: {"CPU": 2}
        node_config:
            machineType: n1-standard-2
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 50
                  sourceImage: projects/deeplearning-platform-release/global/images/common-cpu-v20240922
            scheduling:
              - preemptible: true

head_node_type: ray_head_default

file_mounts: {}
cluster_synced_files: []
file_mounts_sync_continuously: false
rsync_exclude: []
rsync_filter:
  - ".gitignore"

initialization_commands:
 - 'echo "Run initialization_commands" >> /tmp/logs.txt 2>&1'

# List of shell commands to run to set up nodes.
setup_commands:
  - 'echo "Run setup_commands" >> /tmp/logs.txt 2>&1'
  - pip install emoji

# Custom commands that will be run on the head node after common setup.
head_setup_commands:
  - 'echo "Run head_setup_commands" >> /tmp/logs.txt 2>&1'
  - "pip install google-api-python-client==1.7.8"

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands:
  - 'echo "Run worker_setup_commands" >> /tmp/logs.txt 2>&1'

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - >-
      ray start
      --head
      --port=6379
      --object-manager-port=8076
      --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - >-
      ray start
      --address=$RAY_HEAD_IP:6379
      --object-manager-port=8076

Run `ray up` and `ray attach`. On the head node, run `ray job submit -- python task.py`.

task.py

import time
import ray

ray.init()

@ray.remote(num_cpus=2)
def f():
  time.sleep(10)
  try:
    import emoji
    return True
  except ModuleNotFoundError:
    return False

futures = [f.remote() for i in range(10)]
print(ray.get(futures))

Expected: All True.
Actual: Some are False.

@hartikainen
Contributor

Any luck with this? It seems like the issue can now be pretty reliably reproduced. Does anyone happen to have any workaround for this? I wonder if adding some delays/sleeps somewhere could help the issue.

@MortalHappiness
Member

@hartikainen Did you try Ray version 2.38.0? I followed the steps I mentioned above and experimented by adding ray[default]==2.37.0 and ray[default]==2.38.0 in the setup_commands. It seems that the issue doesn't occur with version 2.38.0. You can try updating to 2.38.0 first, and if it works without issues, I'll go ahead and close this issue.

@hartikainen
Contributor

hartikainen commented Dec 25, 2024

No, unfortunately, neither ray==2.38.0 nor ray==2.40.0 fixes this issue. I spent some time today looking into this and, as I mentioned above, I believe it is caused by incorrect handling of the google-api-python-client library, which is not thread-safe according to its documentation. I opened a draft PR at #49440 with an attempt to fix this.
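For anyone who wants to experiment before a fix lands, here is a minimal sketch of the thread-safety pattern the google-api-python-client documentation recommends: give each thread its own client rather than sharing one object across the autoscaler's spawn_updater threads. The helper name get_compute_client is made up for illustration; this is not necessarily how #49440 implements the fix.

import threading

import googleapiclient.discovery

_local = threading.local()

def get_compute_client():
    # The default httplib2 transport used by googleapiclient is not
    # thread-safe, so sharing one Compute service object across several
    # updater threads can corrupt the underlying TLS connection (e.g. the
    # WRONG_VERSION_NUMBER error above). Building one client per thread
    # avoids sharing the transport.
    if not hasattr(_local, "compute"):
        _local.compute = googleapiclient.discovery.build(
            "compute", "v1", cache_discovery=False
        )
    return _local.compute

# Example usage (each thread gets an independent client):
# instances = get_compute_client().instances().list(
#     project="<project_id>", zone="us-west1-a").execute()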
