choosehappy · nanli-emory · Jun 5, 2024
diff --git a/ray_cluster_launchers/Readme.md b/ray_cluster_launchers/Readme.md
@@ -0,0 +1,155 @@
+# Instructions for Launching a Ray Cluster on AWS, Azure, and GCP
+
+
+
+## Preparation - install Ray CLI
+Use pip to install the Ray CLI in your local environment:
+```
+# install ray (the quotes keep shells such as zsh from expanding the brackets)
+pip install -U "ray[default]"
+```
+<br>
+
+
+
+
+
+
+## Configure Ray cluster launcher .yaml files for AWS, Azure, and GCP
+
+All launcher template .yaml files are adapted from the official Ray cluster config files:
+
+[aws-example-full.yaml](https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/aws/example-full.yaml), [azure-example-full.yaml](https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/azure/example-full.yaml), and [gcp-example-full.yaml](https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/gcp/example-full.yaml)
+
+<br>
+
+### A. Configure Ray Cluster on AWS at Emory
+
+
+1. Install and configure the [Emory TKI CLI](https://it.emory.edu/tki/).
+
+2. Go to the AWS Console and log in.
+
+3. Go to `EC2` > `Security Groups`, create a security group for the Ray cluster, and set `GroupName` at [line 50](./aws-ray-cluster-launcher-template.yaml#L50).
+
+4. Go to `EC2` > `Key Pairs`, create a key pair for the Ray cluster, point `ssh_private_key` at [line 59](./aws-ray-cluster-launcher-template.yaml#L59) to its private key file, and set `KeyName` at [line 84](./aws-ray-cluster-launcher-template.yaml#L84) and [line 118](./aws-ray-cluster-launcher-template.yaml#L118).
+
+5. Go to `VPC` > `Subnets`, create a subnet for the cluster, and set `SubnetIds` for the Ray head and worker nodes at [line 77](./aws-ray-cluster-launcher-template.yaml#L77) and [line 111](./aws-ray-cluster-launcher-template.yaml#L111).
+
+
+6. Log in to the AWS CLI.
+
+### B. Configure Ray Cluster on Azure
+
+1. Install and configure [the Azure CLI](https://learn.microsoft.com/en-us/cli/azure/install-azure-cli)
+
+    ```
+    # Install the Azure CLI and the Azure SDK packages.
+    pip install azure-cli azure-identity azure-mgmt azure-mgmt-network
+
+    # Log in to Azure. This will redirect you to your web browser.
+    az login
+    ```
+<br>
+
+2. Generate a new SSH key pair for the Ray cluster VMs. The Azure Ray cluster launcher will use this key to control the head and worker nodes later.
+    ```
+    # Generate the SSH key pair. -f takes the path of the private key file; the public key is written next to it with a .pub suffix.
+    ssh-keygen -f </path/to/key-file> -t rsa -b 4096
+
+    ```
+<br>
+
+3. Modify and configure the Ray cluster launcher file for Azure:
+   - On [lines 64 and 66](./azure-ray-cluster-launcher-template.yaml#L64), point to the SSH key pair that you generated locally.
+   - On [line 119](./azure-ray-cluster-launcher-template.yaml#L119), mount the SSH public key to the VMs.
+<br>
+
+
+### C. Configure Ray Cluster on GCP
+
+1. Log in to the GCP Console, create a GCP project, and note its \<gcp-project-id>. Set `project_id` to your project ID on [line 42](./gcp-ray-cluster-launcher-template.yaml#L42).
+
+<br>
+
+2. Go to the **APIs and Services** panel on the GCP Console and enable the following APIs:
+   - Cloud Resource Manager API
+   - Compute Engine API
+   - Cloud OS Login API
+   - Identity and Access Management (IAM) API   
+
+<br>
+
+3. Generate an SSH key for your GCP project (`-f` takes the path of the private key file):
+    ```
+    ssh-keygen -t rsa -f </path/to/ssh-key-file> -C <user-name> -b 2048
+    ```
+
+<br>
+
+4. Go to the **Metadata** panel and click the **SSH KEYS** tab to upload the public SSH key to the GCP project. All instances in the project inherit these SSH keys.
+
+<br>
+
+5. Set `ssh_private_key` on [line 59](./gcp-ray-cluster-launcher-template.yaml#L59) to point to the SSH private key. Set `KeyName` in the head and worker node configs on [line 77](./gcp-ray-cluster-launcher-template.yaml#L77) and [line 113](./gcp-ray-cluster-launcher-template.yaml#L113).
+
+<br>
+
+6. Install and configure [the gcloud CLI](https://cloud.google.com/sdk/docs/install)
+    ```
+    # install prerequisites
+    apt-get install apt-transport-https ca-certificates gnupg curl
+
+    # install the gcloud CLI (the Google Cloud apt repository must be added first; see the install guide linked above)
+    apt-get install google-cloud-cli
+
+    # initialize and configure gcloud
+    gcloud init
+
+    ```
+
+<br>
+
+GCP References:
+[How to add SSH keys to VMs](https://cloud.google.com/compute/docs/connect/add-ssh-keys) (step 5)
+
+
+
+
+
+
+
+## Start and Test Ray with the Ray cluster launcher
+Run the following commands from your local machine:
+```
+# Create or update the cluster
+ray up <your-ray-cluster-template-for-different-platform>.yaml
+
+# Get a remote screen on the head node.
+ray attach <your-ray-cluster-template-for-different-platform>.yaml
+
+# Try running a Ray program.
+python -c 'import ray; ray.init()'
+exit
+
+# Tear down the cluster.
+ray down <your-ray-cluster-template-for-different-platform>.yaml
+```
+
+![Test screenshot](./images/test_screenshot.png)
+
+**Once the Ray cluster is up, the running nodes should be visible in each platform's console.**
+
+**For AWS at Emory:**
+![AWS screenshot](./images/aws_instances.png)
+
+<br>
+
+
+**For Azure portal:**
+![azure screenshot](./images/azure_portal.png)
+
+<br>
+
+**For GCP Console:**
+![GCP screenshot](./images/gcp_vms.png)
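
The `python -c 'import ray; ray.init()'` check in the README only confirms that Ray starts on the head node. A slightly fuller smoke test, sketched below, runs a batch of trivial tasks and reports which hosts executed them, which also shows whether the worker nodes have joined. This is a minimal sketch, not part of the PR: the function names `summarize_hosts` and `run_smoke_test` are hypothetical, and it assumes it is run on the head node (for example inside `ray attach`) so that `ray.init(address="auto")` can find the running cluster.

```python
import socket
import time
from collections import Counter


def summarize_hosts(hostnames):
    """Count how many tasks ran on each host, sorted by host name."""
    return dict(sorted(Counter(hostnames).items()))


def run_smoke_test(num_tasks=20):
    """Run trivial tasks across the cluster and report where they executed."""
    import ray  # imported lazily so summarize_hosts works without Ray installed

    # Connect to the already-running cluster instead of starting a local one.
    ray.init(address="auto")

    @ray.remote
    def where_am_i():
        # Sleep briefly so tasks overlap and get spread across nodes.
        time.sleep(0.5)
        return socket.gethostname()

    hosts = ray.get([where_am_i.remote() for _ in range(num_tasks)])
    print("cluster resources:", ray.cluster_resources())
    print("tasks per host:", summarize_hosts(hosts))
```

Calling `run_smoke_test()` on the head node prints the aggregated cluster resources and a per-host task count. If only one host name appears, the workers may still be starting; `ray status` on the head node shows pending nodes.
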
diff --git a/ray_cluster_launchers/aws-ray-cluster-launcher-template.yaml b/ray_cluster_launchers/aws-ray-cluster-launcher-template.yaml
@@ -0,0 +1,199 @@
+# A unique identifier for the head node and workers of this cluster.
+cluster_name: aws-ray-cluster
+
+# The maximum number of worker nodes to launch in addition to the head
+# node.
+max_workers: 2
+
+# The autoscaler will scale up the cluster faster with higher upscaling speed.
+# E.g., if the task requires adding more nodes then autoscaler will gradually
+# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
+# This number should be > 0.
+upscaling_speed: 1.0
+
+# This executes all commands on all nodes in the docker container,
+# and opens all the necessary ports to support the Ray cluster.
+# Empty string means disabled.
+docker:
+    image: "rayproject/ray-ml:latest-gpu" # You can change this to latest-cpu if you don't need GPU support and want a faster startup
+    # image: rayproject/ray:latest-cpu   # use this one if you don't need ML dependencies, it's faster to pull
+    container_name: "ray_container"
+    # If true, pulls latest version of image. Otherwise, `docker run` will only pull the image
+    # if no cached version is present.
+    pull_before_run: True
+    run_options:   # Extra options to pass into "docker run"
+        - --ulimit nofile=65536:65536
+
+    # Example of running a GPU head with CPU workers
+    # head_image: "rayproject/ray-ml:latest-gpu"
+    # Allow Ray to automatically detect GPUs
+
+    # worker_image: "rayproject/ray-ml:latest-cpu"
+    # worker_run_options: []
+
+# If a node is idle for this many minutes, it will be removed.
+idle_timeout_minutes: 5
+
+# Cloud-provider specific configuration.
+provider:
+    type: aws
+    region: us-east-1
+    # Availability zone(s), comma-separated, that nodes may be launched in.
+    # Nodes will be launched in the first listed availability zone and will
+    # be tried in the subsequent availability zones if launching fails.
+    # availability_zone: us-east-1a,us-east-1b
+    # Whether to allow node reuse. If set to False, nodes will be terminated
+    # instead of stopped.
+    cache_stopped_nodes: False # If not present, the default is True.
+    use_internal_ips: True
+    security_group:
+        GroupName: <aws-security-group-name>
+
+
+# How Ray will authenticate with newly launched nodes.
+auth:
+    ssh_user: <aws-ssh-user-name>
+# By default Ray creates a new private keypair, but you can also use your own.
+# If you do so, make sure to also set "KeyName" in the head and worker node
+# configurations below.
+    ssh_private_key: <path/to/aws-ssh-private-key>
+
+# Tell the autoscaler the allowed node types and the resources they provide.
+# The key is the name of the node type, which is just for debugging purposes.
+# The node config specifies the launch config and physical instance type.
+available_node_types:
+    head_node:
+        # The node type's CPU and GPU resources are auto-detected based on AWS instance type.
+        # If desired, you can override the autodetected CPU and GPU resources advertised to the autoscaler.
+        # You can also set custom resources.
+        # For example, to mark a node type as having 1 CPU, 1 GPU, and 5 units of a resource called "custom", set
+        # resources: {"CPU": 1, "GPU": 1, "custom": 5}
+        # resources: {}
+        # Provider-specific config for this node type, e.g. instance type. By default
+        # Ray will auto-configure unspecified fields such as SubnetId and KeyName.
+        # For more documentation on available fields, see:
+        # http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances
+        node_config:
+            SubnetIds: 
+                - <aws-subnet-id>
+            InstanceType: m5.large
+            # AMI IDs are region-specific; this one must exist in the provider region set above (us-east-1).
+            # Check https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/_private/aws/config.py
+            # for default images for other zones.
+            ImageId: ami-07caf09b362be10b8
+            KeyName: <aws-ssh-private-key-file-name>
+            # SecurityGroups: [public-ecg-group]
+            # You can provision additional disk space with a conf as follows
+            BlockDeviceMappings:
+                - DeviceName: /dev/xvda
+                  Ebs:
+                      VolumeSize: 150
+                      VolumeType: gp3
+            # Additional options in the boto docs.
+    worker_nodes:
+        # The minimum number of worker nodes of this type to launch.
+        # This number should be >= 0.
+        min_workers: 1
+        # The maximum number of worker nodes of this type to launch.
+        # This takes precedence over min_workers.
+        max_workers: 2
+        # The node type's CPU and GPU resources are auto-detected based on AWS instance type.
+        # If desired, you can override the autodetected CPU and GPU resources advertised to the autoscaler.
+        # You can also set custom resources.
+        # For example, to mark a node type as having 1 CPU, 1 GPU, and 5 units of a resource called "custom", set
+        # resources: {"CPU": 1, "GPU": 1, "custom": 5}
+        # resources: {}
+        # Provider-specific config for this node type, e.g. instance type. By default
+        # Ray will auto-configure unspecified fields such as SubnetId and KeyName.
+        # For more documentation on available fields, see:
+        # http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances
+        node_config:
+            SubnetIds: 
+                - <aws-subnet-id>
+            InstanceType: m5.large
+            # AMI IDs are region-specific; this one must exist in the provider region set above (us-east-1).
+            # Check https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/_private/aws/config.py
+            # for default images for other zones.
+            ImageId: ami-07caf09b362be10b8
+            KeyName: <aws-ssh-private-key-file-name>
+            # SecurityGroups: [public-ecg-group]
+            #     - public-ecg-group
+            # Workers run on-demand as configured here. Uncomment InstanceMarketOptions below to run them on spot.
+            # NOTE: If relying on spot instances, it is best to specify multiple different instance
+            # types to avoid interruption when one instance type is experiencing heightened demand.
+            # Demand information can be found at https://aws.amazon.com/ec2/spot/instance-advisor/
+            BlockDeviceMappings:
+                - DeviceName: /dev/xvda
+                  Ebs:
+                      VolumeSize: 150
+                      VolumeType: gp3
+            # InstanceMarketOptions:
+            #     MarketType: spot
+                # Additional options can be found in the boto docs, e.g.
+                #   SpotOptions:
+                #       MaxPrice: MAX_HOURLY_PRICE
+            # Additional options in the boto docs.
+
+# Specify the node type of the head node (as configured above).
+head_node_type: head_node
+
+# Files or directories to copy to the head and worker nodes. The format is a
+# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
+file_mounts: {
+#    "/path1/on/remote/machine": "/path1/on/local/machine",
+#    "/path2/on/remote/machine": "/path2/on/local/machine",
+}
+
+# Files or directories to copy from the head node to the worker nodes. The format is a
+# list of paths. The same path on the head node will be copied to the worker node.
+# This behavior is a subset of the file_mounts behavior. In the vast majority of cases
+# you should just use file_mounts. Only use this if you know what you're doing!
+cluster_synced_files: []
+
+# Whether changes to directories in file_mounts or cluster_synced_files in the head node
+# should sync to the worker node continuously
+file_mounts_sync_continuously: False
+
+# Patterns for files to exclude when running rsync up or rsync down
+rsync_exclude:
+    - "**/.git"
+    - "**/.git/**"
+
+# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
+# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
+# as a value, the behavior will match git's behavior for finding and using .gitignore files.
+rsync_filter:
+    - ".gitignore"
+
+# List of commands that will be run before `setup_commands`. If docker is
+# enabled, these commands will run outside the container and before docker
+# is set up.
+initialization_commands: []
+
+# List of shell commands to run to set up nodes.
+setup_commands:
+    - sleep 4
+    - sudo yum install -y python3-pip python-is-python3
+    - pip3 install ray[default] boto3 torch
+    # Note: if you're developing Ray, you probably want to create a Docker image that
+    # has your Ray repo pre-cloned. Then, you can replace the pip installs
+    # below with a git checkout <your_sha> (and possibly a recompile).
+    # To run the nightly version of ray (as opposed to the latest), either use a rayproject docker image
+    # that has the "nightly" (e.g. "rayproject/ray-ml:nightly-gpu") or uncomment the following line:
+    # - pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl"
+
+# Custom commands that will be run on the head node after common setup.
+head_setup_commands: []
+
+# Custom commands that will be run on worker nodes after common setup.
+worker_setup_commands: []
+
+# Command to start ray on the head node. You don't need to change this.
+head_start_ray_commands:
+    - ray stop
+    - ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host=0.0.0.0
+
+# Command to start ray on worker nodes. You don't need to change this.
+worker_start_ray_commands:
+    - ray stop
+    - ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076
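
The template above leaves several `<...>` placeholders to be filled in by hand, and the README refers to them by line number. Before running `ray up`, a small check like the following can list any placeholders that are still unfilled. This is a sketch under stated assumptions, not part of the PR: `find_placeholders` is a hypothetical helper, it expects the raw YAML text (without the diff `+` prefixes), it skips full-line comments, and it treats whitespace followed by `#` as the start of an inline comment, which is a simplification of YAML's quoting rules.

```python
import re

# Matches tokens such as <aws-subnet-id> or <path/to/aws-ssh-private-key>.
PLACEHOLDER = re.compile(r"<[A-Za-z][^<>\s]*>")


def find_placeholders(yaml_text):
    """Return (line_number, placeholder) pairs for unfilled template values."""
    found = []
    for lineno, line in enumerate(yaml_text.splitlines(), start=1):
        if line.strip().startswith("#"):
            continue  # ignore full-line comments
        # Drop a trailing inline comment (simplified: whitespace followed by '#').
        code = re.split(r"\s#", line, maxsplit=1)[0]
        for match in PLACEHOLDER.finditer(code):
            found.append((lineno, match.group(0)))
    return found
```

A possible use is `find_placeholders(open("aws-ray-cluster-launcher-template.yaml").read())`, aborting if the returned list is not empty. For the AWS template as it appears in this diff, the placeholders sit on lines 50, 55, 59, 78, 84, 112, and 118, so `ssh_user` on line 55 also needs a value even though the README steps do not mention it.
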