Commit 35bff37

add public Docker setup for ray cluster

Signed-off-by: Jack Luar <[email protected]>
1 parent 2eca2ac

7 files changed: +268 −18 lines changed
tools/AutoTuner/.gitignore (+4)

@@ -10,3 +10,7 @@ __pycache__/
 # Autotuner env
 autotuner_env
 .env
+
+# Ray distributed
+public.yaml
+private.yaml
tools/AutoTuner/distributed/.env.sample (+2)

@@ -0,0 +1,2 @@
+DOCKERHUB_USERNAME={{DOCKERHUB_USERNAME}}
+DOCKERHUB_PASSWORD={{DOCKERHUB_PASSWORD}}
tools/AutoTuner/distributed/Dockerfile (+10)

@@ -0,0 +1,10 @@
+ARG BASE_TAG
+FROM openroad/flow-ubuntu22.04-builder:${BASE_TAG:-latest}
+
+# Install AT required packages
+RUN rm -rf ~/.cache/pip
+RUN pip3 cache purge
+RUN pip3 install --no-cache-dir -r /OpenROAD-flow-scripts/tools/AutoTuner/requirements.txt
+
+# ORFS installation dir
+WORKDIR /OpenROAD-flow-scripts/tools/AutoTuner/src/autotuner

tools/AutoTuner/distributed/Makefile (+39)

@@ -0,0 +1,39 @@
+.PHONY: clean
+include .env
+export
+
+init:
+	@echo "Setting up environment..."
+	@../installer.sh
+
+clean:
+	@echo "Cleaning up old images"
+	@docker rmi orfs-autotuner:latest
+
+base:
+	@echo "Building base image..."
+	@cd ../../../ && ./build_openroad.sh
+
+docker:
+	@echo "Building docker image..."
+	@export BASE_TAG=$(shell cd ../../../ && ./etc/DockerTag.sh -dev) && \
+	echo "Base image tag: $$BASE_TAG" && \
+	docker build -t orfs-autotuner:latest -f Dockerfile . --build-arg BASE_TAG=$$BASE_TAG && \
+	docker tag orfs-autotuner:latest orfs-autotuner:$$BASE_TAG
+
+upload:
+	@echo "Uploading docker image..."
+	@docker login -u $(DOCKERHUB_USERNAME) -p $(DOCKERHUB_PASSWORD)
+	@export BASE_TAG=$(shell cd ../../../ && ./etc/DockerTag.sh -dev) && \
+	echo "Base image: $$BASE_TAG" && \
+	docker tag orfs-autotuner:latest ${DOCKERHUB_USERNAME}/orfs-autotuner:$$BASE_TAG && \
+	docker push ${DOCKERHUB_USERNAME}/orfs-autotuner:$$BASE_TAG
+	@docker logout
+
+up:
+	@echo "Starting Ray cluster..."
+	@. .venv/bin/activate && ray up -y public.yaml
+
+down:
+	@echo "Stopping Ray cluster..."
+	@. .venv/bin/activate && ray down -y public.yaml

tools/AutoTuner/distributed/NOTES.md (+26)

@@ -0,0 +1,26 @@
+1) Set up two AT instances on the same internal network
+2) Install the requirements:
+
+```
+sudo apt-get install -y python3-pip python3-venv
+python3 -m venv .venv
+. .venv/bin/activate && pip install ray[tune]
+```
+
+3) Common setup script
+   - `at_distributed.sh`
+
+4) Worker script
+   - `at_worker.py`
+   - `mkdir -p /tmp/owo && touch /tmp/owo/abc`
+
+5) Benchmark file transfers (run on the worker)
+   - Observation: `sync_dir` only ensures the files are in sync, so a neat feature is that only file diffs are transferred.
+   - You do not have to create the destination directory; `sync_dir` does that for you.
+   - `max_size_bytes` is capped at 1 GiB, so the restriction has to be lifted manually if needed.
+   - The bottleneck appears to start at transfers of 1 GiB and above:
+   - `dd if=/dev/zero of=/tmp/owo/owo bs=1M count=100` creates a 100 MB file (time taken: 2.21 ± 0.56 s)
+   - `dd if=/dev/zero of=/tmp/owo/owo bs=1M count=1000` creates a 1 GB file (time taken: 8.90 ± 0.65 s)
+   - `dd if=/dev/zero of=/tmp/owo/owo bs=1M count=5000` creates a 5 GB file (time taken: 54.92 ± 1.05 s)

tools/AutoTuner/distributed/README.md (+41 −18)

@@ -1,26 +1,49 @@
-1) Setup two AT instances on same internal network
-2) Setup the requirements
+# Ray Cluster Setup on Google Cloud Platform (GCP)
 
+This tutorial covers the setup of Ray Clusters on GCP. Ray Clusters are a way to
+run compute-intensive jobs (e.g. AutoTuner) on a distributed set of nodes spawned
+automatically. For more information on Ray Clusters, refer to the [Ray documentation](https://docs.ray.io/en/latest/cluster/getting-started.html).
+
+To run AutoTuner jobs on a Ray Cluster, we first have to install ORFS onto the
+GCP nodes.
+
+There are two ways to set up ORFS on a Ray Cluster, namely:
+- [Public](#public-cluster-setup): Upload a Docker image to Dockerhub (or any public Docker registry).
+- [Private](#private-cluster-setup): Upload local code and re-compile on the nodes.
+
+## Prerequisites
+
+Make sure the AutoTuner prerequisites are installed. To do so, refer to the installation script.
+
+```bash
+make init
 ```
-sudo apt-get install -y python3-pip python3-venv
-python3 -m venv .venv
-.venv/bin/activate && pip install ray[tune]
 
+## Public cluster setup
+
+1. Set up `.env` with your Docker registry username/password. Also, set up the `public.yaml`
+file according to your desired specifications.
+
+```bash
+cp .env.sample .env
+cp public.yaml.template public.yaml
 ```
 
-3) Common setup script
-- `at_distributed.sh`
+2. Run the following commands to build, tag and upload the public image:
+
+```bash
+make clean
+make base
+make docker
+make upload
+```
 
-4) Worker script
-- `at_worker.py`
-- `mkdir -p /tmp/owo && touch /tmp/owo/abc`
+3. Launch your cluster as follows:
+
+```bash
+make up
+```
 
+## Private cluster setup
 
-5) Benchmark file transfers (do on worker)
-- Observation: sync_dir just makes sure the files are sync-ed. So neat feature is that only file diffs are transffered.
-- You do not have to create the dest_dir, sync_dir does that for you.
-- `max_size_bytes` is limited to 1GiB. So we have to lift up the restriction manually if needed.
-- Bottleneck seems to start at 1GiB transfers and above
-- `dd if=/dev/zero of=/tmp/owo/owo bs=1M count=100` - creates 100MB file. (Time taken: 2.21 ± 0.56)
-- `dd if=/dev/zero of=/tmp/owo/owo bs=1M count=1000` - creates 1Gb file. (Time taken: 8.90 ± 0.65)
-- `dd if=/dev/zero of=/tmp/owo/owo bs=1M count=5000` - creates 5Gb file. (Time taken: 54.92 ± 1.05)
+Coming soon.
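Once `make up` succeeds, the standard Ray cluster-launcher CLI (installed in the `.venv` created earlier) can be pointed at the same `public.yaml` to interact with the running cluster. A few illustrative commands, not part of this commit:

```shell
# Open an interactive shell on the head node
ray attach public.yaml

# Run a one-off command on the head node, e.g. check autoscaler state
ray exec public.yaml "ray status"

# Tail the autoscaler logs while the cluster scales up/down
ray monitor public.yaml
```

These complement the `make up` / `make down` targets in the Makefile, which wrap `ray up` and `ray down`.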
tools/AutoTuner/distributed/public.yaml.template (+146)

@@ -0,0 +1,146 @@
+# A unique identifier for the head node and workers of this cluster.
+cluster_name: default
+
+# The maximum number of worker nodes to launch in addition to the head
+# node.
+max_workers: 2
+
+# The autoscaler will scale up the cluster faster with higher upscaling speed.
+# E.g., if the task requires adding more nodes then autoscaler will gradually
+# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
+# This number should be > 0.
+upscaling_speed: 1.0
+
+# This executes all commands on all nodes in the docker container,
+# and opens all the necessary ports to support the Ray cluster.
+# Empty string means disabled.
+docker:
+    image: "orfs-autotuner:latest"
+    container_name: "ray_container"
+    # If true, pulls latest version of image. Otherwise, `docker run` will only pull the image
+    # if no cached version is present.
+    pull_before_run: false
+
+# If a node is idle for this many minutes, it will be removed.
+idle_timeout_minutes: 5
+
+# Cloud-provider specific configuration.
+provider:
+    type: gcp
+    region: us-west1
+    availability_zone: us-west1-a
+    project_id: foss-fpga-tools-ext-openroad
+
+# How Ray will authenticate with newly launched nodes.
+auth:
+    ssh_user: ubuntu
+
+available_node_types:
+    ray_head_default:
+        resources: {"CPU": 2}
+        node_config:
+            machineType: n1-standard-2
+            disks:
+              - boot: true
+                autoDelete: true
+                type: PERSISTENT
+                initializeParams:
+                    diskSizeGb: 50
+                    sourceImage: projects/deeplearning-platform-release/global/images/common-cpu-v20240922
+    ray_worker_small:
+        # The minimum number of worker nodes of this type to launch.
+        # This number should be >= 0.
+        min_workers: 1
+        # The maximum number of worker nodes of this type to launch.
+        # This takes precedence over min_workers.
+        max_workers: 2
+        # The resources provided by this node type.
+        resources: {"CPU": 2}
+        node_config:
+            machineType: n1-standard-2
+            disks:
+              - boot: true
+                autoDelete: true
+                type: PERSISTENT
+                initializeParams:
+                    diskSizeGb: 50
+                    sourceImage: projects/deeplearning-platform-release/global/images/common-cpu-v20240922
+            # scheduling:
+            #   - preemptible: true
+            # Un-comment this to launch workers with the Service Account of the Head Node
+            # serviceAccounts:
+            #   - email: ray-autoscaler-sa-v1@<project_id>.iam.gserviceaccount.com
+            #     scopes:
+            #       - https://www.googleapis.com/auth/cloud-platform
+
+# Specify the node type of the head node (as configured above).
+head_node_type: ray_head_default
+
+# Files or directories to copy to the head and worker nodes. The format is a
+# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
+file_mounts: {
+#    "/path1/on/remote/machine": "/path1/on/local/machine",
+#    "/path2/on/remote/machine": "/path2/on/local/machine",
+}
+
+# Files or directories to copy from the head node to the worker nodes. The format is a
+# list of paths. The same path on the head node will be copied to the worker node.
+# This behavior is a subset of the file_mounts behavior. In the vast majority of cases
+# you should just use file_mounts. Only use this if you know what you're doing!
+cluster_synced_files: []
+
+# Whether changes to directories in file_mounts or cluster_synced_files in the head node
+# should sync to the worker node continuously
+file_mounts_sync_continuously: False
+
+# Patterns for files to exclude when running rsync up or rsync down
+rsync_exclude:
+    - "**/.git"
+    - "**/.git/**"
+
+# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
+# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
+# as a value, the behavior will match git's behavior for finding and using .gitignore files.
+rsync_filter:
+    - ".gitignore"
+
+initialization_commands:
+    - curl -fsSL https://get.docker.com -o get-docker.sh
+    - sudo sh get-docker.sh
+    - sudo usermod -aG docker $USER
+    - sudo systemctl restart docker -f
+
+# List of shell commands to run to set up nodes.
+setup_commands: []
+    # Note: if you're developing Ray, you probably want to create a Docker image that
+    # has your Ray repo pre-cloned. Then, you can replace the pip installs
+    # below with a git checkout <your_sha> (and possibly a recompile).
+    # To run the nightly version of ray (as opposed to the latest), either use a rayproject docker image
+    # that has the "nightly" (e.g. "rayproject/ray-ml:nightly-gpu") or uncomment the following line:
+    # - pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl"
+
+
+# Custom commands that will be run on the head node after common setup.
+head_setup_commands:
+    - pip install google-api-python-client==1.7.8
+
+# Custom commands that will be run on worker nodes after common setup.
+worker_setup_commands: []
+
+# Command to start ray on the head node. You don't need to change this.
+head_start_ray_commands:
+    - ray stop
+    - >-
+      ray start
+      --head
+      --port=6379
+      --object-manager-port=8076
+      --autoscaling-config=~/ray_bootstrap_config.yaml
+
+# Command to start ray on worker nodes. You don't need to change this.
+worker_start_ray_commands:
+    - ray stop
+    - >-
+      ray start
+      --address=$RAY_HEAD_IP:6379
+      --object-manager-port=8076
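Before running `ray up` on a copied `public.yaml`, it can be worth sanity-checking the edited file. A minimal sketch using PyYAML (a Ray dependency); the `check_cluster_config` helper and the chosen fields are illustrative, not part of this commit:

```python
import yaml


def check_cluster_config(text):
    """Parse a Ray cluster config and surface a few fields worth eyeballing."""
    cfg = yaml.safe_load(text)
    # The head node type must be one of the declared node types.
    assert cfg["head_node_type"] in cfg["available_node_types"], \
        "head_node_type not found in available_node_types"
    return {
        "cluster_name": cfg["cluster_name"],
        "max_workers": cfg["max_workers"],
        "docker_image": cfg["docker"]["image"],
    }


# Usage:
# with open("public.yaml") as f:
#     print(check_cluster_config(f.read()))
```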
