Add AMD examples with vLLM, Axolotl and Trl (#1693)
Add llama31-service-vllm-amd example

[Docs] Added vLLM with AMD example

Add AMD examples

Update AMD-Axolotl example with official example

Add build wheel tasks for AMD examples

Minor updates

Add necessary comments in AMD Readme

Co-authored-by: Bihan Rana <[email protected]>
Bihan and Bihan Rana authored Sep 28, 2024
1 parent f464b7b commit cf0a139
Showing 12 changed files with 531 additions and 24 deletions.
204 changes: 190 additions & 14 deletions examples/accelerators/amd/README.md
you can specify an AMD GPU under `resources`. Below are a few examples.
## Deployment

### Running as a service
You can use any serving framework, such as TGI and vLLM. Here's an example of a [service](https://dstack.ai/docs/services) that deploys
Llama 3.1 70B in FP16 using [TGI :material-arrow-top-right-thin:{ .external }](https://huggingface.co/docs/text-generation-inference/en/installation_amd){:target="_blank"} and [vLLM :material-arrow-top-right-thin:{ .external }](https://docs.vllm.ai/en/latest/getting_started/amd-installation.html){:target="_blank"}.

=== "TGI"

<div editor-title="examples/deployment/tgi/amd/service.dstack.yml">

```yaml
type: service
name: amd-service-tgi

# Using the official TGI ROCm Docker image
image: ghcr.io/huggingface/text-generation-inference:sha-a379d55-rocm

# Required environment variables
env:
  - HUGGING_FACE_HUB_TOKEN
  - MODEL_ID=meta-llama/Meta-Llama-3.1-70B-Instruct
  - TRUST_REMOTE_CODE=true
  - ROCM_USE_FLASH_ATTN_V2_TRITON=true
# Commands of the service
commands:
  - text-generation-launcher --port 8000
# Service port
port: 8000

resources:
  gpu: MI300X
  disk: 150GB

# Use spot or on-demand instances
spot_policy: auto

# Register the model
model:
  type: chat
  name: meta-llama/Meta-Llama-3.1-70B-Instruct
```

</div>


=== "vLLM"

<div editor-title="examples/deployment/vllm/amd/service.dstack.yml">

```yaml
type: service
name: llama31-service-vllm-amd

# Using RunPod's ROCm Docker image
image: runpod/pytorch:2.4.0-py3.10-rocm6.1.0-ubuntu22.04

# Required environment variables
env:
  - HUGGING_FACE_HUB_TOKEN
  - MODEL_ID=meta-llama/Meta-Llama-3.1-70B-Instruct
  - MAX_MODEL_LEN=126192
# Commands of the service
commands:
  - export PATH=/opt/conda/envs/py_3.10/bin:$PATH
  - wget https://github.com/ROCm/hipBLAS/archive/refs/tags/rocm-6.1.0.zip
  - unzip rocm-6.1.0.zip
  - cd hipBLAS-rocm-6.1.0
  - python rmake.py
  - cd ..
  - git clone https://github.com/vllm-project/vllm.git
  - cd vllm
  - pip install triton
  - pip uninstall torch -y
  - pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.1
  - pip install /opt/rocm/share/amd_smi
  - pip install --upgrade numba scipy huggingface-hub[cli]
  - pip install "numpy<2"
  - pip install -r requirements-rocm.txt
  - wget -N https://github.com/ROCm/vllm/raw/fa78403/rocm_patch/libamdhip64.so.6 -P /opt/rocm/lib
  - rm -f "$(python3 -c 'import torch; print(torch.__path__[0])')"/lib/libamdhip64.so*
  - export PYTORCH_ROCM_ARCH="gfx90a;gfx942"
  - wget https://dstack-binaries.s3.amazonaws.com/vllm-0.6.0%2Brocm614-cp310-cp310-linux_x86_64.whl
  - pip install vllm-0.6.0+rocm614-cp310-cp310-linux_x86_64.whl
  - vllm serve $MODEL_ID --max-model-len $MAX_MODEL_LEN --port 8000
# Service port
port: 8000

# Use spot or on-demand instances
spot_policy: auto

resources:
  gpu: MI300X
  disk: 200GB

# Register the model
model:
  format: openai
  type: chat
  name: meta-llama/Meta-Llama-3.1-70B-Instruct
```
</div>

Note that the maximum size of vLLM's `KV cache` is 126192; consequently, we set `MAX_MODEL_LEN` to 126192. Adding `/opt/conda/envs/py_3.10/bin` to `PATH` ensures we use the Python 3.10 environment required by the pre-built binaries, which are compiled specifically for this Python version.

> To speed up the `vLLM-ROCm` installation, we use a pre-built binary from S3.
> You can find the task to build and upload the binary in [`examples/deployment/vllm/amd` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/deployment/vllm/amd){:target="_blank"}.

!!! info "Docker image"
    If you want to use AMD, specifying `image` is currently required. This must be an image that includes
    ROCm drivers.

To request multiple GPUs, specify the quantity after the GPU name, separated by a colon, e.g., `MI300X:4`.
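
For example, the following `resources` snippet is a minimal sketch of requesting four MI300X GPUs; the disk size is illustrative and not part of this example's source:

```yaml
resources:
  # Four MI300X GPUs on a single instance
  gpu: MI300X:4
  # Illustrative disk size; size it for the model you plan to run
  disk: 200GB
```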

AMD accelerators can also be used with other frameworks like Ollama, and we'll be adding more examples soon.

## Fine-tuning

=== "TRL"

Below is an example of LoRA fine-tuning Llama 3.1 8B using [TRL :material-arrow-top-right-thin:{ .external }](https://rocm.docs.amd.com/en/latest/how-to/llm-fine-tuning-optimization/single-gpu-fine-tuning-and-inference.html){:target="_blank"}
and the [`mlabonne/guanaco-llama2-1k` :material-arrow-top-right-thin:{ .external }](https://huggingface.co/datasets/mlabonne/guanaco-llama2-1k){:target="_blank"}
dataset.

<div editor-title="examples/fine-tuning/trl/amd/train.dstack.yml">

```yaml
type: task
name: trl-amd-llama31-train

# Using RunPod's ROCm Docker image
image: runpod/pytorch:2.1.2-py3.10-rocm6.1-ubuntu22.04

# Required environment variables
env:
  - HUGGING_FACE_HUB_TOKEN
# Commands of the task
commands:
  - export PATH=/opt/conda/envs/py_3.10/bin:$PATH
  - git clone https://github.com/ROCm/bitsandbytes
  - cd bitsandbytes
  - git checkout rocm_enabled
  - pip install -r requirements-dev.txt
  - cmake -DBNB_ROCM_ARCH="gfx942" -DCOMPUTE_BACKEND=hip -S .
  - make
  - pip install .
  - pip install trl
  - pip install peft
  - pip install transformers datasets huggingface-hub scipy
  - cd ..
  - python examples/fine-tuning/trl/amd/train.py

# Use spot or on-demand instances
spot_policy: auto

resources:
  gpu: MI300X
  disk: 150GB
```

</div>

=== "Axolotl"
Below is an example of fine-tuning Llama 3.1 8B using [Axolotl :material-arrow-top-right-thin:{ .external }](https://rocm.blogs.amd.com/artificial-intelligence/axolotl/README.html){:target="_blank"}
and the [tatsu-lab/alpaca :material-arrow-top-right-thin:{ .external }](https://huggingface.co/datasets/tatsu-lab/alpaca){:target="_blank"}
dataset.

<div editor-title="examples/fine-tuning/axolotl/amd/train.dstack.yml">

```yaml
type: task
name: axolotl-amd-llama31-train

# Using RunPod's ROCm Docker image
image: runpod/pytorch:2.1.2-py3.10-rocm6.0.2-ubuntu22.04
# Required environment variables
env:
  - HUGGING_FACE_HUB_TOKEN
# Commands of the task
commands:
  - export PATH=/opt/conda/envs/py_3.10/bin:$PATH
  - pip uninstall torch torchvision torchaudio -y
  - python3 -m pip install --pre torch==2.3.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.0/
  - git clone https://github.com/OpenAccess-AI-Collective/axolotl
  - cd axolotl
  - git checkout d4f6c65
  - pip install -e .
  - cd ..
  - wget https://dstack-binaries.s3.amazonaws.com/flash_attn-2.0.4-cp310-cp310-linux_x86_64.whl
  - pip install flash_attn-2.0.4-cp310-cp310-linux_x86_64.whl
  - wget https://dstack-binaries.s3.amazonaws.com/xformers-0.0.26-cp310-cp310-linux_x86_64.whl
  - pip install xformers-0.0.26-cp310-cp310-linux_x86_64.whl
  - git clone --recurse https://github.com/ROCm/bitsandbytes
  - cd bitsandbytes
  - git checkout rocm_enabled
  - pip install -r requirements-dev.txt
  - cmake -DBNB_ROCM_ARCH="gfx942" -DCOMPUTE_BACKEND=hip -S .
  - make
  - pip install .
  - cd ..
  - accelerate launch -m axolotl.cli.train axolotl/examples/llama-3/fft-8b.yaml

# Use spot or on-demand instances
spot_policy: auto

resources:
  gpu: MI300X
  disk: 150GB
```
</div>
Note that to support ROCm, we need to check out commit `d4f6c65`. You can find the installation instructions in [rocm-blogs :material-arrow-top-right-thin:{ .external }](https://github.com/ROCm/rocm-blogs/blob/release/blogs/artificial-intelligence/axolotl/src/Dockerfile.rocm){:target="_blank"}.

> To speed up the installation of `flash-attention` and `xformers`, we use pre-built binaries uploaded to S3.
> You can find the tasks that build and upload the binaries
> in [examples/fine-tuning/axolotl/amd :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/fine-tuning/axolotl/amd){:target="_blank"}.
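
As a rough illustration, these build tasks follow the same pattern as `examples/deployment/vllm/amd/build.vllm-rocm.yaml`, shown later in this commit: build the wheel inside a ROCm image, then copy it to S3. The sketch below is an assumption-laden outline — the repository revision, build flags, and architecture variable are illustrative, not taken from the actual tasks; see the linked directory for the real configurations.

```yaml
type: task
name: build-flash-attn-rocm

# Same ROCm image as the Axolotl training task above
image: runpod/pytorch:2.1.2-py3.10-rocm6.0.2-ubuntu22.04

env:
  - AWS_ACCESS_KEY_ID
  - AWS_SECRET_ACCESS_KEY
  - AWS_REGION
  - BUCKET_NAME

commands:
  - apt-get update -y && apt-get install awscli -y
  - aws configure set aws_access_key_id $AWS_ACCESS_KEY_ID
  - aws configure set aws_secret_access_key $AWS_SECRET_ACCESS_KEY
  - aws configure set region $AWS_REGION
  - export PATH=/opt/conda/envs/py_3.10/bin:$PATH
  # Assumed architecture flag for MI300X; mirrors the bitsandbytes build above
  - export GPU_ARCHS="gfx942"
  - git clone https://github.com/ROCm/flash-attention
  - cd flash-attention
  - pip install wheel setuptools ninja packaging
  - python setup.py bdist_wheel -d dist/
  - cd dist
  - aws s3 cp "$(ls -1 | head -n 1)" s3://$BUCKET_NAME/ --acl public-read

spot_policy: auto

resources:
  gpu: MI300X
  disk: 150GB
```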

## Running a configuration

Once the configuration is ready, run `dstack apply -f <configuration file>`, and `dstack` will automatically provision the
cloud resources and run the configuration.

<div class="termy">

```shell
$ HUGGING_FACE_HUB_TOKEN=...
$ dstack apply -f examples/deployment/vllm/amd/service.dstack.yml
```

</div>

## Fleets

By default, `dstack apply` reuses `idle` instances from one of the existing [fleets](https://dstack.ai/docs/fleets).
If no `idle` instances meet the requirements, it creates a new fleet using one of the configured backends.

Use [fleets](https://dstack.ai/docs/fleets.md) configurations to create fleets manually. This reduces startup time for dev environments,
tasks, and services, and is very convenient if you want to reuse fleets across runs.
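
Below is a minimal sketch of a fleet configuration for MI300X instances; the name, node count, and disk size are illustrative and not part of this example's source:

```yaml
type: fleet
# Illustrative fleet name
name: amd-fleet

# Number of instances to provision
nodes: 1

resources:
  gpu: MI300X
  # Illustrative disk size
  disk: 200GB

spot_policy: auto
```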

## Dev environments

Dev environments allow you to run commands interactively.
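
For reference, the AMD dev environment configuration added by this commit (reproduced in full further below as `examples/deployment/vllm/amd/.dstack.yml`) boils down to:

```yaml
type: dev-environment
name: dev-vLLM-amd

# ROCm image with PyTorch preinstalled
image: runpod/pytorch:2.4.0-py3.10-rocm6.1.0-ubuntu22.04

env:
  - HUGGING_FACE_HUB_TOKEN

ide: vscode

resources:
  gpu: MI300X
  disk: 150GB

spot_policy: auto
```
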
## Source code

The source code of this example can be found in
[`examples/deployment/tgi/amd` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/deployment/tgi/amd){:target="_blank"},
[`examples/deployment/vllm/amd` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/deployment/vllm/amd){:target="_blank"},
[`examples/fine-tuning/axolotl/amd` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/fine-tuning/axolotl/amd){:target="_blank"} and
[`examples/fine-tuning/trl/amd` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/fine-tuning/trl/amd){:target="_blank"}.

## What's next?

1. Browse [TGI :material-arrow-top-right-thin:{ .external }](https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/deploy-your-model.html#serving-using-hugging-face-tgi),
[vLLM :material-arrow-top-right-thin:{ .external }](https://docs.vllm.ai/en/latest/getting_started/amd-installation.html#build-from-source-rocm),
[Axolotl :material-arrow-top-right-thin:{ .external }](https://github.com/ROCm/rocm-blogs/tree/release/blogs/artificial-intelligence/axolotl),
[TRL :material-arrow-top-right-thin:{ .external }](https://rocm.docs.amd.com/en/latest/how-to/llm-fine-tuning-optimization/fine-tuning-and-inference.html) and
[ROCm Bitsandbytes :material-arrow-top-right-thin:{ .external }](https://github.com/ROCm/bitsandbytes).
2. Check [dev environments](https://dstack.ai/docs/dev-environments), [tasks](https://dstack.ai/docs/tasks), and
[services](https://dstack.ai/docs/services).
1 change: 0 additions & 1 deletion examples/accelerators/tpu/README.md
Below are a few examples of using TPUs for deployment and fine-tuning.

## Deployment

### Running as a service
You can use any serving framework, such as vLLM or TGI. Here's an example of a [service](https://dstack.ai/docs/services) that deploys
Llama 3.1 8B using
[Optimum TPU :material-arrow-top-right-thin:{ .external }](https://github.com/huggingface/optimum-tpu){:target="_blank"}
15 changes: 15 additions & 0 deletions examples/deployment/vllm/amd/.dstack.yml
type: dev-environment
name: dev-vLLM-amd

image: runpod/pytorch:2.4.0-py3.10-rocm6.1.0-ubuntu22.04

env:
  - HUGGING_FACE_HUB_TOKEN

ide: vscode

resources:
  gpu: MI300X
  disk: 150GB

spot_policy: auto
46 changes: 46 additions & 0 deletions examples/deployment/vllm/amd/build.vllm-rocm.yaml
type: task
name: build-vllm-rocm

image: runpod/pytorch:2.4.0-py3.10-rocm6.1.0-ubuntu22.04

env:
  - HUGGING_FACE_HUB_TOKEN
  - AWS_ACCESS_KEY_ID
  - AWS_SECRET_ACCESS_KEY
  - AWS_REGION
  - BUCKET_NAME

commands:
  - apt-get update -y
  - apt-get install awscli -y
  - aws configure set aws_access_key_id $AWS_ACCESS_KEY_ID
  - aws configure set aws_secret_access_key $AWS_SECRET_ACCESS_KEY
  - aws configure set region $AWS_REGION
  - export PATH=/opt/conda/envs/py_3.10/bin:$PATH
  - wget https://github.com/ROCm/hipBLAS/archive/refs/tags/rocm-6.1.0.zip
  - unzip rocm-6.1.0.zip
  - cd hipBLAS-rocm-6.1.0
  - python rmake.py
  - cd ..
  - git clone https://github.com/vllm-project/vllm.git
  - cd vllm
  - pip install triton
  - pip uninstall torch -y
  - pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.1
  - pip install /opt/rocm/share/amd_smi
  - pip install --upgrade numba scipy huggingface-hub[cli]
  - pip install "numpy<2"
  - pip install -r requirements-rocm.txt
  - wget -N https://github.com/ROCm/vllm/raw/fa78403/rocm_patch/libamdhip64.so.6 -P /opt/rocm/lib
  - rm -f "$(python3 -c 'import torch; print(torch.__path__[0])')"/lib/libamdhip64.so*
  - export PYTORCH_ROCM_ARCH="gfx90a;gfx942"
  - pip install wheel setuptools setuptools_scm ninja
  - python setup.py bdist_wheel -d dist/
  - cd dist
  - aws s3 cp "$(ls -1 | head -n 1)" s3://$BUCKET_NAME/ --acl public-read

spot_policy: auto

resources:
  gpu: MI300X
  disk: 150GB
49 changes: 49 additions & 0 deletions examples/deployment/vllm/amd/service.dstack.yml
type: service
name: llama31-service-vllm-amd

image: runpod/pytorch:2.4.0-py3.10-rocm6.1.0-ubuntu22.04

env:
  - HUGGING_FACE_HUB_TOKEN
  - MODEL_ID=meta-llama/Meta-Llama-3.1-70B-Instruct
  - MAX_MODEL_LEN=126192

commands:
  - export PATH=/opt/conda/envs/py_3.10/bin:$PATH
  - wget https://github.com/ROCm/hipBLAS/archive/refs/tags/rocm-6.1.0.zip
  - unzip rocm-6.1.0.zip
  - cd hipBLAS-rocm-6.1.0
  - python rmake.py
  - cd ..
  - git clone https://github.com/vllm-project/vllm.git
  - cd vllm
  - pip install triton
  - pip uninstall torch -y
  - pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.1
  - pip install /opt/rocm/share/amd_smi
  - pip install --upgrade numba scipy huggingface-hub[cli]
  - pip install "numpy<2"
  - pip install -r requirements-rocm.txt
  - wget -N https://github.com/ROCm/vllm/raw/fa78403/rocm_patch/libamdhip64.so.6 -P /opt/rocm/lib
  - rm -f "$(python3 -c 'import torch; print(torch.__path__[0])')"/lib/libamdhip64.so*
  - export PYTORCH_ROCM_ARCH="gfx90a;gfx942"
  - wget https://dstack-binaries.s3.amazonaws.com/vllm-0.6.0%2Brocm614-cp310-cp310-linux_x86_64.whl
  - pip install vllm-0.6.0+rocm614-cp310-cp310-linux_x86_64.whl
  - vllm serve $MODEL_ID
    --max-model-len $MAX_MODEL_LEN
    --port 8000

# Expose the vllm server port
port: 8000

spot_policy: auto

resources:
  gpu: MI300X
  disk: 200GB

# (Optional) Enable the OpenAI-compatible endpoint
model:
  format: openai
  type: chat
  name: meta-llama/Meta-Llama-3.1-70B-Instruct