Add AMD examples with vLLM, Axolotl and Trl (#1693)
Add llama31-service-vllm-amd example

[Docs] Added vLLM with AMD example

Add AMD examples

Update AMD-Axolotl example with official example

Add build wheel tasks for AMD examples

Minor updates

Add necessary comments in AMD Readme

Co-authored-by: Bihan Rana <[email protected]>
Bihan and Bihan Rana authored Sep 28, 2024
1 parent f464b7b commit cf0a139
Showing 12 changed files with 531 additions and 24 deletions.
204 changes: 190 additions & 14 deletions examples/accelerators/amd/README.md
you can specify an AMD GPU under `resources`. Below are a few examples.
## Deployment

### Running as a service
You can use any serving framework, such as TGI and vLLM. Here's an example of a [service](https://dstack.ai/docs/services) that deploys
Llama 3.1 70B in FP16 using [TGI :material-arrow-top-right-thin:{ .external }](https://huggingface.co/docs/text-generation-inference/en/installation_amd){:target="_blank"} and [vLLM :material-arrow-top-right-thin:{ .external }](https://docs.vllm.ai/en/latest/getting_started/amd-installation.html){:target="_blank"}.

=== "TGI"

<div editor-title="examples/deployment/tgi/amd/service.dstack.yml">

```yaml
type: service
name: amd-service-tgi

# Using the official TGI ROCm Docker image
image: ghcr.io/huggingface/text-generation-inference:sha-a379d55-rocm

# Required environment variables
env:
  - HUGGING_FACE_HUB_TOKEN
  - MODEL_ID=meta-llama/Meta-Llama-3.1-70B-Instruct
  - TRUST_REMOTE_CODE=true
  - ROCM_USE_FLASH_ATTN_V2_TRITON=true
# Commands of the service
commands:
  - text-generation-launcher --port 8000
# Service port
port: 8000

resources:
  gpu: MI300X
  disk: 150GB

# Use spot or on-demand instances
spot_policy: auto

# Register the model
model:
  type: chat
  name: meta-llama/Meta-Llama-3.1-70B-Instruct
```

</div>


=== "vLLM"

<div editor-title="examples/deployment/vllm/amd/service.dstack.yml">

```yaml
type: service
name: llama31-service-vllm-amd

# Using RunPod's ROCm Docker image
image: runpod/pytorch:2.4.0-py3.10-rocm6.1.0-ubuntu22.04

# Required environment variables
env:
  - HUGGING_FACE_HUB_TOKEN
  - MODEL_ID=meta-llama/Meta-Llama-3.1-70B-Instruct
  - MAX_MODEL_LEN=126192
# Commands of the service
commands:
  - export PATH=/opt/conda/envs/py_3.10/bin:$PATH
  - wget https://github.com/ROCm/hipBLAS/archive/refs/tags/rocm-6.1.0.zip
  - unzip rocm-6.1.0.zip
  - cd hipBLAS-rocm-6.1.0
  - python rmake.py
  - cd ..
  - git clone https://github.com/vllm-project/vllm.git
  - cd vllm
  - pip install triton
  - pip uninstall torch -y
  - pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.1
  - pip install /opt/rocm/share/amd_smi
  - pip install --upgrade numba scipy huggingface-hub[cli]
  - pip install "numpy<2"
  - pip install -r requirements-rocm.txt
  - wget -N https://github.com/ROCm/vllm/raw/fa78403/rocm_patch/libamdhip64.so.6 -P /opt/rocm/lib
  - rm -f "$(python3 -c 'import torch; print(torch.__path__[0])')"/lib/libamdhip64.so*
  - export PYTORCH_ROCM_ARCH="gfx90a;gfx942"
  - wget https://dstack-binaries.s3.amazonaws.com/vllm-0.6.0%2Brocm614-cp310-cp310-linux_x86_64.whl
  - pip install vllm-0.6.0+rocm614-cp310-cp310-linux_x86_64.whl
  - vllm serve $MODEL_ID --max-model-len $MAX_MODEL_LEN --port 8000
# Service port
port: 8000

# Use spot or on-demand instances
spot_policy: auto

resources:
  gpu: MI300X
  disk: 200GB

# Register the model
model:
  format: openai
  type: chat
  name: meta-llama/Meta-Llama-3.1-70B-Instruct
```
</div>

Note that the maximum size of vLLM's `KV cache` is 126192; consequently, we set `MAX_MODEL_LEN` to 126192. Adding `/opt/conda/envs/py_3.10/bin` to `PATH` ensures we use the Python 3.10 environment required by the pre-built binaries, which are compiled specifically for this Python version.

> To speed up the `vLLM-ROCm` installation, we use a pre-built binary from S3.
> You can find the task to build and upload the binary in [`examples/deployment/vllm/amd` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/deployment/vllm/amd){:target="_blank"}.

!!! info "Docker image"
    If you want to use AMD, specifying `image` is currently required. This must be an image that includes
    ROCm drivers.

To request multiple GPUs, specify the quantity after the GPU name, separated by a colon, e.g., `MI300X:4`.
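
For example, the following `resources` snippet is a minimal sketch of requesting four MI300X GPUs; the disk size is illustrative and not part of this example's source:

```yaml
resources:
  # Four MI300X GPUs on a single instance
  gpu: MI300X:4
  # Illustrative disk size; size it for the model you plan to run
  disk: 200GB
```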

AMD accelerators can also be used with other frameworks like Ollama, and we'll be adding more examples soon.

## Fine-tuning

=== "TRL"

Below is an example of LoRA fine-tuning Llama 3.1 8B using [TRL :material-arrow-top-right-thin:{ .external }](https://rocm.docs.amd.com/en/latest/how-to/llm-fine-tuning-optimization/single-gpu-fine-tuning-and-inference.html){:target="_blank"}
and the [`mlabonne/guanaco-llama2-1k` :material-arrow-top-right-thin:{ .external }](https://huggingface.co/datasets/mlabonne/guanaco-llama2-1k){:target="_blank"}
dataset.

<div editor-title="examples/fine-tuning/trl/amd/train.dstack.yml">

```yaml
type: task
name: trl-amd-llama31-train

# Using RunPod's ROCm Docker image
image: runpod/pytorch:2.1.2-py3.10-rocm6.1-ubuntu22.04

# Required environment variables
env:
  - HUGGING_FACE_HUB_TOKEN
# Commands of the task
commands:
  - export PATH=/opt/conda/envs/py_3.10/bin:$PATH
  - git clone https://github.com/ROCm/bitsandbytes
  - cd bitsandbytes
  - git checkout rocm_enabled
  - pip install -r requirements-dev.txt
  - cmake -DBNB_ROCM_ARCH="gfx942" -DCOMPUTE_BACKEND=hip -S .
  - make
  - pip install .
  - pip install trl
  - pip install peft
  - pip install transformers datasets huggingface-hub scipy
  - cd ..
  - python examples/fine-tuning/trl/amd/train.py

# Use spot or on-demand instances
spot_policy: auto

resources:
  gpu: MI300X
  disk: 150GB
```

</div>

=== "Axolotl"
Below is an example of fine-tuning Llama 3.1 8B using [Axolotl :material-arrow-top-right-thin:{ .external }](https://rocm.blogs.amd.com/artificial-intelligence/axolotl/README.html){:target="_blank"}
and the [tatsu-lab/alpaca :material-arrow-top-right-thin:{ .external }](https://huggingface.co/datasets/tatsu-lab/alpaca){:target="_blank"}
dataset.

<div editor-title="examples/fine-tuning/axolotl/amd/train.dstack.yml">

```yaml
type: task
name: axolotl-amd-llama31-train

# Using RunPod's ROCm Docker image
image: runpod/pytorch:2.1.2-py3.10-rocm6.0.2-ubuntu22.04
# Required environment variables
env:
  - HUGGING_FACE_HUB_TOKEN
# Commands of the task
commands:
  - export PATH=/opt/conda/envs/py_3.10/bin:$PATH
  - pip uninstall torch torchvision torchaudio -y
  - python3 -m pip install --pre torch==2.3.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.0/
  - git clone https://github.com/OpenAccess-AI-Collective/axolotl
  - cd axolotl
  - git checkout d4f6c65
  - pip install -e .
  - cd ..
  - wget https://dstack-binaries.s3.amazonaws.com/flash_attn-2.0.4-cp310-cp310-linux_x86_64.whl
  - pip install flash_attn-2.0.4-cp310-cp310-linux_x86_64.whl
  - wget https://dstack-binaries.s3.amazonaws.com/xformers-0.0.26-cp310-cp310-linux_x86_64.whl
  - pip install xformers-0.0.26-cp310-cp310-linux_x86_64.whl
  - git clone --recurse https://github.com/ROCm/bitsandbytes
  - cd bitsandbytes
  - git checkout rocm_enabled
  - pip install -r requirements-dev.txt
  - cmake -DBNB_ROCM_ARCH="gfx942" -DCOMPUTE_BACKEND=hip -S .
  - make
  - pip install .
  - cd ..
  - accelerate launch -m axolotl.cli.train axolotl/examples/llama-3/fft-8b.yaml

# Use spot or on-demand instances
spot_policy: auto

resources:
  gpu: MI300X
  disk: 150GB
```
</div>
Note that to support ROCm, we need to check out commit `d4f6c65`. You can find the installation instructions in [rocm-blogs :material-arrow-top-right-thin:{ .external }](https://github.com/ROCm/rocm-blogs/blob/release/blogs/artificial-intelligence/axolotl/src/Dockerfile.rocm){:target="_blank"}.

> To speed up the installation of `flash-attention` and `xformers`, we use pre-built binaries uploaded to S3.
> You can find the tasks that build and upload the binaries
> in [examples/fine-tuning/axolotl/amd :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/fine-tuning/axolotl/amd){:target="_blank"}.
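
As a rough illustration, these build tasks follow the same pattern as `examples/deployment/vllm/amd/build.vllm-rocm.yaml`, shown later in this commit: build the wheel inside a ROCm image, then copy it to S3. The sketch below is an assumption-laden outline — the repository revision, build flags, and architecture variable are illustrative, not taken from the actual tasks; see the linked directory for the real configurations.

```yaml
type: task
name: build-flash-attn-rocm

# Same ROCm image as the Axolotl training task above
image: runpod/pytorch:2.1.2-py3.10-rocm6.0.2-ubuntu22.04

env:
  - AWS_ACCESS_KEY_ID
  - AWS_SECRET_ACCESS_KEY
  - AWS_REGION
  - BUCKET_NAME

commands:
  - apt-get update -y && apt-get install awscli -y
  - aws configure set aws_access_key_id $AWS_ACCESS_KEY_ID
  - aws configure set aws_secret_access_key $AWS_SECRET_ACCESS_KEY
  - aws configure set region $AWS_REGION
  - export PATH=/opt/conda/envs/py_3.10/bin:$PATH
  # Assumed architecture flag for MI300X; mirrors the bitsandbytes build above
  - export GPU_ARCHS="gfx942"
  - git clone https://github.com/ROCm/flash-attention
  - cd flash-attention
  - pip install wheel setuptools ninja packaging
  - python setup.py bdist_wheel -d dist/
  - cd dist
  - aws s3 cp "$(ls -1 | head -n 1)" s3://$BUCKET_NAME/ --acl public-read

spot_policy: auto

resources:
  gpu: MI300X
  disk: 150GB
```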

## Running a configuration

Once the configuration is ready, run `dstack apply -f <configuration file>`, and `dstack` will automatically provision the
cloud resources and run the configuration.

<div class="termy">

```shell
$ HUGGING_FACE_HUB_TOKEN=...
$ dstack apply -f examples/deployment/vllm/amd/service.dstack.yml
```

</div>

## Fleets

By default, `dstack apply` reuses `idle` instances from one of the existing [fleets](https://dstack.ai/docs/fleets).
If no `idle` instances meet the requirements, it creates a new fleet using one of the configured backends.

Use [fleets](https://dstack.ai/docs/fleets.md) configurations to create fleets manually. This reduces startup time for dev environments,
tasks, and services, and is very convenient if you want to reuse fleets across runs.
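
Below is a minimal sketch of a fleet configuration for MI300X instances; the name, node count, and disk size are illustrative and not part of this example's source:

```yaml
type: fleet
# Illustrative fleet name
name: amd-fleet

# Number of instances to provision
nodes: 1

resources:
  gpu: MI300X
  # Illustrative disk size
  disk: 200GB

spot_policy: auto
```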

## Dev environments

Dev environments allow you to run commands interactively.
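
For reference, the AMD dev environment configuration added by this commit (reproduced in full further below as `examples/deployment/vllm/amd/.dstack.yml`) boils down to:

```yaml
type: dev-environment
name: dev-vLLM-amd

# ROCm image with PyTorch preinstalled
image: runpod/pytorch:2.4.0-py3.10-rocm6.1.0-ubuntu22.04

env:
  - HUGGING_FACE_HUB_TOKEN

ide: vscode

resources:
  gpu: MI300X
  disk: 150GB

spot_policy: auto
```
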
## Source code

The source code of this example can be found in
[`examples/deployment/tgi/amd` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/deployment/tgi/amd){:target="_blank"},
[`examples/deployment/vllm/amd` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/deployment/vllm/amd){:target="_blank"},
[`examples/fine-tuning/axolotl/amd` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/fine-tuning/axolotl/amd){:target="_blank"} and
[`examples/fine-tuning/trl/amd` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/fine-tuning/trl/amd){:target="_blank"}.

## What's next?

1. Browse [TGI :material-arrow-top-right-thin:{ .external }](https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/deploy-your-model.html#serving-using-hugging-face-tgi),
[vLLM :material-arrow-top-right-thin:{ .external }](https://docs.vllm.ai/en/latest/getting_started/amd-installation.html#build-from-source-rocm),
[Axolotl :material-arrow-top-right-thin:{ .external }](https://github.com/ROCm/rocm-blogs/tree/release/blogs/artificial-intelligence/axolotl),
[TRL :material-arrow-top-right-thin:{ .external }](https://rocm.docs.amd.com/en/latest/how-to/llm-fine-tuning-optimization/fine-tuning-and-inference.html) and
[ROCm Bitsandbytes :material-arrow-top-right-thin:{ .external }](https://github.com/ROCm/bitsandbytes).
2. Check [dev environments](https://dstack.ai/docs/dev-environments), [tasks](https://dstack.ai/docs/tasks), and
[services](https://dstack.ai/docs/services).
1 change: 0 additions & 1 deletion examples/accelerators/tpu/README.md
Below are a few examples of using TPUs for deployment and fine-tuning.

## Deployment

### Running as a service
You can use any serving framework, such as vLLM or TGI. Here's an example of a [service](https://dstack.ai/docs/services) that deploys
Llama 3.1 8B using
[Optimum TPU :material-arrow-top-right-thin:{ .external }](https://github.com/huggingface/optimum-tpu){:target="_blank"}
15 changes: 15 additions & 0 deletions examples/deployment/vllm/amd/.dstack.yml
type: dev-environment
name: dev-vLLM-amd

image: runpod/pytorch:2.4.0-py3.10-rocm6.1.0-ubuntu22.04

env:
  - HUGGING_FACE_HUB_TOKEN

ide: vscode

resources:
  gpu: MI300X
  disk: 150GB

spot_policy: auto
46 changes: 46 additions & 0 deletions examples/deployment/vllm/amd/build.vllm-rocm.yaml
type: task
name: build-vllm-rocm

image: runpod/pytorch:2.4.0-py3.10-rocm6.1.0-ubuntu22.04

env:
  - HUGGING_FACE_HUB_TOKEN
  - AWS_ACCESS_KEY_ID
  - AWS_SECRET_ACCESS_KEY
  - AWS_REGION
  - BUCKET_NAME

commands:
  - apt-get update -y
  - apt-get install awscli -y
  - aws configure set aws_access_key_id $AWS_ACCESS_KEY_ID
  - aws configure set aws_secret_access_key $AWS_SECRET_ACCESS_KEY
  - aws configure set region $AWS_REGION
  - export PATH=/opt/conda/envs/py_3.10/bin:$PATH
  - wget https://github.com/ROCm/hipBLAS/archive/refs/tags/rocm-6.1.0.zip
  - unzip rocm-6.1.0.zip
  - cd hipBLAS-rocm-6.1.0
  - python rmake.py
  - cd ..
  - git clone https://github.com/vllm-project/vllm.git
  - cd vllm
  - pip install triton
  - pip uninstall torch -y
  - pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.1
  - pip install /opt/rocm/share/amd_smi
  - pip install --upgrade numba scipy huggingface-hub[cli]
  - pip install "numpy<2"
  - pip install -r requirements-rocm.txt
  - wget -N https://github.com/ROCm/vllm/raw/fa78403/rocm_patch/libamdhip64.so.6 -P /opt/rocm/lib
  - rm -f "$(python3 -c 'import torch; print(torch.__path__[0])')"/lib/libamdhip64.so*
  - export PYTORCH_ROCM_ARCH="gfx90a;gfx942"
  - pip install wheel setuptools setuptools_scm ninja
  - python setup.py bdist_wheel -d dist/
  - cd dist
  - aws s3 cp "$(ls -1 | head -n 1)" s3://$BUCKET_NAME/ --acl public-read

spot_policy: auto

resources:
  gpu: MI300X
  disk: 150GB
49 changes: 49 additions & 0 deletions examples/deployment/vllm/amd/service.dstack.yml
type: service
name: llama31-service-vllm-amd

image: runpod/pytorch:2.4.0-py3.10-rocm6.1.0-ubuntu22.04

env:
  - HUGGING_FACE_HUB_TOKEN
  - MODEL_ID=meta-llama/Meta-Llama-3.1-70B-Instruct
  - MAX_MODEL_LEN=126192

commands:
  - export PATH=/opt/conda/envs/py_3.10/bin:$PATH
  - wget https://github.com/ROCm/hipBLAS/archive/refs/tags/rocm-6.1.0.zip
  - unzip rocm-6.1.0.zip
  - cd hipBLAS-rocm-6.1.0
  - python rmake.py
  - cd ..
  - git clone https://github.com/vllm-project/vllm.git
  - cd vllm
  - pip install triton
  - pip uninstall torch -y
  - pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.1
  - pip install /opt/rocm/share/amd_smi
  - pip install --upgrade numba scipy huggingface-hub[cli]
  - pip install "numpy<2"
  - pip install -r requirements-rocm.txt
  - wget -N https://github.com/ROCm/vllm/raw/fa78403/rocm_patch/libamdhip64.so.6 -P /opt/rocm/lib
  - rm -f "$(python3 -c 'import torch; print(torch.__path__[0])')"/lib/libamdhip64.so*
  - export PYTORCH_ROCM_ARCH="gfx90a;gfx942"
  - wget https://dstack-binaries.s3.amazonaws.com/vllm-0.6.0%2Brocm614-cp310-cp310-linux_x86_64.whl
  - pip install vllm-0.6.0+rocm614-cp310-cp310-linux_x86_64.whl
  - vllm serve $MODEL_ID
    --max-model-len $MAX_MODEL_LEN
    --port 8000

# Expose the vllm server port
port: 8000

spot_policy: auto

resources:
  gpu: MI300X
  disk: 200GB

# (Optional) Enable the OpenAI-compatible endpoint
model:
  format: openai
  type: chat
  name: meta-llama/Meta-Llama-3.1-70B-Instruct