SDXL deployment example on inf2 (#538)
## Summary
- added `sdxl` deployment with tests
- updated the `torch_neuronx` import in `neuron/device.py` to be lazy so that dynamically set environment variables are picked up

## Related issues

<!-- For example: "Closes #1234" -->

## Checks

- [x] `make lint`: I've run `make lint` to lint the changes in this PR.
- [x] `make test`: I've made sure the tests (`make test-cpu` or `make
test`) are passing.
- Additional tests:
   - [ ] Benchmark tests (when contributing new models)
   - [ ] GPU/HW tests
spillai authored Feb 1, 2024
1 parent 901b83d commit 3319dd2
Showing 13 changed files with 255 additions and 18 deletions.
24 changes: 22 additions & 2 deletions README.md
@@ -16,7 +16,7 @@
**NOS (`torch-nos`)** is a fast and flexible Pytorch inference server, specifically designed for optimizing and running inference of popular foundational AI models.
<br>

## **Why use NOS?**
## 🛠️ **Why use NOS?**

- 👩‍💻 **Easy-to-use**: Built for [PyTorch](https://pytorch.org/) and designed to optimize, serve and auto-scale Pytorch models in production without compromising on developer experience.
- 🥷 **Flexible**: Run and serve several foundational AI models ([Stable Diffusion](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0), [CLIP](https://huggingface.co/openai/clip-vit-base-patch32), [Whisper](https://huggingface.co/openai/whisper-large-v2)) in a single place.
@@ -37,7 +37,27 @@
* **[Jan 2024]** ✍️ [blog] [Getting started with NOS tutorials](https://docs.nos.run/docs/blog/-getting-started-with-nos-tutorials.html) is available [here](./examples/tutorials/)!
* **[Dec 2023]** 🛝 [repo] We open-sourced the [NOS playground](https://github.com/autonomi-ai/nos-playground) to help you get started with more examples built on NOS!

## **What can NOS do?**
## 🚀 Quickstart

We highly recommend that you go through our [quickstart guide](https://docs.nos.run/docs/quickstart.html) to get started. To install the NOS client, run the following commands:

```bash
conda create -n nos python=3.8
conda activate nos
pip install torch-nos
```

Once the client is installed, you can start the NOS server via the NOS `serve` CLI. This will automatically detect your local environment, download the Docker runtime image, and spin up the NOS server:

```bash
nos serve up --http
```

You are now ready to run your first inference request with NOS! You can run any of the following commands to try things out.

*Note:* For the above quickstart to work out of the box, we expect the user to have [Docker](https://docs.docker.com/get-docker/), [Nvidia Docker](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker) and [Docker Compose](https://docs.docker.com/compose/install/) pre-installed on their machine. If you run into any issues, please visit our [quickstart](https://docs.nos.run/docs/quickstart.html) page or ping us on [Discord](https://discord.gg/QAGgvTuvgg).
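
For illustration, here is a minimal sketch of an inference call from the Python client, mirroring the client usage in the tests added later in this diff; the model ID and call arguments are assumptions, not part of the original quickstart:

```python
from nos.client import Client

# Connect to the locally running NOS server (default gRPC address).
client = Client("[::]:50051")
assert client.WaitForServer()

# Load a serving module and run a single inference request.
# The model ID and arguments below are assumptions for illustration.
model = client.Module("stabilityai/stable-diffusion-xl-base-1.0")
images = model(prompts="a photo of an astronaut riding a horse on mars", num_inference_steps=50)
```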

## 👩‍💻 **What can NOS do?**

### 💬 Chat / LLM Agents (ChatGPT-as-a-Service)
---
16 changes: 8 additions & 8 deletions docs/concepts/runtime-environments.md
@@ -2,10 +2,10 @@ The NOS inference server supports custom runtime environments through the use of

### ⚡️ NOS Inference Runtime

We use Docker to configure different worker configurations to run workloads in different runtime environments. The configured runtime environments are specified in the [InferenceServiceRuntime](../api/server.md#inferenceserviceruntime) class, which wraps the generic [`DockerRuntime`] class. For convenience, we have pre-built some runtime environments that can be used out-of-the-box: `cpu`, `gpu`, `trt-runtime`, etc.
We use Docker to configure different worker configurations to run workloads in different runtime environments. The configured runtime environments are specified in the [InferenceServiceRuntime](../api/server.md#inferenceserviceruntime) class, which wraps the generic [`DockerRuntime`] class. For convenience, we have pre-built some runtime environments that can be used out-of-the-box: `cpu`, `gpu`, `inf2`, etc.

This is the general flow of how the runtime environments are configured:
- Configure runtime environments including `cpu`, `gpu`, `trt-runtime`, etc. in the [`InferenceServiceRuntime`](../api/server.md#inferenceserviceruntime) `config` dictionary.
- Configure runtime environments including `cpu`, `gpu`, `inf2`, etc. in the [`InferenceServiceRuntime`](../api/server.md#inferenceserviceruntime) `config` dictionary.
- Start the server with the appropriate runtime environment via the `--runtime` flag, as shown in the example after this list.
- The ray cluster is now configured within the appropriate runtime environment and has access to the appropriate libraries and binaries.
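
For example, assuming the runtime names match the table below, the server can be started with an explicitly selected runtime instead of relying on auto-detection (a sketch using the `inf2` runtime and the HTTP gateway flag shown elsewhere in this PR):

```bash
# Start the NOS server with an explicitly selected runtime and the HTTP gateway enabled.
nos serve up --runtime inf2 --http
```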

@@ -15,12 +15,12 @@ For custom runtime support, we use [Ray](https://ray.io) to configure different

The following runtimes are supported by NOS:

| Status | Name | PyTorch | HW | Base | Description |
| - | --- | --- | --- | --- | --- |
|| [`autonomi/nos:latest-gpu`](https://hub.docker.com/r/autonomi/nos/tags) | [`2.0.1`](https://pypi.org/project/torch/2.0.1/) | CPU | `debian:buster-slim` | CPU-only runtime. |
|| [`autonomi/nos:latest-gpu`](https://hub.docker.com/r/autonomi/nos/tags) | [`2.0.1`](https://pypi.org/project/torch/2.0.1/) | NVIDIA GPU | `nvidia/cuda:11.8.0-base-ubuntu22.04` | GPU runtime. |
| **Coming Soon** | `trt` | [`2.0.1`](https://pypi.org/project/torch/2.0.1/) | NVIDIA GPU | `nvidia/cuda:11.7.0-base-ubuntu22.04` | GPU runtime with TensorRT (8.4.2.4). |
| **Coming Soon** | `inf2` | [`1.13.1`](https://pypi.org/project/torch/1.13.1/) | [AWS Inferentia2](https://aws.amazon.com/ec2/instance-types/inf2/) | `debian:buster-slim` | Inf2 runtime with [torch-neuronx](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/setup/pytorch-install.html). |
| Status | Name | PyTorch | HW | Base | Size | Description |
| - | --- | --- | --- | --- | --- | --- |
| ✅ | [`autonomi/nos:latest-cpu`](https://hub.docker.com/r/autonomi/nos/tags) | [`2.1.1`](https://pypi.org/project/torch/2.1.1/) | CPU | `debian:buster-slim` | 1.1 GB | CPU-only runtime. |
| ✅ | [`autonomi/nos:latest-gpu`](https://hub.docker.com/r/autonomi/nos/tags) | [`2.1.1`](https://pypi.org/project/torch/2.1.1/) | NVIDIA GPU | `nvidia/cuda:11.8.0-base-ubuntu22.04` | 3.9 GB | GPU runtime. |
| ✅ | [`autonomi/nos:latest-inf2`](https://hub.docker.com/r/autonomi/nos/tags) | [`1.13.1`](https://pypi.org/project/torch/1.13.1/) | [AWS Inferentia2](https://aws.amazon.com/ec2/instance-types/inf2/) | `debian:buster-slim` | 1.7 GB | Inf2 runtime with [torch-neuronx](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/setup/pytorch-install.html). |
| **Coming Soon** | `trt` | [`2.0.1`](https://pypi.org/project/torch/2.0.1/) | NVIDIA GPU | `nvidia/cuda:11.8.0-base-ubuntu22.04` | - | GPU runtime with TensorRT (8.4.2.4). |

### 🛠️ Adding a custom runtime

6 changes: 3 additions & 3 deletions docs/quickstart.md
@@ -57,14 +57,14 @@ You can start the nos server programmatically via either the CLI or SDK:

=== "Via CLI"

    You can start the nos server (in daemon mode) via the NOS `serve` CLI:
    You can start the nos server via the NOS `serve` CLI:
    ```bash
    nos serve up -d
    nos serve up
    ```

    Optionally, to use the REST API, you can start an HTTP gateway proxy alongside the gRPC server:
    ```bash
    nos serve up -d --http
    nos serve up --http
    ```

!!!note
@@ -19,7 +19,7 @@ setup: |
  sudo apt-get install -y docker-compose-plugin
  cd /app && python3 -m venv .venv && source .venv/bin/activate
  pip install git+https://github.com/spillai/nos.git pytest
  pip install git+https://github.com/autonomi-ai/nos.git pytest
run: |
  source /app/.venv/bin/activate
2 changes: 1 addition & 1 deletion examples/inf2/embeddings/tests/test_embeddings_inf2.py
@@ -1,7 +1,7 @@
import numpy as np


def test_embeddings():
def test_embeddings_inf2():
    from models.embeddings_inf2 import EmbeddingServiceInf2

    model = EmbeddingServiceInf2()
@@ -2,7 +2,7 @@


@pytest.mark.parametrize("model_id", ["BAAI/bge-small-en-v1.5"])
def test_embeddings_client(model_id):
def test_embeddings_inf2_client(model_id):
    import numpy as np

    from nos.client import Client
34 changes: 34 additions & 0 deletions examples/inf2/sdxl/README.md
@@ -0,0 +1,34 @@
## SDXL Service

Start the server via:
```bash
nos serve up -c serve.yaml --http
```

Optionally, you can provide the `inf2` runtime flag, but this is automatically inferred.

```bash
nos serve up -c serve.yaml --http --runtime inf2
```

### Run the tests

```bash
pytest -sv ./tests/test_sdxl_inf2_client.py
```

### Call the service

You can also call the service via the REST API directly:

```bash
curl \
  -X POST http://<service-ip>:8000/v1/infer \
  -H 'Content-Type: application/json' \
  -d '{
    "model_id": "stabilityai/stable-diffusion-xl-base-1.0-inf2",
    "inputs": {
      "prompts": ["a photo of an astronaut riding a horse on mars"],
      "height": 1024,
      "width": 1024
    }
  }'
```
26 changes: 26 additions & 0 deletions examples/inf2/sdxl/job-inf2-sdxl-deployment.yaml
@@ -0,0 +1,26 @@
# Usage: sky launch -c <cluster-name> job-inf2-sdxl-deployment.yaml
# image_id: ami-09c62125a680f0ead # us-east-2
# image_id: ami-0d4155c8606f16f5b # us-west-1
# image_id: ami-096319086cc3d5f23 # us-west-2

file_mounts:
  /app: .

resources:
  cloud: aws
  region: us-west-2
  instance_type: inf2.8xlarge
  image_id: ami-096319086cc3d5f23 # us-west-2
  disk_size: 256
  ports:
    - 8000

setup: |
  sudo apt-get install -y docker-compose-plugin
  cd /app && python3 -m venv .venv && source .venv/bin/activate
  pip install git+https://github.com/autonomi-ai/nos.git pytest
run: |
  source /app/.venv/bin/activate
  cd /app && NOS_LOGGING_LEVEL=DEBUG nos serve up -c serve.yaml --http
113 changes: 113 additions & 0 deletions examples/inf2/sdxl/models/sdxl_inf2.py
@@ -0,0 +1,113 @@
"""SDXL model accelerated with AWS Neuron (using optimum-neuron)."""
from dataclasses import dataclass, field, replace
from pathlib import Path
from typing import Any, Dict, List, Union

import torch
from PIL import Image

from nos.constants import NOS_CACHE_DIR
from nos.hub import HuggingFaceHubConfig
from nos.neuron.device import NeuronDevice


@dataclass(frozen=True)
class StableDiffusionInf2Config(HuggingFaceHubConfig):
"""SDXL model configuration for Inf2."""

batch_size: int = 1
"""Batch size for the model."""

image_height: int = 1024
"""Height of the image."""

image_width: int = 1024
"""Width of the image."""

compiler_args: Dict[str, Any] = field(
default_factory=lambda: {"auto_cast": "matmul", "auto_cast_type": "bf16"}, repr=False
)
"""Compiler arguments for the model."""

@property
def id(self) -> str:
"""Model ID."""
return f"{self.model_name}-bs-{self.batch_size}-{self.image_height}x{self.image_width}-{self.compiler_args.get('auto_cast_type', 'fp32')}"


class StableDiffusionXLInf2:
configs = {
"stabilityai/stable-diffusion-xl-base-1.0-inf2": StableDiffusionInf2Config(
model_name="stabilityai/stable-diffusion-xl-base-1.0",
),
}

def __init__(self, model_name: str = "stabilityai/stable-diffusion-xl-base-1.0-inf2"):
from nos.logging import logger

NeuronDevice.setup_environment()
try:
cfg = StableDiffusionXLInf2.configs[model_name]
except KeyError:
raise ValueError(f"Invalid model_name: {model_name}, available models: {self.configs.keys()}")
self.logger = logger
self.model = None
self.__load__(cfg)

def __load__(self, cfg: StableDiffusionInf2Config):
from optimum.neuron import NeuronStableDiffusionXLPipeline

if self.model is not None:
self.logger.debug(f"De-allocating existing model [cfg={self.cfg}, id={self.cfg.id}]")
del self.model
self.model = None
self.cfg = cfg

# Load model from cache if available, otherwise load from HF and compile
# (cache is specific to model_name, batch_size and sequence_length)
self.logger.debug(f"Loading model [cfg={self.cfg}, id={self.cfg.id}]")
cache_dir = NOS_CACHE_DIR / "neuron" / self.cfg.id
if Path(cache_dir).exists():
self.logger.debug(f"Loading model from {cache_dir}")
self.model = NeuronStableDiffusionXLPipeline.from_pretrained(str(cache_dir))
self.logger.debug(f"Loaded model from {cache_dir}")
else:
input_shapes = {
"batch_size": self.cfg.batch_size,
"height": self.cfg.image_height,
"width": self.cfg.image_width,
}
self.model = NeuronStableDiffusionXLPipeline.from_pretrained(
self.cfg.model_name, export=True, **self.cfg.compiler_args, **input_shapes
)
self.model.save_pretrained(str(cache_dir))
self.logger.debug(f"Saved model to {cache_dir}")
self.logger.debug(f"Loaded neuron model [id={self.cfg.id}]")

@torch.inference_mode()
def __call__(
self,
prompts: Union[str, List[str]],
num_images: int = 1,
num_inference_steps: int = 50,
guidance_scale: float = 7.5,
height: int = 512,
width: int = 512,
) -> List[Image.Image]:
"""Generate images from text prompt."""

if isinstance(prompts, str):
prompts = [prompts]
if isinstance(prompts, list) and len(prompts) != 1:
raise ValueError(f"Invalid number of prompts: {len(prompts)}, expected: 1")
if height != self.cfg.image_height or width != self.cfg.image_width:
cfg = replace(self.cfg, image_height=height, image_width=width)
self.logger.debug(f"Re-loading model [cfg={cfg}, id={cfg.id}, prev_id={self.cfg.id}]")
self.__load__(cfg)
assert self.model is not None
return self.model(
prompts,
num_images_per_prompt=num_images,
num_inference_steps=num_inference_steps,
guidance_scale=guidance_scale,
).images
14 changes: 14 additions & 0 deletions examples/inf2/sdxl/serve.yaml
@@ -0,0 +1,14 @@
images:
  custom-inf2:
    base: autonomi/nos:latest-inf2
    env:
      NOS_LOGGING_LEVEL: DEBUG
      NOS_NEURON_CORES: 2
      NEURON_RT_VISIBLE_CORES: 2

models:
  stabilityai/stable-diffusion-xl-base-1.0-inf2:
    model_cls: StableDiffusionXLInf2
    model_path: models/sdxl_inf2.py
    default_method: __call__
    runtime_env: custom-inf2
9 changes: 9 additions & 0 deletions examples/inf2/sdxl/tests/test_sdxl_inf2.py
@@ -0,0 +1,9 @@
def test_sdxl_inf2():
    from models.sdxl_inf2 import StableDiffusionXLInf2
    from PIL import Image

    model = StableDiffusionXLInf2()
    prompts = "a photo of an astronaut riding a horse on mars"
    response = model(prompts=prompts, height=1024, width=1024, num_inference_steps=50)
    assert response is not None
    assert isinstance(response[0], Image.Image)
21 changes: 21 additions & 0 deletions examples/inf2/sdxl/tests/test_sdxl_inf2_client.py
@@ -0,0 +1,21 @@
import pytest


@pytest.mark.parametrize("model_id", ["stabilityai/stable-diffusion-xl-base-1.0-inf2"])
def test_sdxl_inf2_client(model_id):
    from PIL import Image

    from nos.client import Client

    # Create a client
    client = Client("[::]:50051")
    assert client.WaitForServer()

    # Load the SDXL model
    model = client.Module(model_id)

    # Run inference
    prompts = "a photo of an astronaut riding a horse on mars"
    response = model(prompts=prompts, height=1024, width=1024, num_inference_steps=50)
    assert response is not None
    assert isinstance(response[0], Image.Image)
4 changes: 2 additions & 2 deletions nos/neuron/device.py
@@ -1,8 +1,6 @@
import os
from dataclasses import dataclass

import torch_neuronx

from nos.constants import NOS_CACHE_DIR
from nos.logging import logger

@@ -21,6 +19,8 @@ def get(cls):

    @staticmethod
    def device_count() -> int:
        # Imported lazily so that Neuron environment variables set at runtime are respected.
        import torch_neuronx

        try:
            return torch_neuronx.xla_impl.data_parallel.device_count()
        except (RuntimeError, AssertionError):
