SDXL deployment example on inf2 (#538)
## Summary
- added `sdxl` deployment with tests
- updated the `torch_neuronx` import in `neuron/device.py` to be lazy so that dynamically set environment variables are picked up

## Related issues

<!-- For example: "Closes #1234" -->

## Checks

- [x] `make lint`: I've run `make lint` to lint the changes in this PR.
- [x] `make test`: I've made sure the tests (`make test-cpu` or `make
test`) are passing.
- Additional tests:
   - [ ] Benchmark tests (when contributing new models)
   - [ ] GPU/HW tests
spillai authored Feb 1, 2024
1 parent 901b83d commit 3319dd2
Showing 13 changed files with 255 additions and 18 deletions.
24 changes: 22 additions & 2 deletions README.md
@@ -16,7 +16,7 @@
**NOS (`torch-nos`)** is a fast and flexible Pytorch inference server, specifically designed for optimizing and running inference of popular foundational AI models.
<br>

## **Why use NOS?**
## 🛠️ **Why use NOS?**

- 👩‍💻 **Easy-to-use**: Built for [PyTorch](https://pytorch.org/) and designed to optimize, serve and auto-scale Pytorch models in production without compromising on developer experience.
- 🥷 **Flexible**: Run and serve several foundational AI models ([Stable Diffusion](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0), [CLIP](https://huggingface.co/openai/clip-vit-base-patch32), [Whisper](https://huggingface.co/openai/whisper-large-v2)) in a single place.
@@ -37,7 +37,27 @@
* **[Jan 2024]** ✍️ [blog] [Getting started with NOS tutorials](https://docs.nos.run/docs/blog/-getting-started-with-nos-tutorials.html) is available [here](./examples/tutorials/)!
* **[Dec 2023]** 🛝 [repo] We open-sourced the [NOS playground](https://github.com/autonomi-ai/nos-playground) to help you get started with more examples built on NOS!

## **What can NOS do?**
## 🚀 Quickstart

We highly recommend that you go through our [quickstart guide](https://docs.nos.run/docs/quickstart.html) to get started. To install the NOS client, run the following commands:

```bash
conda create -n nos python=3.8
conda activate nos
pip install torch-nos
```

Once the client is installed, you can start the NOS server via the NOS `serve` CLI. This will automatically detect your local environment, download the Docker runtime image, and spin up the NOS server:

```bash
nos serve up --http
```

You are now ready to run your first inference request with NOS! You can run any of the following commands to try things out.

*Note:* For the above quickstart to work out of the box, we expect the user to have [Docker](https://docs.docker.com/get-docker/), [Nvidia Docker](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker) and [Docker Compose](https://docs.docker.com/compose/install/) pre-installed on their machine. If you run into any issues, please visit our [quickstart](https://docs.nos.run/docs/quickstart.html) page or ping us on [Discord](https://discord.gg/QAGgvTuvgg).
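
For illustration, here is a minimal sketch of an inference call from the Python client, mirroring the client usage in the tests added later in this diff; the model ID and call arguments are assumptions, not part of the original quickstart:

```python
from nos.client import Client

# Connect to the locally running NOS server (default gRPC address).
client = Client("[::]:50051")
assert client.WaitForServer()

# Load a serving module and run a single inference request.
# The model ID and arguments below are assumptions for illustration.
model = client.Module("stabilityai/stable-diffusion-xl-base-1.0")
images = model(prompts="a photo of an astronaut riding a horse on mars", num_inference_steps=50)
```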

## 👩‍💻 **What can NOS do?**

### 💬 Chat / LLM Agents (ChatGPT-as-a-Service)
---
16 changes: 8 additions & 8 deletions docs/concepts/runtime-environments.md
@@ -2,10 +2,10 @@ The NOS inference server supports custom runtime environments through the use of

### ⚡️ NOS Inference Runtime

We use Docker to configure different worker configurations to run workloads in different runtime environments. The configured runtime environments are specified in the [InferenceServiceRuntime](../api/server.md#inferenceserviceruntime) class, which wraps the generic [`DockerRuntime`] class. For convenience, we have pre-built some runtime environments that can be used out-of-the-box: `cpu`, `gpu`, `trt-runtime`, etc.
We use Docker to configure different worker configurations to run workloads in different runtime environments. The configured runtime environments are specified in the [InferenceServiceRuntime](../api/server.md#inferenceserviceruntime) class, which wraps the generic [`DockerRuntime`] class. For convenience, we have pre-built some runtime environments that can be used out-of-the-box: `cpu`, `gpu`, `inf2`, etc.

This is the general flow of how the runtime environments are configured:
- Configure runtime environments including `cpu`, `gpu`, `trt-runtime`, etc. in the [`InferenceServiceRuntime`](../api/server.md#inferenceserviceruntime) `config` dictionary.
- Configure runtime environments including `cpu`, `gpu`, `inf2`, etc. in the [`InferenceServiceRuntime`](../api/server.md#inferenceserviceruntime) `config` dictionary.
- Start the server with the appropriate runtime environment via the `--runtime` flag, as shown in the example after this list.
- The ray cluster is now configured within the appropriate runtime environment and has access to the appropriate libraries and binaries.
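
For example, assuming the runtime names match the table below, the server can be started with an explicitly selected runtime instead of relying on auto-detection (a sketch using the `inf2` runtime and the HTTP gateway flag shown elsewhere in this PR):

```bash
# Start the NOS server with an explicitly selected runtime and the HTTP gateway enabled.
nos serve up --runtime inf2 --http
```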

@@ -15,12 +15,12 @@ For custom runtime support, we use [Ray](https://ray.io) to configure different

The following runtimes are supported by NOS:

| Status | Name | PyTorch | HW | Base | Description |
| - | --- | --- | --- | --- | --- |
|| [`autonomi/nos:latest-gpu`](https://hub.docker.com/r/autonomi/nos/tags) | [`2.0.1`](https://pypi.org/project/torch/2.0.1/) | CPU | `debian:buster-slim` | CPU-only runtime. |
|| [`autonomi/nos:latest-gpu`](https://hub.docker.com/r/autonomi/nos/tags) | [`2.0.1`](https://pypi.org/project/torch/2.0.1/) | NVIDIA GPU | `nvidia/cuda:11.8.0-base-ubuntu22.04` | GPU runtime. |
| **Coming Soon** | `trt` | [`2.0.1`](https://pypi.org/project/torch/2.0.1/) | NVIDIA GPU | `nvidia/cuda:11.7.0-base-ubuntu22.04` | GPU runtime with TensorRT (8.4.2.4). |
| **Coming Soon** | `inf2` | [`1.13.1`](https://pypi.org/project/torch/1.13.1/) | [AWS Inferentia2](https://aws.amazon.com/ec2/instance-types/inf2/) | `debian:buster-slim` | Inf2 runtime with [torch-neuronx](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/setup/pytorch-install.html). |
| Status | Name | PyTorch | HW | Base | Size | Description |
| - | --- | --- | --- | --- | --- | --- |
| ✅ | [`autonomi/nos:latest-cpu`](https://hub.docker.com/r/autonomi/nos/tags) | [`2.1.1`](https://pypi.org/project/torch/2.1.1/) | CPU | `debian:buster-slim` | 1.1 GB | CPU-only runtime. |
| ✅ | [`autonomi/nos:latest-gpu`](https://hub.docker.com/r/autonomi/nos/tags) | [`2.1.1`](https://pypi.org/project/torch/2.1.1/) | NVIDIA GPU | `nvidia/cuda:11.8.0-base-ubuntu22.04` | 3.9 GB | GPU runtime. |
| ✅ | [`autonomi/nos:latest-inf2`](https://hub.docker.com/r/autonomi/nos/tags) | [`1.13.1`](https://pypi.org/project/torch/1.13.1/) | [AWS Inferentia2](https://aws.amazon.com/ec2/instance-types/inf2/) | `debian:buster-slim` | 1.7 GB | Inf2 runtime with [torch-neuronx](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/setup/pytorch-install.html). |
| **Coming Soon** | `trt` | [`2.0.1`](https://pypi.org/project/torch/2.0.1/) | NVIDIA GPU | `nvidia/cuda:11.8.0-base-ubuntu22.04` | - | GPU runtime with TensorRT (8.4.2.4). |

### 🛠️ Adding a custom runtime

6 changes: 3 additions & 3 deletions docs/quickstart.md
@@ -57,14 +57,14 @@ You can start the nos server programmatically via either the CLI or SDK:

=== "Via CLI"

    You can start the nos server (in daemon mode) via the NOS `serve` CLI:
    You can start the nos server via the NOS `serve` CLI:
    ```bash
    nos serve up -d
    nos serve up
    ```

    Optionally, to use the REST API, you can start an HTTP gateway proxy alongside the gRPC server:
    ```bash
    nos serve up -d --http
    nos serve up --http
    ```

!!!note
@@ -19,7 +19,7 @@ setup: |
  sudo apt-get install -y docker-compose-plugin
  cd /app && python3 -m venv .venv && source .venv/bin/activate
  pip install git+https://github.com/spillai/nos.git pytest
  pip install git+https://github.com/autonomi-ai/nos.git pytest
run: |
  source /app/.venv/bin/activate
2 changes: 1 addition & 1 deletion examples/inf2/embeddings/tests/test_embeddings_inf2.py
@@ -1,7 +1,7 @@
import numpy as np


def test_embeddings():
def test_embeddings_inf2():
    from models.embeddings_inf2 import EmbeddingServiceInf2

    model = EmbeddingServiceInf2()
@@ -2,7 +2,7 @@


@pytest.mark.parametrize("model_id", ["BAAI/bge-small-en-v1.5"])
def test_embeddings_client(model_id):
def test_embeddings_inf2_client(model_id):
    import numpy as np

    from nos.client import Client
34 changes: 34 additions & 0 deletions examples/inf2/sdxl/README.md
@@ -0,0 +1,34 @@
## SDXL Service

Start the server via:
```bash
nos serve up -c serve.yaml --http
```

Optionally, you can provide the `inf2` runtime flag, but this is automatically inferred.

```bash
nos serve up -c serve.yaml --http --runtime inf2
```

### Run the tests

```bash
pytest -sv ./tests/test_sdxl_inf2_client.py
```

### Call the service

You can also call the service via the REST API directly:

```bash
curl \
  -X POST http://<service-ip>:8000/v1/infer \
  -H 'Content-Type: application/json' \
  -d '{
    "model_id": "stabilityai/stable-diffusion-xl-base-1.0-inf2",
    "inputs": {
      "prompts": ["a photo of an astronaut riding a horse on mars"],
      "height": 1024,
      "width": 1024
    }
  }'
```
26 changes: 26 additions & 0 deletions examples/inf2/sdxl/job-inf2-sdxl-deployment.yaml
@@ -0,0 +1,26 @@
# Usage: sky launch -c <cluster-name> job-inf2-sdxl-deployment.yaml
# image_id: ami-09c62125a680f0ead # us-east-2
# image_id: ami-0d4155c8606f16f5b # us-west-1
# image_id: ami-096319086cc3d5f23 # us-west-2

file_mounts:
  /app: .

resources:
  cloud: aws
  region: us-west-2
  instance_type: inf2.8xlarge
  image_id: ami-096319086cc3d5f23 # us-west-2
  disk_size: 256
  ports:
    - 8000

setup: |
  sudo apt-get install -y docker-compose-plugin
  cd /app && python3 -m venv .venv && source .venv/bin/activate
  pip install git+https://github.com/autonomi-ai/nos.git pytest
run: |
  source /app/.venv/bin/activate
  cd /app && NOS_LOGGING_LEVEL=DEBUG nos serve up -c serve.yaml --http
113 changes: 113 additions & 0 deletions examples/inf2/sdxl/models/sdxl_inf2.py
@@ -0,0 +1,113 @@
"""SDXL model accelerated with AWS Neuron (using optimum-neuron)."""
from dataclasses import dataclass, field, replace
from pathlib import Path
from typing import Any, Dict, List, Union

import torch
from PIL import Image

from nos.constants import NOS_CACHE_DIR
from nos.hub import HuggingFaceHubConfig
from nos.neuron.device import NeuronDevice


@dataclass(frozen=True)
class StableDiffusionInf2Config(HuggingFaceHubConfig):
"""SDXL model configuration for Inf2."""

batch_size: int = 1
"""Batch size for the model."""

image_height: int = 1024
"""Height of the image."""

image_width: int = 1024
"""Width of the image."""

compiler_args: Dict[str, Any] = field(
default_factory=lambda: {"auto_cast": "matmul", "auto_cast_type": "bf16"}, repr=False
)
"""Compiler arguments for the model."""

@property
def id(self) -> str:
"""Model ID."""
return f"{self.model_name}-bs-{self.batch_size}-{self.image_height}x{self.image_width}-{self.compiler_args.get('auto_cast_type', 'fp32')}"


class StableDiffusionXLInf2:
configs = {
"stabilityai/stable-diffusion-xl-base-1.0-inf2": StableDiffusionInf2Config(
model_name="stabilityai/stable-diffusion-xl-base-1.0",
),
}

def __init__(self, model_name: str = "stabilityai/stable-diffusion-xl-base-1.0-inf2"):
from nos.logging import logger

NeuronDevice.setup_environment()
try:
cfg = StableDiffusionXLInf2.configs[model_name]
except KeyError:
raise ValueError(f"Invalid model_name: {model_name}, available models: {self.configs.keys()}")
self.logger = logger
self.model = None
self.__load__(cfg)

def __load__(self, cfg: StableDiffusionInf2Config):
from optimum.neuron import NeuronStableDiffusionXLPipeline

if self.model is not None:
self.logger.debug(f"De-allocating existing model [cfg={self.cfg}, id={self.cfg.id}]")
del self.model
self.model = None
self.cfg = cfg

# Load model from cache if available, otherwise load from HF and compile
# (cache is specific to model_name, batch_size and sequence_length)
self.logger.debug(f"Loading model [cfg={self.cfg}, id={self.cfg.id}]")
cache_dir = NOS_CACHE_DIR / "neuron" / self.cfg.id
if Path(cache_dir).exists():
self.logger.debug(f"Loading model from {cache_dir}")
self.model = NeuronStableDiffusionXLPipeline.from_pretrained(str(cache_dir))
self.logger.debug(f"Loaded model from {cache_dir}")
else:
input_shapes = {
"batch_size": self.cfg.batch_size,
"height": self.cfg.image_height,
"width": self.cfg.image_width,
}
self.model = NeuronStableDiffusionXLPipeline.from_pretrained(
self.cfg.model_name, export=True, **self.cfg.compiler_args, **input_shapes
)
self.model.save_pretrained(str(cache_dir))
self.logger.debug(f"Saved model to {cache_dir}")
self.logger.debug(f"Loaded neuron model [id={self.cfg.id}]")

@torch.inference_mode()
def __call__(
self,
prompts: Union[str, List[str]],
num_images: int = 1,
num_inference_steps: int = 50,
guidance_scale: float = 7.5,
height: int = 512,
width: int = 512,
) -> List[Image.Image]:
"""Generate images from text prompt."""

if isinstance(prompts, str):
prompts = [prompts]
if isinstance(prompts, list) and len(prompts) != 1:
raise ValueError(f"Invalid number of prompts: {len(prompts)}, expected: 1")
if height != self.cfg.image_height or width != self.cfg.image_width:
cfg = replace(self.cfg, image_height=height, image_width=width)
self.logger.debug(f"Re-loading model [cfg={cfg}, id={cfg.id}, prev_id={self.cfg.id}]")
self.__load__(cfg)
assert self.model is not None
return self.model(
prompts,
num_images_per_prompt=num_images,
num_inference_steps=num_inference_steps,
guidance_scale=guidance_scale,
).images
14 changes: 14 additions & 0 deletions examples/inf2/sdxl/serve.yaml
@@ -0,0 +1,14 @@
images:
  custom-inf2:
    base: autonomi/nos:latest-inf2
    env:
      NOS_LOGGING_LEVEL: DEBUG
      NOS_NEURON_CORES: 2
      NEURON_RT_VISIBLE_CORES: 2

models:
  stabilityai/stable-diffusion-xl-base-1.0-inf2:
    model_cls: StableDiffusionXLInf2
    model_path: models/sdxl_inf2.py
    default_method: __call__
    runtime_env: custom-inf2
9 changes: 9 additions & 0 deletions examples/inf2/sdxl/tests/test_sdxl_inf2.py
@@ -0,0 +1,9 @@
def test_sdxl_inf2():
    from models.sdxl_inf2 import StableDiffusionXLInf2
    from PIL import Image

    model = StableDiffusionXLInf2()
    prompts = "a photo of an astronaut riding a horse on mars"
    response = model(prompts=prompts, height=1024, width=1024, num_inference_steps=50)
    assert response is not None
    assert isinstance(response[0], Image.Image)
21 changes: 21 additions & 0 deletions examples/inf2/sdxl/tests/test_sdxl_inf2_client.py
@@ -0,0 +1,21 @@
import pytest


@pytest.mark.parametrize("model_id", ["stabilityai/stable-diffusion-xl-base-1.0-inf2"])
def test_sdxl_inf2_client(model_id):
    from PIL import Image

    from nos.client import Client

    # Create a client
    client = Client("[::]:50051")
    assert client.WaitForServer()

    # Load the SDXL model
    model = client.Module(model_id)

    # Run inference
    prompts = "a photo of an astronaut riding a horse on mars"
    response = model(prompts=prompts, height=1024, width=1024, num_inference_steps=50)
    assert response is not None
    assert isinstance(response[0], Image.Image)
4 changes: 2 additions & 2 deletions nos/neuron/device.py
@@ -1,8 +1,6 @@
import os
from dataclasses import dataclass

import torch_neuronx

from nos.constants import NOS_CACHE_DIR
from nos.logging import logger

@@ -21,6 +19,8 @@ def get(cls):

    @staticmethod
    def device_count() -> int:
        # Imported lazily so that Neuron environment variables set at runtime are respected.
        import torch_neuronx

        try:
            return torch_neuronx.xla_impl.data_parallel.device_count()
        except (RuntimeError, AssertionError):
