
Commit

Update main user guide in preparation for SHARK 3.1 release (#751)
Kept changes as minor as possible, since I didn't want to create new
content but rather string together existing content. The main change,
beyond adding links to the Llama documentation, is moving the SDXL
quickstart into its user guide while creating a quick organizational
hierarchy for the Llama 3.1 docs. Ideally those will move into the Llama
3.1 user docs in the next release.

I've handwaved the documentation for getting the Llama 3.1 70B model working,
given it's an advanced topic that requires a user to be familiar with
both the Hugging Face CLI and llama.cpp.
amd-chrissosa authored Jan 4, 2025
1 parent d42cc29 commit 5f08cb2
Showing 3 changed files with 66 additions and 59 deletions.
8 changes: 4 additions & 4 deletions README.md
@@ -61,10 +61,10 @@ optimal parameter configurations to use during model compilation.

### Models

Model name | Model recipes | Serving apps | Guide
---------- | ------------- | ------------ | -----
SDXL | [`sharktank/sharktank/models/punet/`](https://github.com/nod-ai/shark-ai/tree/main/sharktank/sharktank/models/punet) | [`shortfin/python/shortfin_apps/sd/`](https://github.com/nod-ai/shark-ai/tree/main/shortfin/python/shortfin_apps/sd) | [shortfin/python/shortfin_apps/sd/README.md](shortfin/python/shortfin_apps/sd/README.md)
llama | [`sharktank/sharktank/models/llama/`](https://github.com/nod-ai/shark-ai/tree/main/sharktank/sharktank/models/llama) | [`shortfin/python/shortfin_apps/llm/`](https://github.com/nod-ai/shark-ai/tree/main/shortfin/python/shortfin_apps/llm) | [docs/shortfin/llm/user/llama_serving.md](docs/shortfin/llm/user/llama_serving.md)

## SHARK Developers

67 changes: 14 additions & 53 deletions docs/user_guide.md
@@ -2,6 +2,9 @@

These instructions cover the usage of the latest stable release of SHARK. For a more bleeding-edge build, please install the [nightly releases](nightly_releases.md).

> [!TIP]
> While we prepare the next stable release, please use the [nightly releases](nightly_releases.md).

## Prerequisites

Our current user guide requires that you have:
@@ -64,61 +67,19 @@ pip install shark-ai[apps]
python -m shortfin_apps.sd.server --help
```
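As an optional sanity check after installation, the commands below confirm that the package is installed and show the LLM serving app's flags as well. This is only a sketch: the `shortfin_apps.llm.server` module name is assumed here to mirror the SD app above.

```
# Optional sanity check (assumptions noted above).
pip show shark-ai                           # confirm the shark-ai package and its version
python -m shortfin_apps.llm.server --help   # LLM serving app; module name assumed to mirror the SD app
```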

## Getting started

As part of the current release we support serving [SDXL](https://stablediffusionxl.com/) and [Llama 3.1](https://ai.meta.com/blog/meta-llama-3-1/) variants, along with an initial release of `sharktank`, SHARK's model development toolkit, which is used to compile these models for high performance.

### SDXL

To get started with SDXL, please follow the [SDXL User Guide](../shortfin/python/shortfin_apps/sd/README.md#Start-SDXL-Server).

### Llama 3.1

To get started with Llama 3.1, please follow the [Llama User Guide](shortfin/llm/user/llama_serving.md).

* Once you've set up the Llama server per the guide above, we recommend using the [SGLang Frontend](https://sgl-project.github.io/frontend/frontend.html) by following the [Using `shortfin` with `sglang`](shortfin/llm/user/shortfin_with_sglang_frontend_language.md) guide.
* If you would like to deploy Llama on a Kubernetes cluster, we also provide a simple set of instructions and a deployment configuration to do so [here](shortfin/llm/user/llama_serving_on_kubernetes.md).
* Finally, if you'd like to run the instructions above against a different variant of Llama 3.1, that is supported; however, you will need to generate a GGUF dataset for that variant. To do so, use [Hugging Face](https://huggingface.co/)'s [`huggingface-cli`](https://huggingface.co/docs/huggingface_hub/en/guides/cli) together with [llama.cpp](https://github.com/ggerganov/llama.cpp)'s `convert_hf_to_gguf.py` (a rough sketch follows this list). In future releases, we plan to streamline these instructions to make it easier for users to compile their own models from Hugging Face.
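A minimal sketch of that conversion flow is below. The model repo id, local directories, and output precision are illustrative placeholders rather than values from this guide, and gated Llama checkpoints require accepting the license and running `huggingface-cli login` first.

```
# Download a Llama 3.1 variant from Hugging Face (placeholder repo id and directory).
huggingface-cli download meta-llama/Llama-3.1-8B-Instruct --local-dir ./llama-3.1-8b-instruct

# Convert the downloaded checkpoint to GGUF with llama.cpp's converter
# (run from a clone of llama.cpp; f16 output is just one common choice).
python convert_hf_to_gguf.py ./llama-3.1-8b-instruct \
  --outfile llama-3.1-8b-instruct-f16.gguf --outtype f16
```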
50 changes: 48 additions & 2 deletions shortfin/python/shortfin_apps/sd/README.md
@@ -22,9 +22,55 @@ python -m shortfin_apps.sd.server --device=amdgpu --device_ids=0 --build_prefere
INFO - Application startup complete.
INFO - Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
```
### Run the SDXL Client

- Run a CLI client in a separate shell:
```
python -m shortfin_apps.sd.simple_client --interactive
```

Congratulations! At this point you can experiment with the server and client to suit your use case.

### Note: Server implementation scope

The SDXL server's implementation does not account for extremely large client batches. Normally, for heavy workloads, services would be composed under a load balancer to ensure each service is fed with requests optimally. For most cases outside of large-scale deployments, the server's internal batching/load balancing is sufficient.

### Configuration flags

Please see the `--help` output for both the server and the client for full usage instructions. Here's a quick snapshot.

#### Server options:

| Flag | Options | Description |
|---|---|---|
| `--host HOST` | | |
| `--port PORT` | | server port |
| `--root-path ROOT_PATH` | | |
| `--timeout-keep-alive` | | |
| `--device` | `local-task`, `hip`, `amdgpu` | only `amdgpu` is supported in this release |
| `--target` | `gfx942`, `gfx1100` | only `gfx942` is supported in this release |
| `--device_ids` | | |
| `--tokenizers` | | |
| `--model_config` | | |
| `--workers_per_device` | | |
| `--fibers_per_device` | | |
| `--isolation` | `per_fiber`, `per_call`, `none` | |
| `--show_progress` | | |
| `--trace_execution` | | |
| `--amdgpu_async_allocations` | | |
| `--splat` | | |
| `--build_preference` | `compile`, `precompiled` | |
| `--compile_flags` | | |
| `--flagfile FLAGFILE` | | |
| `--artifacts_dir ARTIFACTS_DIR` | | where to store cached artifacts from the cloud |
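As an illustration, a typical launch might combine several of these flags as follows. This is a sketch only: the artifacts directory is an arbitrary example path, and flag defaults may make some of these redundant.

```
# Example: serve SDXL on an AMD GPU (gfx942) using precompiled artifacts cached locally.
python -m shortfin_apps.sd.server --device=amdgpu --target=gfx942 --device_ids=0 \
  --build_preference=precompiled --artifacts_dir=/tmp/shark_artifacts --port=8000
```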

#### Client options:

| Flag | Options | Description |
|---|---|---|
| `--file` | | |
| `--reps` | | |
| `--save` | | whether to save images generated by the server |
| `--outputdir` | | output directory for images generated by SDXL |
| `--steps` | | |
| `--interactive` | | |
| `--port` | | port to interact with the server |
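For example, a non-interactive run driven by a request file might look like the following. This is a sketch only: the request file and output directory are placeholder paths, and `--save` is assumed to behave as a simple on/off switch.

```
# Example: replay a request file twice and save the generated images locally (placeholder paths).
python -m shortfin_apps.sd.simple_client --file ./sdxl_request.json --reps 2 \
  --save --outputdir ./generated_images --port 8000
```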
