From 5f08cb2354c928476ace3cfdcda3b835dd13cd61 Mon Sep 17 00:00:00 2001
From: Chris Sosa
Date: Fri, 3 Jan 2025 18:10:29 -0800
Subject: [PATCH] Update main user guide in preparation for SHARK 3.1 release
 (#751)

Kept the changes as minor as possible, as I didn't want to create new content but rather just string together existing content. The main change, outside of adding links to the Llama documentation, is to move the SDXL quickstart into its own user guide while creating a quick organizational hierarchy for the Llama 3.1 docs. Ideally those will move into the Llama 3.1 user docs in the next release.

I've handwaved the documentation for getting the Llama 3.1 70B model working, given it's an advanced topic that requires a user to be familiar with both the Hugging Face CLI and llama.cpp.

---
 README.md                                  |  8 +--
 docs/user_guide.md                         | 67 +++++-----------------
 shortfin/python/shortfin_apps/sd/README.md | 50 +++++++++++++++-
 3 files changed, 66 insertions(+), 59 deletions(-)

diff --git a/README.md b/README.md
index ae3eac423..ccf0f72cd 100644
--- a/README.md
+++ b/README.md
@@ -61,10 +61,10 @@ optimal parameter configurations to use during model compilation.
 
 ### Models
 
-Model name | Model recipes | Serving apps
----------- | ------------- | ------------
-SDXL | [`sharktank/sharktank/models/punet/`](https://github.com/nod-ai/shark-ai/tree/main/sharktank/sharktank/models/punet) | [`shortfin/python/shortfin_apps/sd/`](https://github.com/nod-ai/shark-ai/tree/main/shortfin/python/shortfin_apps/sd)
-llama | [`sharktank/sharktank/models/llama/`](https://github.com/nod-ai/shark-ai/tree/main/sharktank/sharktank/models/llama) | [`shortfin/python/shortfin_apps/llm/`](https://github.com/nod-ai/shark-ai/tree/main/shortfin/python/shortfin_apps/llm)
+Model name | Model recipes | Serving apps | Guide |
+---------- | ------------- | ------------ | ----- |
+SDXL | [`sharktank/sharktank/models/punet/`](https://github.com/nod-ai/shark-ai/tree/main/sharktank/sharktank/models/punet) | [`shortfin/python/shortfin_apps/sd/`](https://github.com/nod-ai/shark-ai/tree/main/shortfin/python/shortfin_apps/sd) | [shortfin/python/shortfin_apps/sd/README.md](shortfin/python/shortfin_apps/sd/README.md) |
+llama | [`sharktank/sharktank/models/llama/`](https://github.com/nod-ai/shark-ai/tree/main/sharktank/sharktank/models/llama) | [`shortfin/python/shortfin_apps/llm/`](https://github.com/nod-ai/shark-ai/tree/main/shortfin/python/shortfin_apps/llm) | [docs/shortfin/llm/user/llama_serving.md](docs/shortfin/llm/user/llama_serving.md) |
 
 ## SHARK Developers
 
diff --git a/docs/user_guide.md b/docs/user_guide.md
index c4c3fdb58..a53e5af01 100644
--- a/docs/user_guide.md
+++ b/docs/user_guide.md
@@ -2,6 +2,9 @@
 
 These instructions cover the usage of the latest stable release of SHARK. For a more bleeding edge release please install the [nightly releases](nightly_releases.md).
 
+> [!TIP]
+> While we prepare the next stable release, please use the [nightly releases](nightly_releases.md) instead.
+
 ## Prerequisites
 
 Our current user guide requires that you have:
@@ -64,61 +67,19 @@ pip install shark-ai[apps]
 python -m shortfin_apps.sd.server --help
 ```
 
-## Quickstart
+## Getting started
 
-### Run the SDXL Server
+As part of our current release we support serving [SDXL](https://stablediffusionxl.com/) and [Llama 3.1](https://ai.meta.com/blog/meta-llama-3-1/) variants, along with an initial release of `sharktank`, SHARK's model development toolkit, which is used to compile these models for high performance.
 
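+If you just want to see the SDXL path end to end, the flow looks roughly like the sketch below (illustrative only; the SDXL guide linked in the next section covers the full set of server flags and options):
+
+```
+# Install the release packages (same command as the installation step above)
+pip install shark-ai[apps]
+
+# Start the SDXL server; flags are abbreviated here, see the SDXL guide for the complete invocation
+python -m shortfin_apps.sd.server --device=amdgpu --device_ids=0
+
+# In a separate shell, start the interactive client
+python -m shortfin_apps.sd.simple_client --interactive
+```
+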
-Run the [SDXL Server](../shortfin/python/shortfin_apps/sd/README.md#Start-SDXL-Server)
+### SDXL
 
-### Run the SDXL Client
+To get started with SDXL, please follow the [SDXL User Guide](../shortfin/python/shortfin_apps/sd/README.md#Start-SDXL-Server).
 
-```
-python -m shortfin_apps.sd.simple_client --interactive
-```
-Congratulations!!! At this point you can play around with the server and client based on your usage.
-
-### Note: Server implementation scope
-
-The SDXL server's implementation does not account for extremely large client batches. Normally, for heavy workloads, services would be composed under a load balancer to ensure each service is fed with requests optimally. For most cases outside of large-scale deployments, the server's internal batching/load balancing is sufficient.
-
-### Update flags
-
-Please see --help for both the server and client for usage instructions. Here's a quick snapshot.
-
-#### Update server options:
-
-| Flags | options |
-|---|---|
-|--host HOST |
-|--port PORT | server port |
-|--root-path ROOT_PATH |
-|--timeout-keep-alive |
-|--device | local-task,hip,amdgpu | amdgpu only supported in this release
-|--target | gfx942,gfx1100 | gfx942 only supported in this release
-|--device_ids |
-|--tokenizers |
-|--model_config |
-| --workers_per_device |
-| --fibers_per_device |
-| --isolation | per_fiber, per_call, none |
-| --show_progress |
-| --trace_execution |
-| --amdgpu_async_allocations |
-| --splat |
-| --build_preference | compile,precompiled |
-| --compile_flags |
-| --flagfile FLAGFILE |
-| --artifacts_dir ARTIFACTS_DIR | Where to store cached artifacts from the Cloud |
-
-#### Update client with different options:
-
-| Flags |options|
-|---|---
-|--file |
-|--reps |
-|--save | Whether to save image generated by the server |
-|--outputdir| output directory to store images generated by SDXL |
-|--steps |
-|--interactive |
-|--port| port to interact with server |
+### Llama 3.1
+
+To get started with Llama 3.1, please follow the [Llama User Guide](shortfin/llm/user/llama_serving.md).
+
+* Once you've set up the Llama server using the guide above, we recommend using the [SGLang Frontend](https://sgl-project.github.io/frontend/frontend.html) by following the [Using `shortfin` with `sglang` guide](shortfin/llm/user/shortfin_with_sglang_frontend_language.md).
+* If you would like to deploy Llama on a Kubernetes cluster, we also provide a simple set of instructions and a deployment configuration [here](shortfin/llm/user/llama_serving_on_kubernetes.md).
+* Finally, the instructions above also work with other Llama 3.1 variants; however, you will need to generate a GGUF dataset for the variant you want. To do this, use [Hugging Face](https://huggingface.co/)'s [`huggingface-cli`](https://huggingface.co/docs/huggingface_hub/en/guides/cli) in combination with [llama.cpp](https://github.com/ggerganov/llama.cpp)'s `convert_hf_to_gguf.py`. In future releases, we plan to streamline these instructions to make it easier for users to compile their own models from Hugging Face.
diff --git a/shortfin/python/shortfin_apps/sd/README.md b/shortfin/python/shortfin_apps/sd/README.md
index 3397be6cf..d181ed98c 100644
--- a/shortfin/python/shortfin_apps/sd/README.md
+++ b/shortfin/python/shortfin_apps/sd/README.md
@@ -22,9 +22,55 @@ python -m shortfin_apps.sd.server --device=amdgpu --device_ids=0 --build_prefere
 INFO - Application startup complete.
 INFO - Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
 ```
-## Run the SDXL Client
+### Run the SDXL Client
 
 - Run a CLI client in a separate shell:
 ```
 python -m shortfin_apps.sd.simple_client --interactive
 ```
+
+Congratulations! At this point you can experiment with the server and client to suit your usage.
+
+### Note: Server implementation scope
+
+The SDXL server's implementation does not account for extremely large client batches. Normally, for heavy workloads, services would be composed under a load balancer to ensure each service is fed with requests optimally. For most cases outside of large-scale deployments, the server's internal batching/load balancing is sufficient.
+
+### Server and client options
+
+Please see `--help` for both the server and client for full usage instructions. Here's a quick snapshot.
+
+#### Server options
+
+| Flag | Options / description |
+|---|---|
+| --host HOST | |
+| --port PORT | server port |
+| --root-path ROOT_PATH | |
+| --timeout-keep-alive | |
+| --device | local-task, hip, amdgpu (only amdgpu is supported in this release) |
+| --target | gfx942, gfx1100 (only gfx942 is supported in this release) |
+| --device_ids | |
+| --tokenizers | |
+| --model_config | |
+| --workers_per_device | |
+| --fibers_per_device | |
+| --isolation | per_fiber, per_call, none |
+| --show_progress | |
+| --trace_execution | |
+| --amdgpu_async_allocations | |
+| --splat | |
+| --build_preference | compile, precompiled |
+| --compile_flags | |
+| --flagfile FLAGFILE | |
+| --artifacts_dir ARTIFACTS_DIR | where to store cached artifacts from the cloud |
+
+#### Client options
+
+| Flag | Options / description |
+|---|---|
+| --file | |
+| --reps | |
+| --save | whether to save images generated by the server |
+| --outputdir | output directory for images generated by SDXL |
+| --steps | |
+| --interactive | |
+| --port | port to interact with the server |
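+
+#### Example client invocations
+
+A couple of rough examples combining the flags above (values are illustrative; run the client with `--help` to confirm the exact flag syntax):
+
+```
+# Interactive session against a server listening on port 8000
+python -m shortfin_apps.sd.simple_client --interactive --port 8000
+
+# Non-interactive run that saves the generated images to a local directory
+python -m shortfin_apps.sd.simple_client --steps 20 --save --outputdir ./gen_images --port 8000
+```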