---
title: LLM
---

## Overview

The `llm` pipeline provides an OpenAI-compatible interface for text generation,
designed to integrate seamlessly into media workflows.

## Models

The `llm` pipeline supports **any Hugging Face-compatible LLM model**. Since
models evolve quickly, the set of warm (preloaded) models on Orchestrators
changes regularly.

To see which models are currently available, check the
[Network Capabilities dashboard](https://tools.livepeer.cloud/ai/network-capabilities).
At the time of writing, the most commonly available model is
[meta-llama/Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct).
<Tip>
  For faster responses with a different
  [LLM](https://huggingface.co/models?pipeline_tag=text-generation) model, ask
  Orchestrators to load it on their GPU via the `ai-research` channel in the
  [Livepeer Discord Server](https://discord.gg/livepeer).
</Tip>

## Basic Usage Instructions

<Tip>
  For a detailed understanding of the `llm` endpoint and to experiment with the
  API, see the [Livepeer AI API Reference](/ai/api-reference/llm).
</Tip>

To generate text with the `llm` pipeline, send a `POST` request to the Gateway's
`llm` API endpoint:

```bash
curl -X POST "https://<GATEWAY_IP>/llm" \
  -H "Authorization: Bearer <TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [
      { "role": "user", "content": "Tell a robot story." }
    ]
  }'
```

In this command:

- `<GATEWAY_IP>` should be replaced with your AI Gateway's IP address.
- `<TOKEN>` should be replaced with your API token if required by the AI Gateway.
- `model` is the LLM model to use for generation.
- `messages` is the conversation or prompt input for the model.

For additional optional parameters such as `temperature`, `max_tokens`, or
`stream`, refer to the [Livepeer AI API Reference](/ai/api-reference/llm).
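
For example, a request that sets a sampling temperature and caps the output
length might look like the sketch below. The parameter values are illustrative
assumptions; see the API reference for exact names, ranges, and defaults.

```bash
curl -X POST "https://<GATEWAY_IP>/llm" \
  -H "Authorization: Bearer <TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [
      { "role": "user", "content": "Tell a robot story." }
    ],
    "temperature": 0.7,
    "max_tokens": 256
  }'
```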

After execution, the Orchestrator processes the request and returns the response
to the Gateway, which then forwards it back to the client that made the request.

A partial example of a non-streaming response:

```json
{
  "id": "chatcmpl-abc123",
  "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "Once upon a time, in a gleaming city of circuits..."
      }
    }
  ]
}
```

By default, responses are returned as a single JSON object. To stream output
token-by-token using **Server-Sent Events (SSE)**, set `"stream": true` in the
request body.
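
A minimal streaming sketch with `curl` is shown below; the `-N` flag disables
output buffering so SSE chunks print as they arrive. The exact shape of each
`data:` chunk is documented in the API reference.

```bash
curl -N -X POST "https://<GATEWAY_IP>/llm" \
  -H "Authorization: Bearer <TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [
      { "role": "user", "content": "Tell a robot story." }
    ],
    "stream": true
  }'
```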

## Orchestrator Configuration

To configure your Orchestrator to serve the `llm` pipeline, refer to the
[Orchestrator Configuration](/ai/orchestrators/get-started) guide.

### Tuning Environment Variables

The `llm` pipeline supports several environment variables that can be adjusted
to optimize performance based on your hardware and workload. These are
particularly helpful for managing memory usage and parallelism when running
large models.

<ParamField path="USE_8BIT" type="boolean">
  Enables 8-bit quantization using `bitsandbytes` for lower memory usage. Set to
  `true` to enable. Defaults to `false`.
</ParamField>
<ParamField path="PIPELINE_PARALLEL_SIZE" type="integer">
  Number of pipeline parallel stages. Defaults to `1`.
</ParamField>
<ParamField path="TENSOR_PARALLEL_SIZE" type="integer">
  Number of tensor parallel units. Must divide evenly into the number of
  attention heads in the model. Defaults to `1`.
</ParamField>
<ParamField path="MAX_MODEL_LEN" type="integer">
  Maximum number of tokens per input sequence. Defaults to `8192`.
</ParamField>
<ParamField path="MAX_NUM_BATCHED_TOKENS" type="integer">
  Maximum number of tokens processed in a single batch. Should be greater than
  or equal to `MAX_MODEL_LEN`. Defaults to `8192`.
</ParamField>
<ParamField path="MAX_NUM_SEQS" type="integer">
  Maximum number of sequences processed per batch. Defaults to `128`.
</ParamField>
<ParamField path="GPU_MEMORY_UTILIZATION" type="float">
  Target GPU memory utilization as a float between `0` and `1`. Higher values
  make fuller use of GPU memory. Defaults to `0.85`.
</ParamField>
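
As a rough sketch of how these variables might be set, the example below passes
a few of them to the runner container with `-e` flags. The image name and tag
are placeholder assumptions; use the runner image and launch flags from your
actual Orchestrator setup (see the Orchestrator Configuration guide).

```bash
# Illustrative only: the image name/tag below is a placeholder; replace it
# with the runner image your deployment actually uses.
docker run --gpus all \
  -e USE_8BIT=false \
  -e MAX_MODEL_LEN=8192 \
  -e MAX_NUM_BATCHED_TOKENS=8192 \
  -e GPU_MEMORY_UTILIZATION=0.85 \
  livepeer/ai-runner:llm
```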

### System Requirements

The following system requirements are recommended for optimal performance:

- [NVIDIA GPU](https://developer.nvidia.com/cuda-gpus) with **at least 16GB** of
  VRAM.

## Recommended Pipeline Pricing

<Note>
  We are planning to simplify pricing in the future so Orchestrators can set
  one AI price per compute unit and have the system automatically scale based on
  the model's compute requirements.
</Note>

The `/llm` pipeline is currently priced based on the **maximum output tokens**
specified in the request, not actual usage, due to current payment system
limitations. We're actively working to support usage-based pricing to better
align with industry standards.

The LLM pricing landscape is highly competitive and rapidly evolving.
Orchestrators should set prices based on their infrastructure costs and
[market positioning](https://llmpricecheck.com/). As a reference, inference on
`llama-3-8b-instruct` is currently around `0.08 USD` per 1 million **output
tokens**.
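
As a rough, illustrative calculation of how max-token-based pricing plays out,
the snippet below computes the charge for a single request at that reference
rate; the `max_tokens` value of `256` is an assumed example, not a
recommendation.

```bash
# Illustrative only: the charge depends on the requested max output tokens,
# not on the tokens actually generated.
PRICE_PER_MILLION=0.08   # USD per 1 million output tokens (reference rate)
MAX_TOKENS=256           # assumed max_tokens value in the request

# 256 * 0.08 / 1,000,000 = 0.00002048 USD per request
echo "scale=8; $MAX_TOKENS * $PRICE_PER_MILLION / 1000000" | bc
```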

## API Reference

<Card
  title="API Reference"
  icon="rectangle-terminal"
  href="/ai/api-reference/llm"
>
  Explore the `llm` endpoint and experiment with the API in the Livepeer AI API
  Reference.
</Card>