Commit fc2b4d4

docs(ai): add basic llm docs (#712)
Add some LLM pipeline documentation to provide the base for building on the Livepeer LLM pipeline.

Co-authored-by: Rick Staa <[email protected]>
1 parent f2f03c4 commit fc2b4d4

5 files changed: +367 -16 lines changed


ai/api-reference/llm.mdx

+48 -16
@@ -1,24 +1,11 @@
---
openapi: post /llm
---
-<Warning>
-We are currently deploying the Large Language Model (LLM) pipeline to our gateway infrastructure.
-This warning will be removed once all listed gateways have successfully transitioned to serving the LLM pipeline, ensuring a seamless and enhanced user experience.
-</Warning>
-<Note>
-The LLM pipeline supports streaming response by setting `stream=true` in the request. The response is then streamed with Server Sent Events (SSE)
-in chunks as the tokens are generated.
-
-Each streaming response chunk will have the following format:
-
-`data: {"chunk": "word "}`
-
-The final chunk of the response will be indicated by the following format:
-
-`data: {"chunk": "[DONE]", "tokens_used": 256, "done": true}`

-The Response type below is for non-streaming responses that will return all of the response in one
+<Note>
+The LLM pipeline is OpenAI API-compatible but does **not** implement all features of the OpenAI API.
</Note>
+
<Info>
The default Gateway used in this guide is the public
[Livepeer.cloud](https://www.livepeer.cloud/) Gateway. It is free to use but
@@ -28,3 +15,48 @@ The Response type below is for non-streaming responses that will return all of t
Gateway node or partner with one via the `ai-video` channel on
[Discord](https://discord.gg/livepeer).
</Info>
+
+### Streaming Responses
+
+<Note>
+Ensure your client supports SSE and processes each `data:` line as it arrives.
+</Note>
+
+By default, the `/llm` endpoint returns a single JSON response in the OpenAI
+[chat/completions](https://platform.openai.com/docs/api-reference/chat/object)
+format, as shown in the sidebar.
+
+To receive responses token-by-token, set `"stream": true` in the request body. The server will then use **Server-Sent Events (SSE)** to stream output in real time.
+
+Each streamed chunk will look like:
+
+```json
+data: {
+  "choices": [
+    {
+      "delta": {
+        "content": "...token...",
+        "role": "assistant"
+      },
+      "finish_reason": null
+    }
+  ]
+}
+```
+
+The final chunk will have empty content and `"finish_reason": "stop"`:
+
+```json
+data: {
+  "choices": [
+    {
+      "delta": {
+        "content": "",
+        "role": "assistant"
+      },
+      "finish_reason": "stop"
+    }
+  ]
+}
+```
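As a sketch of how a client might consume this stream (the gateway address, token, and model below are placeholders, and `jq` plus GNU `sed` are assumed to be available):

```bash
# Stream tokens from the /llm endpoint.
# -N disables curl's output buffering so SSE chunks are shown immediately.
# Each token is printed on its own line as it arrives.
curl -N -X POST "https://<GATEWAY_IP>/llm" \
  -H "Authorization: Bearer <TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [{ "role": "user", "content": "Tell a robot story." }],
    "stream": true
  }' \
  | sed -u 's/^data: //' \
  | jq -Rr 'fromjson? | .choices[0].delta.content // empty'
```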

ai/pipelines/llm.mdx

+156
@@ -0,0 +1,156 @@
---
title: LLM
---

## Overview

The `llm` pipeline provides an OpenAI-compatible interface for text generation,
designed to integrate seamlessly into media workflows.

## Models

The `llm` pipeline supports **any Hugging Face-compatible LLM model**. Since
models evolve quickly, the set of warm (preloaded) models on Orchestrators
changes regularly.

To see which models are currently available, check the
[Network Capabilities dashboard](https://tools.livepeer.cloud/ai/network-capabilities).
At the time of writing, the most commonly available model is
[meta-llama/Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct).

<Tip>
For faster responses with different
[LLM](https://huggingface.co/models?pipeline_tag=text-generation)
models, ask Orchestrators to load them on their GPUs via the `ai-research`
channel in the [Discord Server](https://discord.gg/livepeer).
</Tip>
## Basic Usage Instructions

<Tip>
For a detailed understanding of the `llm` endpoint and to experiment with the
API, see the [Livepeer AI API Reference](/ai/api-reference/llm).
</Tip>

To generate text with the `llm` pipeline, send a `POST` request to the Gateway's
`llm` API endpoint:

```bash
curl -X POST "https://<GATEWAY_IP>/llm" \
  -H "Authorization: Bearer <TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [
      { "role": "user", "content": "Tell a robot story." }
    ]
  }'
```

In this command:

- `<GATEWAY_IP>` should be replaced with your AI Gateway's IP address.
- `<TOKEN>` should be replaced with your API token if required by the AI Gateway.
- `model` is the LLM model to use for generation.
- `messages` is the conversation or prompt input for the model.

For additional optional parameters such as `temperature`, `max_tokens`, or
`stream`, refer to the [Livepeer AI API Reference](/ai/api-reference/llm).

After execution, the Orchestrator processes the request and returns the response
to the Gateway, which forwards it back to the client that made the request.

A partial example of a non-streaming response:

```json
{
  "id": "chatcmpl-abc123",
  "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "Once upon a time, in a gleaming city of circuits..."
      }
    }
  ]
}
```

By default, responses are returned as a single JSON object. To stream output
token-by-token using **Server-Sent Events (SSE)**, set `"stream": true` in the
request body.
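For example, a streaming request that also sets the optional `temperature` and
`max_tokens` parameters might look like the sketch below (the `-N` flag keeps
curl from buffering the streamed output; the parameter values are illustrative
and exact parameter support can vary by model and runner version):

```bash
curl -N -X POST "https://<GATEWAY_IP>/llm" \
  -H "Authorization: Bearer <TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [
      { "role": "user", "content": "Tell a robot story." }
    ],
    "temperature": 0.7,
    "max_tokens": 256,
    "stream": true
  }'
```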
## Orchestrator Configuration

To configure your Orchestrator to serve the `llm` pipeline, refer to the
[Orchestrator Configuration](/ai/orchestrators/get-started) guide.

### Tuning Environment Variables

The `llm` pipeline supports several environment variables that can be adjusted
to optimize performance based on your hardware and workload. These are
particularly helpful for managing memory usage and parallelism when running
large models.

<ParamField path="USE_8BIT" type="boolean">
Enables 8-bit quantization using `bitsandbytes` for lower memory usage. Set to
`true` to enable. Defaults to `false`.
</ParamField>
<ParamField path="PIPELINE_PARALLEL_SIZE" type="integer">
Number of pipeline parallel stages. Defaults to `1`.
</ParamField>
<ParamField path="TENSOR_PARALLEL_SIZE" type="integer">
Number of tensor parallel units. Must divide evenly into the number of
attention heads in the model. Defaults to `1`.
</ParamField>
<ParamField path="MAX_MODEL_LEN" type="integer">
Maximum number of tokens per input sequence. Defaults to `8192`.
</ParamField>
<ParamField path="MAX_NUM_BATCHED_TOKENS" type="integer">
Maximum number of tokens processed in a single batch. Should be greater than
or equal to `MAX_MODEL_LEN`. Defaults to `8192`.
</ParamField>
<ParamField path="MAX_NUM_SEQS" type="integer">
Maximum number of sequences processed per batch. Defaults to `128`.
</ParamField>
<ParamField path="GPU_MEMORY_UTILIZATION" type="float">
Target GPU memory utilization as a float between `0` and `1`. Higher values
make fuller use of GPU memory. Defaults to `0.85`.
</ParamField>
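As an illustrative sketch only, these variables are typically passed to the
runner container when it is launched; the image name below is a placeholder and
the exact command depends on how your Orchestrator runs the pipeline:

```bash
# Hypothetical launch command: substitute your actual runner image and flags.
docker run --gpus all \
  -e USE_8BIT=false \
  -e TENSOR_PARALLEL_SIZE=1 \
  -e MAX_MODEL_LEN=8192 \
  -e GPU_MEMORY_UTILIZATION=0.85 \
  <your-llm-runner-image>
```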
### System Requirements

The following system requirements are recommended for optimal performance:

- [NVIDIA GPU](https://developer.nvidia.com/cuda-gpus) with **at least 16GB** of
  VRAM.

## Recommended Pipeline Pricing

<Note>
We are planning to simplify the pricing in the future so orchestrators can set
one AI price per compute unit and have the system automatically scale based on
the model's compute requirements.
</Note>

The `/llm` pipeline is currently priced based on the **maximum output tokens**
specified in the request — not actual usage — due to current payment system
limitations. We're actively working to support usage-based pricing to better
align with industry standards.

The LLM pricing landscape is highly competitive and rapidly evolving.
Orchestrators should set prices based on their infrastructure costs and
[market positioning](https://llmpricecheck.com/). As a reference, inference on
`llama-3-8b-instruct` is currently around `0.08 USD` per 1 million **output
tokens**.
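For a rough sense of scale at that reference price, a request capped at a
`max_tokens` of 512 would be charged for the full 512 output tokens:

```bash
# Illustrative arithmetic only: 512 tokens at 0.08 USD per 1,000,000 output tokens.
echo "512 * 0.08 / 1000000" | bc -l   # ≈ 0.00004096 USD
```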
## API Reference

<Card
  title="API Reference"
  icon="rectangle-terminal"
  href="/ai/api-reference/llm"
>
  Explore the `llm` endpoint and experiment with the API in the Livepeer AI API
  Reference.
</Card>

ai/pipelines/overview.mdx

+4
@@ -98,4 +98,8 @@ pipelines:
The upscale pipeline transforms low-resolution images into high-quality ones
without distortion
</Card>
+<Card title="LLM" icon="rectangle-terminal" href="/ai/pipelines/llm">
+  The LLM pipeline provides an OpenAI-compatible interface for text
+  generation, enabling seamless integration into media workflows.
+</Card>
</CardGroup>

api-reference/generate/llm.mdx

+156
@@ -0,0 +1,156 @@
---
title: LLM
---

## Overview

The `llm` pipeline provides an OpenAI-compatible interface for text generation,
designed to integrate seamlessly into media workflows.

## Models

The `llm` pipeline supports **any Hugging Face-compatible LLM model**. Since
models evolve quickly, the set of warm (preloaded) models on Orchestrators
changes regularly.

To see which models are currently available, check the
[Network Capabilities dashboard](https://tools.livepeer.cloud/ai/network-capabilities).
At the time of writing, the most commonly available model is
[meta-llama/Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct).

<Tip>
For faster responses with different
[LLM](https://huggingface.co/models?pipeline_tag=text-generation)
models, ask Orchestrators to load them on their GPUs via the `ai-video` channel
in the [Discord Server](https://discord.gg/livepeer).
</Tip>
## Basic Usage Instructions

<Tip>
For a detailed understanding of the `llm` endpoint and to experiment with the
API, see the [Livepeer AI API Reference](/ai/api-reference/llm).
</Tip>

To generate text with the `llm` pipeline, send a `POST` request to the Gateway's
`llm` API endpoint:

```bash
curl -X POST "https://<GATEWAY_IP>/llm" \
  -H "Authorization: Bearer <TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [
      { "role": "user", "content": "Tell a robot story." }
    ]
  }'
```

In this command:

- `<GATEWAY_IP>` should be replaced with your AI Gateway's IP address.
- `<TOKEN>` should be replaced with your API token.
- `model` is the LLM model to use for generation.
- `messages` is the conversation or prompt input for the model.

For additional optional parameters such as `temperature`, `max_tokens`, or
`stream`, refer to the [Livepeer AI API Reference](/ai/api-reference/llm).

After execution, the Orchestrator processes the request and returns the response
to the Gateway:

```json
{
  "id": "chatcmpl-abc123",
  "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "Once upon a time, in a gleaming city of circuits..."
      }
    }
  ]
}
```
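To pull just the generated text out of this JSON object, the response can be
piped through a JSON processor; as a sketch (placeholders as above, with `jq`
assumed to be installed):

```bash
# Extract only the assistant's message content from a non-streaming /llm response.
curl -s -X POST "https://<GATEWAY_IP>/llm" \
  -H "Authorization: Bearer <TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [{ "role": "user", "content": "Tell a robot story." }]
  }' \
  | jq -r '.choices[0].message.content'
```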
By default, responses are returned as a single JSON object. To stream output
token-by-token using **Server-Sent Events (SSE)**, set `"stream": true` in the
request body.

## Orchestrator Configuration

To configure your Orchestrator to serve the `llm` pipeline, refer to the
[Orchestrator Configuration](/ai/orchestrators/get-started) guide.

### Tuning Environment Variables

The `llm` pipeline supports several environment variables that can be adjusted
to optimize performance based on your hardware and workload. These are
particularly helpful for managing memory usage and parallelism when running
large models.

<ParamField path="USE_8BIT" type="boolean">
Enables 8-bit quantization using `bitsandbytes` for lower memory usage. Set to
`true` to enable. Defaults to `false`.
</ParamField>
<ParamField path="PIPELINE_PARALLEL_SIZE" type="integer">
Number of pipeline parallel stages. Should not exceed the number of model
layers. Defaults to `1`.
</ParamField>
<ParamField path="TENSOR_PARALLEL_SIZE" type="integer">
Number of tensor parallel units. Must divide evenly into the number of
attention heads in the model. Defaults to `1`.
</ParamField>
<ParamField path="MAX_MODEL_LEN" type="integer">
Maximum number of tokens per input sequence. Defaults to `8192`.
</ParamField>
<ParamField path="MAX_NUM_BATCHED_TOKENS" type="integer">
Maximum number of tokens processed in a single batch. Should be greater than
or equal to `MAX_MODEL_LEN`. Defaults to `8192`.
</ParamField>
<ParamField path="MAX_NUM_SEQS" type="integer">
Maximum number of sequences processed per batch. Defaults to `128`.
</ParamField>
<ParamField path="GPU_MEMORY_UTILIZATION" type="float">
Target GPU memory utilization as a float between `0` and `1`. Higher values
make fuller use of GPU memory. Defaults to `0.97`.
</ParamField>
### System Requirements

The following system requirements are recommended for optimal performance:

- [NVIDIA GPU](https://developer.nvidia.com/cuda-gpus) with **at least 16GB** of
  VRAM.

## Recommended Pipeline Pricing

<Note>
We are planning to simplify the pricing in the future so orchestrators can set
one AI price per compute unit and have the system automatically scale based on
the model's compute requirements.
</Note>

The `/llm` pipeline is currently priced based on the **maximum output tokens**
specified in the request — not actual usage — due to current payment system
limitations. We're actively working to support usage-based pricing to better
align with industry standards.

The LLM pricing landscape is highly competitive and rapidly evolving.
Orchestrators should set prices based on their infrastructure costs and
[market positioning](https://llmpricecheck.com/). As a reference, inference on
`llama-3-8b-instruct` is currently around `0.08 USD` per 1 million **output
tokens**.

## API Reference

<Card
  title="API Reference"
  icon="rectangle-terminal"
  href="/ai/api-reference/llm"
>
  Explore the `llm` endpoint and experiment with the API in the Livepeer AI API
  Reference.
</Card>
