- Docker Engine - Installation
- Docker Compose - Installation
- NVIDIA Container Toolkit - Installation
- NVIDIA GPU (both the converter and Triton require GPU access)
- OS with GPU:
  - Linux / WSL with an RTX 4090
  - Linux / WSL with an RTX 3090
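
Before starting the stack, it is worth sanity-checking that containers can reach the GPU at all. The CUDA image tag below is only an example; any CUDA-enabled image works:

```sh
# Should print the same GPU table as running nvidia-smi on the host.
# Fails if the NVIDIA Container Toolkit is not set up correctly.
docker run --rm --gpus all nvidia/cuda:12.3.1-base-ubuntu22.04 nvidia-smi
```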
- Components (public images):
  - converter: ghcr.io/janhq/triton_tensorrt_llm:engine_build_89_90_5955b8afbad2ddcc3156202b16c567e94c52248f
  - triton: ghcr.io/janhq/triton_tensorrt_llm:engine_build_89_90_41fe3a6a9daa12c64403e084298c6169b07d489d
  - proxy (from openai_trtllm): ghcr.io/janhq/triton_tensorrt_llm:proxy_openai_2ec5869dc61362118ebef7f097e00d2da0cc0f69
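
The images can be pulled ahead of time so `docker compose up -d` does not block on downloads:

```sh
docker pull ghcr.io/janhq/triton_tensorrt_llm:engine_build_89_90_5955b8afbad2ddcc3156202b16c567e94c52248f
docker pull ghcr.io/janhq/triton_tensorrt_llm:engine_build_89_90_41fe3a6a9daa12c64403e084298c6169b07d489d
docker pull ghcr.io/janhq/triton_tensorrt_llm:proxy_openai_2ec5869dc61362118ebef7f097e00d2da0cc0f69
```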
- Run command:

```sh
docker compose up -d
```
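
This command expects a docker-compose.yml in the working directory; the repository's actual file is authoritative. As a rough sketch of what it wires together: the service names, GPU reservation, and port mappings below are assumptions inferred from the curl examples in the next section (Triton on 8000, the proxy on 3000), and the one-shot converter step is omitted:

```yaml
# Minimal sketch, NOT the repository's actual compose file.
# Ports are inferred from the curl examples below; service names
# and the GPU reservation are assumptions.
services:
  triton:
    image: ghcr.io/janhq/triton_tensorrt_llm:engine_build_89_90_41fe3a6a9daa12c64403e084298c6169b07d489d
    ports:
      - "8000:8000"   # Triton HTTP endpoint (generate / generate_stream)
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
  proxy:
    image: ghcr.io/janhq/triton_tensorrt_llm:proxy_openai_2ec5869dc61362118ebef7f097e00d2da0cc0f69
    ports:
      - "3000:3000"   # OpenAI-compatible endpoint
    depends_on:
      - triton
```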
- Test the API:
```sh
# Streaming generation from Triton
curl --location 'http://localhost:8000/v2/models/tensorrt_llm_bls/generate_stream' \
--header 'Accept: text/event-stream' \
--header 'Content-Type: application/json' \
--data '{
    "text_input": "What is machine learning?",
    "parameters": {
        "stream": true,
        "temperature": 0,
        "max_tokens": 20
    }
}'
```
```sh
# Non-streaming generation from Triton
curl --location 'http://localhost:8000/v2/models/tensorrt_llm_bls/generate' \
--header 'Content-Type: application/json' \
--data '{
    "text_input": "What is machine learning?",
    "parameters": {
        "stream": false,
        "temperature": 0,
        "max_tokens": 20
    }
}'
```
```sh
# OpenAI-compatible chat completion via the proxy
curl --location 'http://localhost:3000/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
    "messages": [
        {
            "role": "user",
            "content": "Who is Jensen Huang?"
        }
    ],
    "stream": true,
    "model": "tensorrt_llm_bls",
    "max_tokens": 2048,
    "temperature": 0.7
}'
```
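
If the proxy mirrors the standard OpenAI response schema (an assumption based on openai_trtllm's OpenAI compatibility), a non-streaming request can be piped through jq to pull out just the reply text:

```sh
# Non-streaming variant; assumes the proxy returns the standard
# OpenAI chat-completion shape (choices[0].message.content).
curl -s --location 'http://localhost:3000/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
    "messages": [{"role": "user", "content": "Who is Jensen Huang?"}],
    "stream": false,
    "model": "tensorrt_llm_bls",
    "max_tokens": 256,
    "temperature": 0.7
}' | jq -r '.choices[0].message.content'
```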