OpenAI-Compatible RESTful APIs & SDK

FastChat provides OpenAI-compatible APIs for its supported models, so you can use FastChat as a local drop-in replacement for OpenAI APIs. The FastChat server is compatible with both openai-python library and cURL commands.

The following OpenAI APIs are supported:

Chat Completions. (Reference: https://platform.openai.com/docs/api-reference/chat)
Completions. (Reference: https://platform.openai.com/docs/api-reference/completions)
Embeddings. (Reference: https://platform.openai.com/docs/api-reference/embeddings)

RESTful API Server

First, launch the controller

python3 -m fastchat.serve.controller

Then, launch the model worker(s)

python3 -m fastchat.serve.model_worker --model-path lmsys/vicuna-7b-v1.3

Finally, launch the RESTful API server

python3 -m fastchat.serve.openai_api_server --host localhost --port 8000

Now, let us test the API server.

OpenAI Official SDK

The goal of openai_api_server.py is to implement a fully OpenAI-compatible API server, so the models can be used directly with openai-python library.

First, install openai-python:

pip install --upgrade openai

Then, interact with model vicuna:

import openai
openai.api_key = "EMPTY" # Not support yet
openai.api_base = "http://localhost:8000/v1"

model = "vicuna-7b-v1.3"
prompt = "Once upon a time"

# create a completion
completion = openai.Completion.create(model=model, prompt=prompt, max_tokens=64)
# print the completion
print(prompt + completion.choices[0].text)

# create a chat completion
completion = openai.ChatCompletion.create(
  model=model,
  messages=[{"role": "user", "content": "Hello! What is your name?"}]
)
# print the completion
print(completion.choices[0].message.content)

Streaming is also supported. See test_openai_api.py.

cURL

cURL is another good tool for observing the output of the api.

List Models:

curl http://localhost:8000/v1/models

Chat Completions:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vicuna-7b-v1.3",
    "messages": [{"role": "user", "content": "Hello! What is your name?"}]
  }'

Text Completions:

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vicuna-7b-v1.3",
    "prompt": "Once upon a time",
    "max_tokens": 41,
    "temperature": 0.5
  }'

Embeddings:

curl http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vicuna-7b-v1.3",
    "input": "Hello world!"
  }'

LangChain Support

This OpenAI-compatible API server supports LangChain. See LangChain Integration for details.

Adjusting Environment Variables

Timeout

By default, a timeout error will occur if a model worker does not response within 100 seconds. If your model/hardware is slower, you can change this timeout through an environment variable:

export FASTCHAT_WORKER_API_TIMEOUT=<larger timeout in seconds>

Batch size

If you meet the following OOM error while creating embeddings. You can use a smaller batch size by setting

export FASTCHAT_WORKER_API_EMBEDDING_BATCH_SIZE=1

Todos

Some features to be implemented:

Support more parameters like logprobs, logit_bias, user, presence_penalty and frequency_penalty
Model details (permissions, owner and create time)
Edits API
Authentication and API key
Rate Limitation Settings

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

openai_api.md

openai_api.md

OpenAI-Compatible RESTful APIs & SDK

RESTful API Server

OpenAI Official SDK

cURL

LangChain Support

Adjusting Environment Variables

Timeout

Batch size

Todos

Files

openai_api.md

Latest commit

History

openai_api.md

File metadata and controls

OpenAI-Compatible RESTful APIs & SDK

RESTful API Server

OpenAI Official SDK

cURL

LangChain Support

Adjusting Environment Variables

Timeout

Batch size

Todos