
Usage

There are two ways to use Aphrodite Engine: via the OpenAI-compatible API server, or via the provided LLM class.

API Server

Aphrodite provides two REST API servers: an OpenAI-compatible one and a KoboldAI-compatible one. Below are examples of running Llama 3 8B on 2 GPUs:

OpenAI

aphrodite run meta-llama/Meta-Llama-3-8B -tp 2

You can query the server via curl:

curl http://localhost:2242/v1/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "meta-llama/Meta-Llama-3-8B",
  "prompt": "Every age it seems is tainted by the greed of men. Rubbish to one such as I,",
  "stream": false,
  "mirostat_mode": 2,
  "mirostat_tau": 6.5,
  "mirostat_eta": 0.2
}'
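
Alternatively, you can query the endpoint from Python with the openai client library. This is a minimal sketch, assuming the default port 2242 and that no API key is enforced on the server (the client still requires a placeholder key); Aphrodite-specific sampler settings such as the mirostat fields are passed through extra_body:

from openai import OpenAI

# Point the client at the local Aphrodite server (placeholder API key).
client = OpenAI(base_url="http://localhost:2242/v1", api_key="EMPTY")

completion = client.completions.create(
    model="meta-llama/Meta-Llama-3-8B",
    prompt="Every age it seems is tainted by the greed of men. Rubbish to one such as I,",
    # Aphrodite-specific sampler settings go through extra_body.
    extra_body={"mirostat_mode": 2, "mirostat_tau": 6.5, "mirostat_eta": 0.2},
)
print(completion.choices[0].text)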

KoboldAI

Simply launch the OpenAI endpoint with the --launch-kobold-api flag.
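
For example, reusing the launch command from above:

aphrodite run meta-llama/Meta-Llama-3-8B -tp 2 --launch-kobold-api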

And the curl request:

curl -X 'POST' \
  'http://localhost:2242/api/v1/generate' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "prompt": "Niko the kobold stalked carefully down the alley, his small scaly figure obscured by a dusky cloak that fluttered lightly in the cold winter breeze.",
  "max_context_length": 32768,
  "max_length": 512,
  "stream": false,
  "mirostat_mode": 2,
  "mirostat_tau": 6.5,
  "mirostat_eta": 0.2
}' 

Keep in mind that -tp 2 (the tensor parallel size) uses the first two visible GPUs. Adjust the value to match the number of GPUs you want to use.
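
If you want to use specific GPUs rather than the first ones, you can restrict which devices are visible with the standard CUDA_VISIBLE_DEVICES environment variable (a sketch; the GPU indices here are just an example):

CUDA_VISIBLE_DEVICES=2,3 aphrodite run meta-llama/Meta-Llama-3-8B -tp 2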

Offline Inference

You can also use Aphrodite without setting up a REST API server, e.g. if you want to call it directly from your own scripts.

First, import the LLM class, which handles the model-related configuration, and SamplingParams for specifying sampler settings.

from aphrodite import LLM, SamplingParams

Then, define a single prompt or a list of prompts for the model.

prompts = [
    "What is a man? A miserable little",
    "Once upon a time",
]

Specify the sampling parameters:

sampling_params = SamplingParams(temperature=1.1, min_p=0.05)

Define the model to use and run generation:

llm = LLM(model="mistralai/Mistral-7B-v0.1", tensor_parallel_size=2)
outputs = llm.generate(prompts, sampling_params)

The llm.generate method will use the loaded model to process the prompts. You can then print out the responses:

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Output: {generated_text!r}")