9. LoRA support
Aphrodite supports loading multiple LoRA adapters, using an adaptation of the S-LoRA technique. The implementation is based on Punica.
The LoRA support comes with some limitations:
- Only up to rank 64 is supported.
- Not all model classes support LoRAs.
- Among quantized models, only GPTQ and AWQ support LoRA; other quantization methods will not work.
- The adapters must be stored on disk.
You can load LoRA modules into the OpenAI API server like this:
python -m aphrodite.endpoints.openai.api_server \
--model mistralai/Mistral-7B-Instruct-v0.2 \
--enable-lora --lora-modules \
lora-1=/path/to/lora1 \
lora-2=/path/to/lora2 \
lora-3=/path/to/lora3 ...
This will launch the engine with 4 models available to be queried. The /v1/models
route will list all 4: the base model and the three LoRA adapters. Requests sent to the base model will have no LoRA applied to them, while requests sent to an adapter will apply that LoRA on top of the base model before processing the request.
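Once the server is up, the adapters can be queried like any other model through the OpenAI-compatible API. A minimal sketch using the standard openai Python client, assuming the server is reachable on port 2242 locally (adjust the base URL and key to your setup):

```python
from openai import OpenAI

# Point the client at the local Aphrodite server (port is an assumption).
client = OpenAI(base_url="http://localhost:2242/v1", api_key="EMPTY")

# List the available models: the base model plus the LoRA adapters.
for model in client.models.list():
    print(model.id)

# Request one of the adapters by name; the LoRA is applied on top of the base model.
completion = client.chat.completions.create(
    model="lora-1",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(completion.choices[0].message.content)
```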
By default, only LoRAs up to rank 32 can be loaded. You can use the --max-lora-rank
flag to raise this limit, e.g. to 64.
Quantized models, as explained above, are supported. However, it is highly recommended that you do not apply LoRAs on top of quantized models; doing so leads to some quality loss. The recommended method is to merge the LoRA into the base FP16 model and then quantize the result. You can use this script to merge.
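If you prefer not to use that script, a rough sketch of the same merge using Hugging Face PEFT's merge_and_unload could look like the following (the adapter and output paths are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = "mistralai/Mistral-7B-Instruct-v0.2"
lora_path = "/path/to/lora1"             # placeholder adapter directory
output_dir = "./mistral-7b-lora-merged"  # placeholder output path

# Load the FP16 base model and attach the LoRA adapter.
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(model, lora_path)

# Fold the LoRA weights into the base weights and drop the adapter layers.
model = model.merge_and_unload()

# Save the merged FP16 model; quantize this output afterwards (e.g. with GPTQ or AWQ).
model.save_pretrained(output_dir)
AutoTokenizer.from_pretrained(base).save_pretrained(output_dir)
```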