9. LoRA support
Aphrodite supports loading multiple LoRA adapters, using an adaptation of the S-LoRA technique. The implementation is based on Punica.
The LoRA support comes with some limitations:
- Only up to rank 64 is supported.
- Not all model classes support LoRAs.
- Among quantized models, only GPTQ and AWQ support LoRA; other quantization methods will not work.
- The adapters must be stored on disk.
You can load LoRA modules into the OpenAI API server like this:
python -m aphrodite.endpoints.openai.api_server \
--model mistralai/Mistral-7B-Instruct-v0.2 \
--enable-lora --lora-modules \
lora-1=/path/to/lora1 \
lora-2=/path/to/lora2 \
lora-3=/path/to/lora3 ...
This will launch the engine with 4 models available to be queried. The /v1/models
route will list all 4: the base model and the three LoRA adapters. Requests sent to the base model will have no LoRA applied to them, while requests sent to an adapter will apply that LoRA on top of the base model before processing the request.
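Once the server is up, the adapters can be queried like any other model through the OpenAI-compatible API. A minimal sketch using the standard openai Python client, assuming the server is reachable on port 2242 locally (adjust the base URL and key to your setup):

```python
from openai import OpenAI

# Point the client at the local Aphrodite server (port is an assumption).
client = OpenAI(base_url="http://localhost:2242/v1", api_key="EMPTY")

# List the available models: the base model plus the LoRA adapters.
for model in client.models.list():
    print(model.id)

# Request one of the adapters by name; the LoRA is applied on top of the base model.
completion = client.chat.completions.create(
    model="lora-1",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(completion.choices[0].message.content)
```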
By default, only LoRAs up to rank 32 can be loaded. You can use the --max-lora-rank
flag to raise this limit, e.g. to 64.
Quantized models, as explained above, are supported. However, it is highly recommended that you do not apply LoRAs on top of quantized models; doing so leads to some quality loss. The recommended method is to merge the LoRA into the base FP16 model and then quantize the result. You can use this script to merge.
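If you prefer not to use that script, a rough sketch of the same merge using Hugging Face PEFT's merge_and_unload could look like the following (the adapter and output paths are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = "mistralai/Mistral-7B-Instruct-v0.2"
lora_path = "/path/to/lora1"             # placeholder adapter directory
output_dir = "./mistral-7b-lora-merged"  # placeholder output path

# Load the FP16 base model and attach the LoRA adapter.
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(model, lora_path)

# Fold the LoRA weights into the base weights and drop the adapter layers.
model = model.merge_and_unload()

# Save the merged FP16 model; quantize this output afterwards (e.g. with GPTQ or AWQ).
model.save_pretrained(output_dir)
AutoTokenizer.from_pretrained(base).save_pretrained(output_dir)
```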