Support chat serving for more models #44
Comments
The Phi3 model was added in PR #45. Command line to run the Phi3 3.8B chat service (see the sketch below).
It uses mixed precision (F32 for rope/rmsnorm, BF16 for the other ops) to handle long-sequence generation (e.g., prompts over 2k tokens). Tested decoding speed on an A100: 99 tokens/s. You can run Phi3 7B by pointing the weight path at a different checkpoint, since the pipeline loads models from the corresponding config.json (I haven't tested Phi3 7B, but it should work in theory).
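The actual command from the original comment is not preserved in this copy. As a minimal sketch only, assuming candle-vllm's `cargo run --release -- --port <port> --weight-path <dir> <model>` invocation pattern (the port, path, and flags below are illustrative assumptions, not taken from the source):

```shell
# Hypothetical invocation (port, path, and flags are assumptions):
# serve the Phi3 3.8B chat model from a local weight directory
cargo run --release -- --port 2000 --weight-path /path/to/Phi-3-mini-4k-instruct/ phi3
```

Per the comment, pointing the weight path at a Phi3 7B checkpoint should work unchanged, since the pipeline reads the model's config.json.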
The Qwen2 model was added in PR #46. Command line to run the Qwen2 1.8B chat service (the original comment gave two alternative commands; see the sketch below).
Tested decoding speed on an A100: ~150 tokens/s.
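The two alternative commands referenced above are missing from this copy. A sketch under the same assumptions as the Phi3 example (the flag names, model ID, and paths are assumptions):

```shell
# Hypothetical: serve Qwen2 1.8B from a local weight directory
cargo run --release -- --port 2000 --weight-path /path/to/Qwen1.5-1.8B-Chat/ qwen2

# or (hypothetical): load by model ID instead of a local path
cargo run --release -- --port 2000 --model-id Qwen/Qwen1.5-1.8B-Chat qwen2
```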
Mistral, Yi, and StableLM are supported in #53 and #57. Running cases (see the sketch below):
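The running cases themselves are not preserved here. A sketch for one of the three models, plus a request against the chat endpoint, assuming candle-vllm exposes an OpenAI-compatible `/v1/chat/completions` route on the chosen port (the path, port, subcommand name, and request body are assumptions):

```shell
# Hypothetical: serve Mistral 7B from a local weight directory
cargo run --release -- --port 2000 --weight-path /path/to/mistral_7b/ mistral

# Hypothetical chat request against the served model (OpenAI-style API assumed)
curl -s http://localhost:2000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mistral", "messages": [{"role": "user", "content": "Hello!"}]}'
```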
LLaMa3/LLaMa3.1 are supported in #67. Tested case: 65 tokens/s on an A100 (BF16).
We have added support for quantized models; refer to #77.
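A sketch of what serving a quantized model might look like; the `--quant` flag, its value, and the model subcommand are assumptions and may not match the interface actually introduced in #77:

```shell
# Hypothetical: quantize BF16 weights to a 4-bit GGML format in situ while loading
cargo run --release -- --port 2000 --weight-path /path/to/Meta-Llama-3-8B-Instruct/ llama3 --quant q4k
```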
@guoqingbao nice work with #77!
I'm planning to parallelize the model loading process, specifically for in-situ quantization. The current strategy of loading model weights (borrowed from candle) layer by layer is unnecessary and inefficient.
Opening this issue to track the progress of model support in candle-vllm.