Support chat serving for more models #44

guoqingbao · 2024-07-02T02:38:41Z

Open this issue for tracking the progress of models supported in candle-vllm.

guoqingbao · 2024-07-02T03:02:44Z

Phi3 model is added in this PR #45

Command line to run Phi3 3.8B chat service

cargo run --release -- --port 2000 --weight-path /home/phi3-3.8b/ phi3 --repeat-last-n 64

It uses mixed precision (F32 for rope/rmsnorm & BF16 for others) for long sequence generation (e.g., prompt over 2k tokens). Tested speed on A100: 99 tokens/s for decoding

You may run Phi3 7B with different weight-path since the pipeline loads models using the corresponding config.json (I haven't tested Phi3 7B, but it should be workable in theory).

guoqingbao · 2024-07-04T10:04:44Z

Qwen2 model is added in this PR #46

Command line to run Qwen2 1.8B chat service

cargo run --release -- --port 2000 --weight-path /home/qwen2-1.8b/ qwen2 --repeat-last-n 64

or

cargo run --release -- --port 2000 --model-id Qwen/Qwen1.5-1.8B-Chat qwen2 --repeat-last-n 64

Tested speed on A100: ~150 tokens/s for decoding

guoqingbao · 2024-07-15T08:50:35Z

Mistral, Yi and StableLM are supported in #53 #57

Running cases:

cargo run --release -- --port 2000 --weight-path /home/mistral_7b/ mistral --repeat-last-n 32 --penalty 1.1 --
temperature 0.8

cargo run --release -- --port 2000 --weight-path /home/yi-6b/ yi --repeat-last-n 32

cargo run --release -- --port 2000 --weight-path /home/stablelm-zephyr-3b/ stable-lm --repeat-last-n 32

guoqingbao · 2024-07-26T03:49:27Z

LLaMa3/LLaMa3.1 supported in #67

Tested case:

cargo run --release -- --port 2000 --weight-path /home/Meta-Llama-3.1-8B-Instruct/ llama3 --repeat-last-n 64

65 tokens/s on A100 (BF16).

guoqingbao · 2024-08-13T13:32:52Z

We have added support for quantized models, refer to #77

EricLBuehler · 2024-08-14T16:52:29Z

@guoqingbao nice work with #77!

guoqingbao · 2024-08-15T07:10:18Z

@guoqingbao nice work with #77!

I'm planning to parallelize the model loading process, specifically for in-situ quantization. The current strategy of loading model weights (bought from candle) layer by layer is unnecessary and inefficient.

guoqingbao mentioned this issue Jul 2, 2024

Unified pipeline for models & support phi3 model #45

Merged

guoqingbao mentioned this issue Jul 4, 2024

Support qwen2 model, optimize phi3 model, revise model loading strategy #46

Merged

guoqingbao self-assigned this Jul 19, 2024

guoqingbao added the enhancement New feature or request label Jul 19, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support chat serving for more models #44

Support chat serving for more models #44

guoqingbao commented Jul 2, 2024

guoqingbao commented Jul 2, 2024

guoqingbao commented Jul 4, 2024

guoqingbao commented Jul 15, 2024

guoqingbao commented Jul 26, 2024

guoqingbao commented Aug 13, 2024

EricLBuehler commented Aug 14, 2024

guoqingbao commented Aug 15, 2024

Support chat serving for more models #44

Support chat serving for more models #44

Comments

guoqingbao commented Jul 2, 2024

guoqingbao commented Jul 2, 2024

guoqingbao commented Jul 4, 2024

guoqingbao commented Jul 15, 2024

guoqingbao commented Jul 26, 2024

guoqingbao commented Aug 13, 2024

EricLBuehler commented Aug 14, 2024

guoqingbao commented Aug 15, 2024