Kuntai disagg refactor #9

Merged: 501 commits, merged on Sep 16, 2024

Commits

f1df5db
[Misc] Update `marlin` to use vLLMParameters (#7803)
dsikka Aug 23, 2024
09c7792
Bump version to v0.5.5 (#7823)
simon-mo Aug 23, 2024
9db93de
[Core] Add multi-step support to LLMEngine (#7789)
alexm-neuralmagic Aug 23, 2024
6885fde
[Bugfix] Fix run_batch logger (#7640)
pooyadavoodi Aug 23, 2024
8da48e4
[Frontend] Publish Prometheus metrics in run_batch API (#7641)
pooyadavoodi Aug 24, 2024
d81abef
[Frontend] add json_schema support from OpenAI protocol (#7654)
rockwotj Aug 24, 2024
7d9ffa2
[misc][core] lazy import outlines (#7831)
youkaichao Aug 24, 2024
ea9fa16
[ci][test] exclude model download time in server start time (#7834)
youkaichao Aug 24, 2024
aab0fcd
[ci][test] fix RemoteOpenAIServer (#7838)
youkaichao Aug 24, 2024
80162c4
[Bugfix] Fix Phi-3v crash when input images are of certain sizes (#7840)
zifeitong Aug 25, 2024
8aaf3d5
[Model][VLM] Support multi-images inputs for Phi-3-vision models (#7…
Isotr0py Aug 25, 2024
2059b8d
[Misc] Remove snapshot_download usage in InternVL2 test (#7835)
Isotr0py Aug 25, 2024
70c094a
[misc][cuda] improve pynvml warning (#7852)
youkaichao Aug 25, 2024
1856aff
[Spec Decoding] Streamline batch expansion tensor manipulation (#7851)
njhill Aug 25, 2024
0b76999
[Bugfix]: Use float32 for base64 embedding (#7855)
HollowMan6 Aug 26, 2024
029c71d
[CI/Build] Avoid downloading all HF files in `RemoteOpenAIServer` (#7…
DarkLight1337 Aug 26, 2024
2deb029
[Performance][BlockManagerV2] Mark prefix cache block as computed aft…
comaniac Aug 26, 2024
6653040
[Misc] Update `qqq` to use vLLMParameters (#7805)
dsikka Aug 26, 2024
dd9857f
[Misc] Update `gptq_marlin_24` to use vLLMParameters (#7762)
dsikka Aug 26, 2024
05826c8
[misc] fix custom allreduce p2p cache file generation (#7853)
youkaichao Aug 26, 2024
760e9f7
[Bugfix] neuron: enable tensor parallelism (#7562)
omrishiv Aug 26, 2024
015e6cc
[Misc] Update compressed tensors lifecycle to remove `prefix` from `c…
dsikka Aug 27, 2024
2eedede
[Core] Asynchronous Output Processor (#7049)
megha95 Aug 27, 2024
39178c7
[Tests] Disable retries and use context manager for openai client (#7…
njhill Aug 27, 2024
64cc644
[core][torch.compile] discard the compile for profiling (#7796)
youkaichao Aug 27, 2024
9606c71
Revert #7509 (#7887)
comaniac Aug 27, 2024
6fc4e6e
[Model] Add Mistral Tokenization to improve robustness and chat encod…
patrickvonplaten Aug 27, 2024
9db6421
[CI/Build][VLM] Cleanup multiple images inputs model test (#7897)
Isotr0py Aug 27, 2024
076169f
[Hardware][Intel GPU] Add intel GPU pipeline parallel support. (#7810)
jikunshang Aug 27, 2024
42e932c
[CI/Build][ROCm] Enabling tensorizer tests for ROCm (#7237)
alexeykondrat Aug 27, 2024
b09c755
[Bugfix] Fix phi3v incorrect image_idx when using async engine (#7916)
Isotr0py Aug 27, 2024
ed6f002
[cuda][misc] error on empty CUDA_VISIBLE_DEVICES (#7924)
youkaichao Aug 27, 2024
fc91188
[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7766)
dsikka Aug 27, 2024
345be0e
[benchmark] Update TGI version (#7917)
philschmid Aug 27, 2024
5340a2d
[Model] Add multi-image input support for LLaVA-Next offline inferenc…
zifeitong Aug 27, 2024
9c71c97
[mypy] Enable mypy type checking for `vllm/core` (#7229)
jberkhahn Aug 27, 2024
fab5f53
[Core][VLM] Stack multimodal tensors to represent multiple images wit…
petersalas Aug 28, 2024
bc6e42a
[hardware][rocm] allow rocm to override default env var (#7926)
youkaichao Aug 28, 2024
c166e7e
[Bugfix] Allow ScalarType to be compiled with pytorch 2.3 and add che…
bnellnm Aug 28, 2024
51f86bf
[mypy][CI/Build] Fix mypy errors (#7929)
DarkLight1337 Aug 28, 2024
f508e03
[Core] Async_output_proc: Add virtual engine support (towards pipelin…
alexm-neuralmagic Aug 28, 2024
e358053
[Performance] Enable chunked prefill and prefix caching together (#7753)
comaniac Aug 28, 2024
f52a43a
[ci][test] fix pp test failure (#7945)
youkaichao Aug 28, 2024
98c12cf
[Doc] fix the autoAWQ example (#7937)
stas00 Aug 28, 2024
ef9baee
[Bugfix][VLM] Fix incompatibility between #7902 and #7230 (#7948)
DarkLight1337 Aug 28, 2024
b98cc28
[Core][Kernels] Use FlashInfer backend for FP8 KV Cache when availabl…
pavanimajety Aug 28, 2024
e5697d1
[Kernel] [Triton] [AMD] Adding Triton implementations awq_dequantize …
rasmith Aug 28, 2024
eeffde1
[TPU] Upgrade PyTorch XLA nightly (#7967)
WoosukKwon Aug 28, 2024
8c56e57
[Doc] fix 404 link (#7966)
stas00 Aug 28, 2024
fdd9daa
[Kernel/Model] Migrate mamba_ssm and causal_conv1d kernels to vLLM (#…
mzusman Aug 28, 2024
3cdfe1f
[Bugfix] Make torch registration of punica ops optional (#7970)
bnellnm Aug 28, 2024
ce6bf3a
[torch.compile] avoid Dynamo guard evaluation overhead (#7898)
youkaichao Aug 28, 2024
af59df0
Remove faulty Meta-Llama-3-8B-Instruct-FP8.yaml lm-eval test (#7961)
mgoin Aug 28, 2024
4289cad
[Frontend] Minor optimizations to zmq decoupled front-end (#7957)
njhill Aug 29, 2024
a7f65c2
[torch.compile] remove reset (#7975)
youkaichao Aug 29, 2024
74d5543
[VLM][Core] Fix exceptions on ragged NestedTensors (#7974)
petersalas Aug 29, 2024
ef99a78
Revert "[Core][Kernels] Use FlashInfer backend for FP8 KV Cache when …
youkaichao Aug 29, 2024
f205c09
[Bugfix] Unify rank computation across regular decoding and speculati…
jmkuebler Aug 29, 2024
3f60f22
[Core] Combine async postprocessor and multi-step (#7921)
alexm-neuralmagic Aug 29, 2024
6b34215
[Core][Kernels] Enable FP8 KV Cache with Flashinfer backend. + BugFi…
pavanimajety Aug 29, 2024
c334b18
extend cuda graph size for H200 (#7894)
kushanam Aug 29, 2024
d78789a
[Bugfix] Fix incorrect vocal embedding shards for GGUF model in tenso…
Isotr0py Aug 29, 2024
86a677d
[misc] update tpu int8 to use new vLLM Parameters (#7973)
dsikka Aug 29, 2024
257afc3
[Neuron] Adding support for context-lenght, token-gen buckets. (#7885)
hbikki Aug 29, 2024
4664cea
support bitsandbytes 8-bit and FP4 quantized models (#7445)
chenqianfzh Aug 29, 2024
0c785d3
Add more percentiles and latencies (#7759)
wschin Aug 29, 2024
4abed65
[VLM] Disallow overflowing `max_model_len` for multimodal models (#7998)
DarkLight1337 Aug 30, 2024
428dd14
[Core] Logprobs support in Multi-step (#7652)
afeldman-nm Aug 30, 2024
80c7b08
[TPU] Async output processing for TPU (#8011)
WoosukKwon Aug 30, 2024
34a0e96
[Kernel] changing fused moe kernel chunk size default to 32k (#7995)
avshalomman Aug 30, 2024
dc13e99
[MODEL] add Exaone model support (#7819)
nayohan Aug 30, 2024
2148441
[TPU] Support single and multi-host TPUs on GKE (#7613)
richardsliu Aug 30, 2024
afd39a4
[Bugfix] Fix import error in Exaone model (#8034)
DarkLight1337 Aug 30, 2024
f97be32
[VLM][Model] TP support for ViTs (#7186)
ChristopherCho Aug 30, 2024
98cef6a
[Core] Increase default `max_num_batched_tokens` for multimodal model…
DarkLight1337 Aug 30, 2024
058344f
[Frontend]-config-cli-args (#7737)
KaunilD Aug 30, 2024
2684efc
[TPU][Bugfix] Fix tpu type api (#8035)
WoosukKwon Aug 30, 2024
1248e85
[Model] Adding support for MSFT Phi-3.5-MoE (#7729)
wenxcs Aug 30, 2024
622f8ab
[Bugfix] bugfix and add model test for flashinfer fp8 kv cache. (#8013)
pavanimajety Aug 31, 2024
d05f0a9
[Bugfix] Fix import error in Phi-3.5-MoE (#8052)
DarkLight1337 Aug 31, 2024
4f5d844
[Bugfix] Fix ModelScope models in v0.5.5 (#8037)
NickLucche Aug 31, 2024
8423aef
[BugFix][Core] Multistep Fix Crash on Request Cancellation (#8059)
robertgshaw2-neuralmagic Aug 31, 2024
5231f08
[Frontend][VLM] Add support for multiple multi-modal items (#8049)
ywang96 Aug 31, 2024
5b86b19
[Misc] Optional installation of audio related packages (#8063)
ywang96 Sep 1, 2024
f8d6014
[Model] Add Granite model (#7436)
shawntan Sep 2, 2024
e6a26ed
[SpecDecode][Kernel] Flashinfer Rejection Sampling (#7244)
LiuXiaoxuanPKU Sep 2, 2024
e2b2aa5
[TPU] Align worker index with node boundary (#7932)
WoosukKwon Sep 2, 2024
4ca65a9
[Core][Bugfix] Accept GGUF model without .gguf extension (#8056)
Isotr0py Sep 2, 2024
dd2a6a8
[Bugfix] Fix internlm2 tensor parallel inference (#8055)
Isotr0py Sep 2, 2024
6e36f4f
improve chunked prefill performance
noooop Sep 2, 2024
0fbc669
[Bugfix] Fix single output condition in output processor (#7881)
WoosukKwon Sep 3, 2024
ec26653
[Bugfix][VLM] Add fallback to SDPA for ViT model running on CPU backe…
Isotr0py Sep 3, 2024
bd852f2
[Performance] Enable chunked prefill and prefix caching together (#8120)
comaniac Sep 3, 2024
95a178f
[CI] Only PR reviewers/committers can trigger CI on PR (#8124)
khluu Sep 3, 2024
6d646d0
[Core] Optimize Async + Multi-step (#8050)
alexm-neuralmagic Sep 3, 2024
652c83b
[Misc] Raise a more informative exception in add/remove_logger (#7750)
Yard1 Sep 3, 2024
c02638e
[CI/Build] make pip install vllm work in macos (for import only) (#8118)
tomeras91 Sep 3, 2024
f1575dc
[ci] Fix GHA workflow (#8129)
khluu Sep 3, 2024
0af3abe
[TPU][Bugfix] Fix next_token_ids shape (#8128)
WoosukKwon Sep 3, 2024
dc0b606
[CI] Change PR remainder to avoid at-mentions (#8134)
simon-mo Sep 3, 2024
2188a60
[Misc] Update `GPTQ` to use `vLLMParameters` (#7976)
dsikka Sep 3, 2024
d4db9f5
[Benchmark] Add `--async-engine` option to benchmark_throughput.py (#…
njhill Sep 4, 2024
61f4a93
[TPU][Bugfix] Use XLA rank for persistent cache path (#8137)
WoosukKwon Sep 4, 2024
e16fa99
[Misc] Update fbgemmfp8 to use `vLLMParameters` (#7972)
dsikka Sep 4, 2024
2be8ec6
[Model] Add Ultravox support for multiple audio chunks (#7963)
petersalas Sep 4, 2024
855c262
[Frontend] Multimodal support in offline chat (#8098)
DarkLight1337 Sep 4, 2024
ccd7207
chore: Update check-wheel-size.py to read MAX_SIZE_MB from env (#8103)
haitwang-cloud Sep 4, 2024
d331156
[Bugfix] remove post_layernorm in siglip (#8106)
wnma3mz Sep 4, 2024
2ad2e56
[MISC] Consolidate FP8 kv-cache tests (#8131)
comaniac Sep 4, 2024
d1dec64
[CI/Build][ROCm] Enabling LoRA tests on ROCm (#7369)
alexeykondrat Sep 4, 2024
561d6f8
[CI] Change test input in Gemma LoRA test (#8163)
WoosukKwon Sep 4, 2024
e02ce49
[Feature] OpenAI-Compatible Tools API + Streaming for Hermes & Mistra…
K-Mistele Sep 4, 2024
77d9e51
[MISC] Replace input token throughput with total token throughput (#8…
comaniac Sep 4, 2024
008cf88
[Neuron] Adding support for adding/ overriding neuron configuration a…
hbikki Sep 4, 2024
32e7db2
Bump version to v0.6.0 (#8166)
simon-mo Sep 4, 2024
e01c2be
[Doc] [Misc] Create CODE_OF_CONDUCT.md (#8161)
mmcelaney Sep 4, 2024
1afc931
[bugfix] >1.43 constraint for openai (#8169)
SolitaryThinker Sep 5, 2024
4624d98
[Misc] Clean up RoPE forward_native (#8076)
WoosukKwon Sep 5, 2024
ba262c4
[ci] Mark LoRA test as soft-fail (#8160)
khluu Sep 5, 2024
e39ebf5
[Core/Bugfix] Add query dtype as per FlashInfer API requirements. (#8…
elfiegg Sep 5, 2024
288a938
[Doc] Indicate more information about supported modalities (#8181)
DarkLight1337 Sep 5, 2024
8685ba1
Inclusion of InternVLChatModel In PP_SUPPORTED_MODELS(Pipeline Parall…
Manikandan-Thangaraj-ZS0321 Sep 5, 2024
9da25a8
[MODEL] Qwen Multimodal Support (Qwen-VL / Qwen-VL-Chat) (#8029)
alex-jw-brooks Sep 5, 2024
2ee4528
Move verify_marlin_supported to GPTQMarlinLinearMethod (#8165)
mgoin Sep 5, 2024
2febcf2
[Documentation][Spec Decode] Add documentation about lossless guarant…
sroy745 Sep 5, 2024
db3bf7c
[Core] Support load and unload LoRA in api server (#6566)
Jeffwan Sep 6, 2024
baa5467
[BugFix] Fix Granite model configuration (#8216)
njhill Sep 6, 2024
e5cab71
[Frontend] Add --logprobs argument to `benchmark_serving.py` (#8191)
afeldman-nm Sep 6, 2024
de80783
[Misc] Use ray[adag] dependency instead of cuda (#7938)
ruisearch42 Sep 6, 2024
1447c97
[CI/Build] Increasing timeout for multiproc worker tests (#8203)
alexeykondrat Sep 6, 2024
9db52ea
[Kernel] [Triton] Memory optimization for awq_gemm and awq_dequantize…
rasmith Sep 6, 2024
23f3222
[Misc] Remove `SqueezeLLM` (#8220)
dsikka Sep 6, 2024
29f49cd
[Model] Allow loading from original Mistral format (#8168)
patrickvonplaten Sep 6, 2024
12dd715
[misc] [doc] [frontend] LLM torch profiler support (#7943)
SolitaryThinker Sep 7, 2024
41e95c5
[Bugfix] Fix Hermes tool call chat template bug (#8256)
K-Mistele Sep 7, 2024
2f707fc
[Model] Multi-input support for LLaVA (#8238)
DarkLight1337 Sep 7, 2024
795b662
Enable Random Prefix Caching in Serving Profiling Tool (benchmark_ser…
wschin Sep 7, 2024
ce2702a
[tpu][misc] fix typo (#8260)
youkaichao Sep 7, 2024
9f68e00
[Bugfix] Fix broken OpenAI tensorizer test (#8258)
DarkLight1337 Sep 7, 2024
e807125
[Model][VLM] Support multi-images inputs for InternVL2 models (#8201)
Isotr0py Sep 7, 2024
36bf815
[Model][VLM] Decouple weight loading logic for `Paligemma` (#8269)
Isotr0py Sep 7, 2024
b962ee1
ppc64le: Dockerfile fixed, and a script for buildkite (#8026)
sumitd2 Sep 7, 2024
cfe712b
[CI/Build] Use python 3.12 in cuda image (#8133)
joerunde Sep 7, 2024
4ef41b8
[Bugfix] Fix async postprocessor in case of preemption (#8267)
alexm-neuralmagic Sep 8, 2024
08287ef
[Bugfix] Streamed tool calls now more strictly follow OpenAI's format…
K-Mistele Sep 9, 2024
58fcc85
[Frontend] Add progress reporting to run_batch.py (#8060)
alugowski Sep 9, 2024
f9b4a2d
[Bugfix] Correct adapter usage for cohere and jamba (#8292)
vladislavkruglikov Sep 9, 2024
c7cb5c3
[Misc] GPTQ Activation Ordering (#8135)
kylesayrs Sep 9, 2024
6cd5e5b
[Misc] Fused MoE Marlin support for GPTQ (#8217)
dsikka Sep 10, 2024
a1d8742
Add NVIDIA Meetup slides, announce AMD meetup, and add contact info (…
simon-mo Sep 10, 2024
da1a844
[Bugfix] Fix missing `post_layernorm` in CLIP (#8155)
DarkLight1337 Sep 10, 2024
6234385
[CI/Build] enable ccache/scccache for HIP builds (#8327)
dtrifiro Sep 10, 2024
8c054b7
[Frontend] Clean up type annotations for mistral tokenizer (#8314)
DarkLight1337 Sep 10, 2024
f421f3c
[CI/Build] Enabling kernels tests for AMD, ignoring some of then that…
alexeykondrat Sep 10, 2024
02751a7
Fix ppc64le buildkite job (#8309)
sumitd2 Sep 10, 2024
5faedf1
[Spec Decode] Move ops.advance_step to flash attn advance_step (#8224)
kevin314 Sep 10, 2024
04e7c4e
[Misc] remove peft as dependency for prompt models (#8162)
prashantgupta24 Sep 10, 2024
b1f3e18
[MISC] Keep chunked prefill enabled by default with long context when…
comaniac Sep 10, 2024
22f3a4b
[Bugfix] lookahead block table with cuda graph max capture (#8340)
alexm-neuralmagic Sep 10, 2024
1d5e397
[Core/Bugfix] pass VLLM_ATTENTION_BACKEND to ray workers (#8172)
SolitaryThinker Sep 10, 2024
94144e7
[CI/Build][Kernel] Update CUTLASS to 3.5.1 tag (#8043)
tlrmchlsmth Sep 10, 2024
e497b8a
[Misc] Skip loading extra bias for Qwen2-MOE GPTQ models (#8329)
jeejeelee Sep 11, 2024
1230263
[Bugfix] Fix InternVL2 vision embeddings process with pipeline parall…
Isotr0py Sep 11, 2024
efcf946
[Hardware][NV] Add support for ModelOpt static scaling checkpoints. (…
pavanimajety Sep 11, 2024
6a512a0
[model] Support for Llava-Next-Video model (#7559)
TKONIY Sep 11, 2024
cea95df
[Frontend] Create ErrorResponse instead of raising exceptions in run_…
pooyadavoodi Sep 11, 2024
3b7fea7
[Model][VLM] Add Qwen2-VL model support (#7905)
fyabc Sep 11, 2024
0b952af
[Hardware][Intel] Support compressed-tensor W8A8 for CPU backend (#7257)
bigPYJ1151 Sep 11, 2024
aea02f3
[CI/Build] Excluding test_moe.py from AMD Kernels tests for investiga…
alexeykondrat Sep 11, 2024
7015417
[Bugfix] Add missing attributes in mistral tokenizer (#8364)
DarkLight1337 Sep 11, 2024
73202db
[Kernel][Misc] register ops to prevent graph breaks (#6917)
bnellnm Sep 11, 2024
8baa454
[Misc] Move device options to a single place (#8322)
akx Sep 11, 2024
775f00f
[Speculative Decoding] Test refactor (#8317)
LiuXiaoxuanPKU Sep 11, 2024
d394787
Pixtral (#8377)
patrickvonplaten Sep 11, 2024
3fd2b0d
Bump version to v0.6.1 (#8379)
simon-mo Sep 11, 2024
a65cb16
[MISC] Dump model runner inputs when crashing (#8305)
comaniac Sep 12, 2024
f842a7a
[misc] remove engine_use_ray (#8126)
youkaichao Sep 12, 2024
b71c956
[TPU] Use Ray for default distributed backend (#8389)
WoosukKwon Sep 12, 2024
b6c75e1
Fix the AMD weight loading tests (#8390)
mgoin Sep 12, 2024
5a60699
[Bugfix]: Fix the logic for deciding if tool parsing is used (#8366)
tomeras91 Sep 12, 2024
1bf2dd9
[Gemma2] add bitsandbytes support for Gemma2 (#8338)
blueyo0 Sep 12, 2024
295c473
[Misc] Raise error when using encoder/decoder model with cpu backend …
kevin314 Sep 12, 2024
42ffba1
[Misc] Use RoPE cache for MRoPE (#8396)
WoosukKwon Sep 12, 2024
7de49aa
[torch.compile] hide slicing under custom op for inductor (#8384)
youkaichao Sep 12, 2024
520ca38
[Hotfix][VLM] Fixing max position embeddings for Pixtral (#8399)
ywang96 Sep 12, 2024
e56bf27
[Bugfix] Fix InternVL2 inference with various num_patches (#8375)
Isotr0py Sep 12, 2024
c6202da
[Model] Support multiple images for qwen-vl (#8247)
alex-jw-brooks Sep 12, 2024
8a23e93
[BugFix] lazy init _copy_stream to avoid torch init wrong gpu instanc…
lnykww Sep 12, 2024
1f0c75a
[BugFix] Fix Duplicate Assignment in Hermes2ProToolParser (#8423)
vegaluisjose Sep 12, 2024
f2e263b
[Bugfix] Offline mode fix (#8376)
joerunde Sep 12, 2024
a6c0f36
[multi-step] add flashinfer backend (#7928)
SolitaryThinker Sep 12, 2024
551ce01
[Core] Add engine option to return only deltas or final output (#7381)
njhill Sep 12, 2024
0198772
[Bugfix] multi-step + flashinfer: ensure cuda graph compatible (#8427)
alexm-neuralmagic Sep 12, 2024
c163694
[Hotfix][Core][VLM] Disable chunked prefill by default and prefix cac…
ywang96 Sep 12, 2024
b61bd98
[CI/Build] Disable multi-node test for InternVL2 (#8428)
ywang96 Sep 12, 2024
d31174a
[Hotfix][Pixtral] Fix multiple images bugs (#8415)
patrickvonplaten Sep 12, 2024
a480939
[Bugfix] Fix weight loading issue by rename variable. (#8293)
wenxcs Sep 12, 2024
360ddbd
[Misc] Update Pixtral example (#8431)
ywang96 Sep 13, 2024
8f44a92
[BugFix] fix group_topk (#8430)
dsikka Sep 13, 2024
5ec9c0f
[Core] Factor out input preprocessing to a separate class (#7329)
DarkLight1337 Sep 13, 2024
40c3965
[Bugfix] Mapping physical device indices for e2e test utils (#8290)
ShangmingCai Sep 13, 2024
3f79bc3
[Bugfix] Bump fastapi and pydantic version (#8435)
DarkLight1337 Sep 13, 2024
8427550
[CI/Build] Update pixtral tests to use JSON (#8436)
DarkLight1337 Sep 13, 2024
6821020
[Bugfix] Fix async log stats (#8417)
alexm-neuralmagic Sep 13, 2024
ba77527
[bugfix] torch profiler bug for single gpu with GPUExecutor (#8354)
SolitaryThinker Sep 13, 2024
acda0b3
bump version to v0.6.1.post1 (#8440)
simon-mo Sep 13, 2024
9b4a3b2
[CI/Build] Enable InternVL2 PP test only on single node (#8437)
Isotr0py Sep 13, 2024
cab69a1
[doc] recommend pip instead of conda (#8446)
youkaichao Sep 13, 2024
06311e2
[Misc] Skip loading extra bias for Qwen2-VL GPTQ-Int8 (#8442)
jeejeelee Sep 13, 2024
a246912
[misc][ci] fix quant test (#8449)
youkaichao Sep 13, 2024
ecd7a1d
[Installation] Gate FastAPI version for Python 3.8 (#8456)
DarkLight1337 Sep 13, 2024
0a4806f
[plugin][torch.compile] allow to add custom compile backend (#8445)
youkaichao Sep 13, 2024
a84e598
[CI/Build] Reorganize models tests (#7820)
DarkLight1337 Sep 13, 2024
f57092c
[Doc] Add oneDNN installation to CPU backend documentation (#8467)
Isotr0py Sep 13, 2024
18e9e1f
[HotFix] Fix final output truncation with stop string + streaming (#8…
njhill Sep 13, 2024
9ba0817
bump version to v0.6.1.post2 (#8473)
simon-mo Sep 13, 2024
8517252
[Hardware][intel GPU] bump up ipex version to 2.3 (#8365)
jikunshang Sep 13, 2024
1ef0d2e
[Kernel][Hardware][Amd]Custom paged attention kernel for rocm (#8310)
charlifu Sep 14, 2024
8a0cf1d
[Model] support minicpm3 (#8297)
SUDA-HLT-ywfang Sep 14, 2024
a36e070
[torch.compile] fix functionalization (#8480)
youkaichao Sep 14, 2024
47790f3
[torch.compile] add a flag to disable custom op (#8488)
youkaichao Sep 14, 2024
50e9ec4
[TPU] Implement multi-step scheduling (#8489)
WoosukKwon Sep 14, 2024
3724d5f
[Bugfix][Model] Fix Python 3.8 compatibility in Pixtral model by upda…
chrisociepa Sep 15, 2024
1f47731
Merge pull request #7 from KuntaiDu/jiayi-dev-v2
KuntaiDu Sep 15, 2024
0dd3571
Merge pull request #8 from KuntaiDu/jiayi-dev-v2
KuntaiDu Sep 15, 2024
9c98d5f
resolve merge conflict
ApostaC Sep 15, 2024
515c47b
remove group coordinator import
ApostaC Sep 15, 2024
f166cf8
remove syntax bug
ApostaC Sep 15, 2024
f320518
update round robin proxy. Prior bash-based impl is buggy
ApostaC Sep 15, 2024
5b4a3e3
update docs for disagg overhead benchmark
ApostaC Sep 15, 2024
01b2fd3
use new round robin proxy in performance benchmark
ApostaC Sep 15, 2024
54bd11f
update
ApostaC Sep 15, 2024
b19f346
update benchmarking script
ApostaC Sep 15, 2024
cb7ff06
revert changes in model_runner.py --- no change needed for disagg pre…
ApostaC Sep 15, 2024
dd8c86d
no I was wrong
ApostaC Sep 15, 2024
4e8043c
update benchmark
ApostaC Sep 15, 2024
b51f891
remove sonnet 4x --- it can be automatically generated via benchmarki…
ApostaC Sep 15, 2024
168452f
revert change in flash attn and flash infer to clean up the diff
ApostaC Sep 15, 2024
784d905
update the example
ApostaC Sep 15, 2024
17d2505
make format checker happy
ApostaC Sep 15, 2024
36a382c
resolve circular import
ApostaC Sep 15, 2024
a0867dd
fix redundant import
ApostaC Sep 15, 2024
7f90903
rename to a shorter name
ApostaC Sep 15, 2024
5ca22fb
remove unnecessary file
ApostaC Sep 16, 2024
073642b
update kv transfer test
ApostaC Sep 16, 2024
70d6571
update tests
ApostaC Sep 16, 2024
4d6b00a
make fmt checker happy
ApostaC Sep 16, 2024
7c13e03
constraint the model length
ApostaC Sep 16, 2024
cf5b84c
adjust path
ApostaC Sep 16, 2024
eb751d6
add disagg prefill test to test pipeline
ApostaC Sep 16, 2024
35 changes: 21 additions & 14 deletions .buildkite/check-wheel-size.py
@@ -1,36 +1,43 @@
 import os
+import sys
 import zipfile
 
-MAX_SIZE_MB = 250
+# Read the VLLM_MAX_SIZE_MB environment variable, defaulting to 250 MB
+VLLM_MAX_SIZE_MB = int(os.environ.get('VLLM_MAX_SIZE_MB', 250))
 
 
 def print_top_10_largest_files(zip_file):
+    """Print the top 10 largest files in the given zip file."""
     with zipfile.ZipFile(zip_file, 'r') as z:
         file_sizes = [(f, z.getinfo(f).file_size) for f in z.namelist()]
         file_sizes.sort(key=lambda x: x[1], reverse=True)
         for f, size in file_sizes[:10]:
-            print(f"{f}: {size/(1024*1024)} MBs uncompressed.")
+            print(f"{f}: {size / (1024 * 1024):.2f} MBs uncompressed.")
 
 
 def check_wheel_size(directory):
+    """Check the size of .whl files in the given directory."""
     for root, _, files in os.walk(directory):
-        for f in files:
-            if f.endswith(".whl"):
-                wheel_path = os.path.join(root, f)
-                wheel_size = os.path.getsize(wheel_path)
-                wheel_size_mb = wheel_size / (1024 * 1024)
-                if wheel_size_mb > MAX_SIZE_MB:
-                    print(
-                        f"Wheel {wheel_path} is too large ({wheel_size_mb} MB) "
-                        f"compare to the allowed size ({MAX_SIZE_MB} MB).")
+        for file_name in files:
+            if file_name.endswith(".whl"):
+                wheel_path = os.path.join(root, file_name)
+                wheel_size_mb = os.path.getsize(wheel_path) / (1024 * 1024)
+                if wheel_size_mb > VLLM_MAX_SIZE_MB:
+                    print(f"Not allowed: Wheel {wheel_path} is larger "
+                          f"({wheel_size_mb:.2f} MB) than the limit "
+                          f"({VLLM_MAX_SIZE_MB} MB).")
                     print_top_10_largest_files(wheel_path)
                     return 1
                 else:
                     print(f"Wheel {wheel_path} is within the allowed size "
-                          f"({wheel_size_mb} MB).")
+                          f"({wheel_size_mb:.2f} MB).")
     return 0
 
 
 if __name__ == "__main__":
-    import sys
-    sys.exit(check_wheel_size(sys.argv[1]))
+    if len(sys.argv) < 2:
+        print("Usage: python check-wheel-size.py <directory>")
+        sys.exit(1)
+
+    directory = sys.argv[1]
+    sys.exit(check_wheel_size(directory))
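
A usage sketch for the updated script (the dist/ path and the 300 MB override are assumptions for illustration, not part of the diff):

# assumption: built wheels live under dist/; the env var overrides the default 250 MB cap
VLLM_MAX_SIZE_MB=300 python .buildkite/check-wheel-size.py dist/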
(file header not captured; another lm-eval-harness model config)
@@ -9,3 +9,4 @@ tasks:
     value: 0.664
 limit: 1000
 num_fewshot: 5
+trust_remote_code: True
4 changes: 2 additions & 2 deletions .buildkite/lm-eval-harness/configs/Meta-Llama-3-8B-QQQ.yaml
@@ -4,8 +4,8 @@ tasks:
 - name: "gsm8k"
   metrics:
   - name: "exact_match,strict-match"
-    value: 0.409
+    value: 0.419
   - name: "exact_match,flexible-extract"
-    value: 0.406
+    value: 0.416
 limit: 1000
 num_fewshot: 5
(file header not captured; Minitron-4B-Base lm-eval config)
@@ -1,11 +1,11 @@
-# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nvidia/Minitron-4B-Base -b auto -l 1000 -f 5 -t 1
-model_name: "nvidia/Minitron-4B-Base"
+# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m mgoin/Minitron-4B-Base-FP8 -b auto -l 1000 -f 5 -t 1
+model_name: "mgoin/Minitron-4B-Base-FP8"
 tasks:
 - name: "gsm8k"
   metrics:
   - name: "exact_match,strict-match"
-    value: 0.252
+    value: 0.233
   - name: "exact_match,flexible-extract"
-    value: 0.252
+    value: 0.236
 limit: 1000
 num_fewshot: 5
3 changes: 1 addition & 2 deletions .buildkite/lm-eval-harness/configs/models-small.txt
@@ -1,10 +1,9 @@
 Meta-Llama-3-8B-Instruct.yaml
-Meta-Llama-3-8B-Instruct-FP8.yaml
 Meta-Llama-3-8B-Instruct-FP8-compressed-tensors.yaml
 Meta-Llama-3-8B-Instruct-INT8-compressed-tensors.yaml
 Meta-Llama-3-8B-Instruct-nonuniform-compressed-tensors.yaml
 Meta-Llama-3-8B-Instruct-Channelwise-compressed-tensors.yaml
-Minitron-4B-Base.yaml
+Minitron-4B-Base-FP8.yaml
 Qwen2-1.5B-Instruct-INT8-compressed-tensors.yaml
 Qwen2-1.5B-Instruct-FP8W8.yaml
 Meta-Llama-3-8B-QQQ.yaml
7 changes: 5 additions & 2 deletions .buildkite/lm-eval-harness/test_lm_eval_correctness.py
@@ -14,7 +14,7 @@
 import numpy
 import yaml
 
-RTOL = 0.02
+RTOL = 0.05
 TEST_DATA_FILE = os.environ.get(
     "LM_EVAL_TEST_DATA_FILE",
     ".buildkite/lm-eval-harness/configs/Meta-Llama-3-8B-Instruct.yaml")
@@ -23,9 +23,12 @@
 
 
 def launch_lm_eval(eval_config):
+    trust_remote_code = eval_config.get('trust_remote_code', False)
+
     model_args = f"pretrained={eval_config['model_name']}," \
                  f"tensor_parallel_size={TP_SIZE}," \
-                 f"add_bos_token=true"
+                 f"add_bos_token=true," \
+                 f"trust_remote_code={trust_remote_code}"
 
     results = lm_eval.simple_evaluate(
         model="vllm",
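
A small sketch of the string this now builds for a config that sets trust_remote_code: True (the TP_SIZE value is illustrative; the model name is borrowed from the Minitron config above):

eval_config = {"model_name": "mgoin/Minitron-4B-Base-FP8", "trust_remote_code": True}
TP_SIZE = 1
trust_remote_code = eval_config.get('trust_remote_code', False)
model_args = f"pretrained={eval_config['model_name']}," \
             f"tensor_parallel_size={TP_SIZE}," \
             f"add_bos_token=true," \
             f"trust_remote_code={trust_remote_code}"
# model_args == "pretrained=mgoin/Minitron-4B-Base-FP8,tensor_parallel_size=1,add_bos_token=true,trust_remote_code=True"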
9 changes: 5 additions & 4 deletions .buildkite/nightly-benchmarks/README.md
@@ -34,17 +34,18 @@ See [vLLM performance dashboard](https://perf.vllm.ai) for the latest performan
 
 Performance benchmark will be triggered when:
 - A PR being merged into vllm.
-- Every commit for those PRs with `perf-benchmarks` label.
+- Every commit for those PRs with `perf-benchmarks` label AND `ready` label.
 
 Nightly benchmark will be triggered when:
-- Every commit for those PRs with `nightly-benchmarks` label.
+- Every commit for those PRs with `perf-benchmarks` label and `nightly-benchmarks` label.
 
 
 
 
 ## Performance benchmark details
 
-See [descriptions.md](tests/descriptions.md) for detailed descriptions, and use `tests/latency-tests.json`, `tests/throughput-tests.json`, `tests/serving-tests.json` to configure the test cases.
 
+See [performance-benchmarks-descriptions.md](performance-benchmarks-descriptions.md) for detailed descriptions, and use `tests/latency-tests.json`, `tests/throughput-tests.json`, `tests/serving-tests.json` to configure the test cases.
 
 
 #### Latency test
@@ -68,7 +69,7 @@ Here is an example of one test inside `latency-tests.json`:
 
 In this example:
 - The `test_name` attributes is a unique identifier for the test. In `latency-tests.json`, it must start with `latency_`.
-- The `parameters` attribute control the command line arguments to be used for `benchmark_latency.py`. Note that please use underline `_` instead of the dash `-` when specifying the command line arguments, and `run-benchmarks-suite.sh` will convert the underline to dash when feeding the arguments to `benchmark_latency.py`. For example, the corresponding command line arguments for `benchmark_latency.py` will be `--model meta-llama/Meta-Llama-3-8B --tensor-parallel-size 1 --load-format dummy --num-iters-warmup 5 --num-iters 15`
+- The `parameters` attribute control the command line arguments to be used for `benchmark_latency.py`. Note that please use underline `_` instead of the dash `-` when specifying the command line arguments, and `run-performance-benchmarks.sh` will convert the underline to dash when feeding the arguments to `benchmark_latency.py`. For example, the corresponding command line arguments for `benchmark_latency.py` will be `--model meta-llama/Meta-Llama-3-8B --tensor-parallel-size 1 --load-format dummy --num-iters-warmup 5 --num-iters 15`
 
 Note that the performance numbers are highly sensitive to the value of the parameters. Please make sure the parameters are set correctly.
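
The JSON example referenced above is collapsed in this view; a sketch consistent with the bullet's example command line and with the tests/latency-tests.json changes later in this PR looks like:

[
    {
        "test_name": "latency_llama8B_tp1",
        "parameters": {
            "model": "meta-llama/Meta-Llama-3-8B",
            "tensor_parallel_size": 1,
            "load_format": "dummy",
            "num_iters_warmup": 5,
            "num_iters": 15
        }
    }
]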

2 changes: 1 addition & 1 deletion .buildkite/nightly-benchmarks/benchmark-pipeline.yaml
@@ -21,7 +21,7 @@ steps:
       containers:
       - image: public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT
         command:
-        - bash .buildkite/nightly-benchmarks/run-benchmarks-suite.sh
+        - bash .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
         resources:
           limits:
             nvidia.com/gpu: 8
(file header not captured; nightly-benchmarks test descriptions document)
@@ -1,47 +1,42 @@
 
 ## Latency tests
 
-This test suite aims to test vllm's end-to-end latency under a controlled setup.
-
 - Input length: 32 tokens.
 - Output length: 128 tokens.
 - Batch size: fixed (8).
-- Models: llama-3 8B, llama-3 70B, mixtral 8x7B.
+- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
 - Evaluation metrics: end-to-end latency (mean, median, p99).
 
 ### Latency benchmarking results
 
 {latency_tests_markdown_table}
 
-## Throughput tests
-
-This test suite aims to test vllm's throughput.
+## Throughput tests
 
 - Input length: randomly sample 200 prompts from ShareGPT dataset (with fixed random seed).
 - Output length: the corresponding output length of these 200 prompts.
 - Batch size: dynamically determined by vllm to achieve maximum throughput.
-- Models: llama-3 8B, llama-3 70B, mixtral 8x7B.
+- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
 - Evaluation metrics: throughput.
 
 ### Throughput benchmarking results
 
 {throughput_tests_markdown_table}
 
-## Serving tests
-
-This test suite aims to test vllm's real serving metrics.
+## Serving tests
 
 - Input length: randomly sample 200 prompts from ShareGPT dataset (with fixed random seed).
 - Output length: the corresponding output length of these 200 prompts.
 - Batch size: dynamically determined by vllm and the arrival pattern of the requests.
 - **Average QPS (query per second)**: 1, 4, 16 and inf. QPS = inf means all requests come at once. For other QPS values, the arrival time of each query is determined using a random Poisson process (with fixed random seed).
-- Models: llama-3 8B, llama-3 70B, mixtral 8x7B.
+- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
 - We also added a speculative decoding test for llama-3 70B, under QPS 2
 - Evaluation metrics: throughput, TTFT (time to the first token, with mean, median and p99), ITL (inter-token latency, with mean, median and p99).
 
 ### Serving benchmarking results
 
 {serving_tests_markdown_table}
 
 
 ## json version of the benchmarking tables
 
 This section contains the data of the markdown tables above in JSON format.
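
To make the serving-test arrival pattern described above concrete, here is a small illustrative sketch of Poisson-process arrival times at a given QPS (not code from this PR; the actual benchmark_serving.py implementation may differ):

import numpy as np

def poisson_arrival_times(num_requests: int, qps: float, seed: int = 0):
    # QPS = inf means all requests arrive at time 0.
    if qps == float("inf"):
        return np.zeros(num_requests)
    rng = np.random.default_rng(seed)  # fixed random seed, as described above
    # Poisson process: exponential inter-arrival gaps with mean 1/qps seconds.
    gaps = rng.exponential(1.0 / qps, num_requests)
    return np.cumsum(gaps)

print(poisson_arrival_times(5, qps=4.0))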
(file header not captured; benchmark results-to-markdown conversion script)
@@ -174,8 +174,8 @@ def results_to_json(latency, throughput, serving):
     # document the result
     with open(results_folder / "benchmark_results.md", "w") as f:
 
-        results = read_markdown(
-            "../.buildkite/nightly-benchmarks/tests/descriptions.md")
+        results = read_markdown("../.buildkite/nightly-benchmarks/" +
+                                "performance-benchmarks-descriptions.md")
         results = results.format(
             latency_tests_markdown_table=latency_md_table,
             throughput_tests_markdown_table=throughput_md_table,
(file header not captured; nightly-benchmarks suite shell script)
@@ -37,9 +37,9 @@ check_hf_token() {
ensure_sharegpt_downloaded() {
local FILE=ShareGPT_V3_unfiltered_cleaned_split.json
if [ ! -f "$FILE" ]; then
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/$FILE
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/$FILE
else
echo "$FILE already exists."
echo "$FILE already exists."
fi
}

@@ -68,35 +68,38 @@ wait_for_server() {
done' && return 0 || return 1
}

kill_gpu_processes() {
# kill all processes on GPU.
pids=$(nvidia-smi --query-compute-apps=pid --format=csv,noheader)
if [ -z "$pids" ]; then
echo "No GPU processes found."
kill_processes_launched_by_current_bash() {
# Kill all python processes launched from current bash script
current_shell_pid=$$
processes=$(ps -eo pid,ppid,command | awk -v ppid="$current_shell_pid" -v proc="$1" '$2 == ppid && $3 ~ proc {print $1}')
if [ -n "$processes" ]; then
echo "Killing the following processes matching '$1':"
echo "$processes"
echo "$processes" | xargs kill -9
else
for pid in $pids; do
kill -9 "$pid"
echo "Killed process with PID: $pid"
done

echo "All GPU processes have been killed."
echo "No processes found matching '$1'."
fi
}

kill_gpu_processes() {

# waiting for GPU processes to be fully killed
# loop while nvidia-smi returns any processes
while [ -n "$(nvidia-smi --query-compute-apps=pid --format=csv,noheader)" ]; do
ps -aux
lsof -t -i:8000 | xargs -r kill -9
pkill -f pt_main_thread
# this line doesn't work now
# ps aux | grep python | grep openai | awk '{print $2}' | xargs -r kill -9
pkill -f python3
pkill -f /usr/bin/python3


# wait until GPU memory usage smaller than 1GB
while [ $(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | head -n 1) -ge 1000 ]; do
sleep 1
echo "Waiting for GPU processes to be killed"
done

# remove vllm config file
rm -rf ~/.config/vllm

# Print the GPU memory usage
# so that we know if all GPU processes are killed.
gpu_memory_usage=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits -i 0)
# The memory usage should be 0 MB.
echo "GPU 0 Memory Usage: $gpu_memory_usage MB"
}

upload_to_buildkite() {
@@ -114,7 +117,7 @@ upload_to_buildkite() {
fi

# Use the determined command to annotate and upload artifacts
$BUILDKITE_AGENT_COMMAND annotate --style "info" --context "$BUILDKITE_LABEL-benchmark-results" < $RESULTS_FOLDER/benchmark_results.md
$BUILDKITE_AGENT_COMMAND annotate --style "info" --context "$BUILDKITE_LABEL-benchmark-results" <$RESULTS_FOLDER/benchmark_results.md
$BUILDKITE_AGENT_COMMAND artifact upload "$RESULTS_FOLDER/*"
}

@@ -166,7 +169,7 @@ run_latency_tests() {
latency_command: $latency,
gpu_type: $gpu
}')
echo "$jq_output" > "$RESULTS_FOLDER/$test_name.commands"
echo "$jq_output" >"$RESULTS_FOLDER/$test_name.commands"

# run the benchmark
eval "$latency_command"
@@ -176,7 +179,6 @@ done
done
}


run_throughput_tests() {
# run throughput tests using `benchmark_throughput.py`
# $1: a json file specifying throughput test cases
@@ -224,7 +226,7 @@ run_throughput_tests() {
throughput_command: $command,
gpu_type: $gpu
}')
echo "$jq_output" > "$RESULTS_FOLDER/$test_name.commands"
echo "$jq_output" >"$RESULTS_FOLDER/$test_name.commands"

# run the benchmark
eval "$throughput_command"
@@ -256,7 +258,6 @@ run_serving_tests() {
continue
fi


# get client and server arguments
server_params=$(echo "$params" | jq -r '.server_parameters')
client_params=$(echo "$params" | jq -r '.client_parameters')
@@ -334,7 +335,7 @@ run_serving_tests() {
client_command: $client,
gpu_type: $gpu
}')
echo "$jq_output" > "$RESULTS_FOLDER/${new_test_name}.commands"
echo "$jq_output" >"$RESULTS_FOLDER/${new_test_name}.commands"

done

@@ -351,6 +352,7 @@ main() {
# dependencies
(which wget && which curl) || (apt-get update && apt-get install -y wget curl)
(which jq) || (apt-get update && apt-get -y install jq)
(which lsof) || (apt-get update && apt-get install -y lsof)

# get the current IP address, required by benchmark_serving.py
export VLLM_HOST_IP=$(hostname -I | awk '{print $1}')
@@ -369,7 +371,6 @@ main() {
run_latency_tests $QUICK_BENCHMARK_ROOT/tests/latency-tests.json
run_throughput_tests $QUICK_BENCHMARK_ROOT/tests/throughput-tests.json


# postprocess benchmarking results
pip install tabulate pandas
python3 $QUICK_BENCHMARK_ROOT/scripts/convert-results-json-to-markdown.py
4 changes: 2 additions & 2 deletions .buildkite/nightly-benchmarks/tests/latency-tests.json
@@ -2,7 +2,7 @@
     {
         "test_name": "latency_llama8B_tp1",
         "parameters": {
-            "model": "meta-llama/Meta-Llama-3-8B",
+            "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
             "tensor_parallel_size": 1,
             "load_format": "dummy",
             "num_iters_warmup": 5,
@@ -12,7 +12,7 @@
     {
         "test_name": "latency_llama70B_tp4",
         "parameters": {
-            "model": "meta-llama/Meta-Llama-3-70B-Instruct",
+            "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
            "tensor_parallel_size": 4,
             "load_format": "dummy",
             "num-iters-warmup": 5,