Upstream merge 24 10 21 #240

Merged: 299 commits, Oct 23, 2024

Commits
1cabfce
[Misc] Adjust max_position_embeddings for LoRA compatibility (#8957)
jeejeelee Sep 30, 2024
1425a1b
[ci] Add CODEOWNERS for test directories (#8795)
khluu Oct 1, 2024
bce3244
[CI][SpecDecode] Fix spec decode tests, use flash attention backend f…
LiuXiaoxuanPKU Oct 1, 2024
062c89e
[Frontend][Core] Move guided decoding params into sampling params (#8…
joerunde Oct 1, 2024
aaccca2
[CI/Build] Fix machete generated kernel files ordering (#8976)
khluu Oct 1, 2024
7da2487
[torch.compile] fix tensor alias (#8982)
youkaichao Oct 1, 2024
82f3937
[Misc] add process_weights_after_loading for DummyLoader (#8969)
divakar-amd Oct 1, 2024
bc4eb65
[Bugfix] Fix Fuyu tensor parallel inference (#8986)
Isotr0py Oct 1, 2024
1fe0a42
[Bugfix] Fix Token IDs Reference for MiniCPM-V When Images are Provid…
alex-jw-brooks Oct 1, 2024
35bd215
[Core] [Frontend] Priority scheduling for embeddings and in the OpenA…
schoennenbeck Oct 1, 2024
4f341bd
[Doc] Update list of supported models (#8987)
DarkLight1337 Oct 1, 2024
22f5851
Update benchmark_serving.py to read and write json-datasets, results …
vlsav Oct 1, 2024
1570203
[Spec Decode] (1/2) Remove batch expansion (#8839)
LiuXiaoxuanPKU Oct 1, 2024
563649a
[Core] Combined support for multi-step scheduling, chunked prefill & …
afeldman-nm Oct 2, 2024
7f60520
[Misc] Update Default Image Mapper Error Log (#8977)
alex-jw-brooks Oct 2, 2024
afb050b
[Core] CUDA Graphs for Multi-Step + Chunked-Prefill (#8645)
varun-sundar-rabindranath Oct 2, 2024
f58d4fc
[OpenVINO] Enable GPU support for OpenVINO vLLM backend (#8192)
sshlyapn Oct 2, 2024
19f0d25
[Model] Adding Granite MoE. (#8206)
shawntan Oct 3, 2024
18c2e30
[Doc] Update Granite model docs (#9025)
njhill Oct 3, 2024
19a4dd0
[Bugfix] example template should not add parallel_tool_prompt if tool…
tjohnson31415 Oct 3, 2024
01843c8
[Misc] log when using default MoE config (#8971)
divakar-amd Oct 3, 2024
83caf35
[BugFix] Enforce Mistral ToolCall id constraint when using the Mistra…
gcalmettes Oct 3, 2024
f5d72b2
[Core] Make BlockSpaceManagerV2 the default BlockManager to use. (#8678)
sroy745 Oct 3, 2024
63e3993
[Frontend] [Neuron] Parse literals out of override-neuron-config (#8959)
xendo Oct 3, 2024
9aaf14c
[misc] add forward context for attention (#9029)
youkaichao Oct 3, 2024
91add85
Fix failing spec decode test (#9054)
sroy745 Oct 3, 2024
2838d6b
[Bugfix] Weight loading fix for OPT model (#9042)
domenVres Oct 3, 2024
3dbb215
[Frontend][Feature] support tool calling for internlm/internlm2_5-7b-…
sydnash Oct 4, 2024
aeb37c2
[CI/Build] Per file CUDA Archs (improve wheel size and dev build time…
LucasWilkinson Oct 4, 2024
303d447
[Misc] Enable multi-step output streaming by default (#9047)
mgoin Oct 4, 2024
0f6d7a9
[Models] Add remaining model PP support (#7168)
andoorve Oct 4, 2024
0e36fd4
[Misc] Move registry to its own file (#9064)
DarkLight1337 Oct 4, 2024
3d826d2
[Bugfix] Reshape the dimensions of the input image embeddings in Qwen…
whyiug Oct 4, 2024
22482e4
[Bugfix] Flash attention arches not getting set properly (#9062)
LucasWilkinson Oct 4, 2024
9ade8bb
[Model] add a bunch of supported lora modules for mixtral (#9008)
prashantgupta24 Oct 4, 2024
36eecfb
Remove AMD Ray Summit Banner (#9075)
simon-mo Oct 4, 2024
e5dc713
[Hardware][PowerPC] Make oneDNN dependency optional for Power (#9039)
varad-ahirwadkar Oct 4, 2024
26aa325
[Core][VLM] Test registration for OOT multimodal models (#8717)
ywang96 Oct 4, 2024
0dcc8cb
Adds truncate_prompt_tokens param for embeddings creation (#8999)
flaviabeo Oct 4, 2024
05d6864
[Kernel] Zero point support in fused MarlinMoE kernel + AWQ Fused MoE…
ElizaWszola Oct 4, 2024
fbb7442
[CI] Update performance benchmark: upgrade trt-llm to r24.07, and add…
KuntaiDu Oct 4, 2024
05c531b
[Misc] Improved prefix cache example (#9077)
Imss27 Oct 4, 2024
0cc566c
[Misc] Add random seed for prefix cache benchmark (#9081)
Imss27 Oct 4, 2024
27302dd
[Misc] Fix CI lint (#9085)
comaniac Oct 4, 2024
cc90419
[Hardware][Neuron] Add on-device sampling support for Neuron (#8746)
chongmni-aws Oct 4, 2024
663874e
[torch.compile] improve allreduce registration (#9061)
youkaichao Oct 4, 2024
a95354a
[Doc] Update README.md with Ray summit slides (#9088)
zhuohan123 Oct 5, 2024
dac914b
[Bugfix] use blockmanagerv1 for encoder-decoder (#9084)
heheda12345 Oct 5, 2024
53b3a33
[Bugfix] Fixes Phi3v & Ultravox Multimodal EmbeddingInputs (#8979)
hhzhang16 Oct 5, 2024
15986f5
[Model] Support Gemma2 embedding model (#9004)
xyang16 Oct 5, 2024
cfadb9c
[Bugfix] Deprecate registration of custom configs to huggingface (#9083)
heheda12345 Oct 5, 2024
5df1834
[Bugfix] Fix order of arguments matters in config.yaml (#8960)
Imss27 Oct 5, 2024
f4dd830
[core] use forward context for flash infer (#9097)
youkaichao Oct 6, 2024
23fea87
[Bugfix] Fix try-catch conditions to import correct Flash Attention B…
tjtanaa Oct 6, 2024
168cab6
[Frontend] API support for beam search (#9087)
LunrEclipse Oct 6, 2024
f22619f
[Misc] Remove user-facing error for removed VLM args (#9104)
DarkLight1337 Oct 6, 2024
b22b798
[Model] PP support for embedding models and update docs (#9090)
DarkLight1337 Oct 6, 2024
fdf59d3
[Bugfix] fix tool_parser error handling when serve a model not suppor…
liuyanyi Oct 6, 2024
cb3b2b9
[Bugfix] Fix incorrect updates to num_computed_tokens in multi-step s…
varun-sundar-rabindranath Oct 6, 2024
487678d
[Bugfix][Hardware][CPU] Fix CPU model input for decode (#9044)
Isotr0py Oct 7, 2024
c8f26bb
[BugFix][Core] Fix BlockManagerV2 when Encoder Input is None (#9103)
sroy745 Oct 7, 2024
18b296f
[core] remove beam search from the core (#9105)
youkaichao Oct 7, 2024
8c6de96
[Model] Explicit interface for vLLM models and support OOT embedding …
DarkLight1337 Oct 7, 2024
4f95ffe
[Hardware][CPU] Cross-attention and Encoder-Decoder models support on…
Isotr0py Oct 7, 2024
f19da64
[Core] Refactor GGUF parameters packing and forwarding (#8859)
Isotr0py Oct 7, 2024
151ef4e
[Model] Support NVLM-D and fix QK Norm in InternViT (#9045)
DarkLight1337 Oct 7, 2024
93cf74a
[Doc]: Add deploying_with_k8s guide (#8451)
haitwang-cloud Oct 7, 2024
e0dbdb0
[CI/Build] Add linting for github actions workflows (#7876)
russellb Oct 7, 2024
c0d9a98
[Doc] Include performance benchmark in README (#9135)
KuntaiDu Oct 7, 2024
fa45513
[misc] fix comment and variable name (#9139)
youkaichao Oct 7, 2024
8eeb857
Add Slack to README (#9137)
simon-mo Oct 8, 2024
04c12f8
[misc] update utils to support comparing multiple settings (#9140)
youkaichao Oct 8, 2024
80b57f0
[Intel GPU] Fix xpu decode input (#9145)
jikunshang Oct 8, 2024
e1faa2a
[misc] improve ux on readme (#9147)
youkaichao Oct 8, 2024
8c74622
[Frontend] API support for beam search for MQLLMEngine (#9117)
LunrEclipse Oct 8, 2024
a3691b6
[Core][Frontend] Add Support for Inference Time mm_processor_kwargs (…
alex-jw-brooks Oct 8, 2024
1992aa8
Factor out common weight loading code
DarkLight1337 Oct 8, 2024
e81645d
Fix EAGLE model loading
DarkLight1337 Oct 8, 2024
069d3bd
[Frontend] Add Early Validation For Chat Template / Tool Call Parser …
alex-jw-brooks Oct 8, 2024
4ef043b
Improve efficiency
DarkLight1337 Oct 8, 2024
e723680
Rename
DarkLight1337 Oct 8, 2024
c60e921
Update LLaVA-NeXT-Video
DarkLight1337 Oct 8, 2024
cfba685
[CI/Build] Add examples folder into Docker image so that we can lever…
panpan0000 Oct 8, 2024
9a94ca4
[Bugfix] fix OpenAI API server startup with --disable-frontend-multip…
dtrifiro Oct 8, 2024
9f12890
Automatic loading and save memory
DarkLight1337 Oct 8, 2024
10b5b0e
Rename
DarkLight1337 Oct 8, 2024
ce08df5
Update docstring
DarkLight1337 Oct 8, 2024
df687ac
Simplify
DarkLight1337 Oct 8, 2024
98bf417
Cleanup
DarkLight1337 Oct 8, 2024
decc7a4
Fully enable recursive loading
DarkLight1337 Oct 8, 2024
e59201a
Clarify
DarkLight1337 Oct 8, 2024
1874c6a
[Doc] Update vlm.rst to include an example on videos (#9155)
sayakpaul Oct 8, 2024
f538ab9
Fix incorrect semantics
DarkLight1337 Oct 8, 2024
f077865
Move function
DarkLight1337 Oct 8, 2024
56e4a33
Update error message
DarkLight1337 Oct 8, 2024
85c63c8
Fix Ultravox loading
DarkLight1337 Oct 8, 2024
42a3253
spacing
DarkLight1337 Oct 8, 2024
b21ccdf
Merge remote-tracking branch 'upstream/main'
gshtras Oct 8, 2024
e5a7def
Merge remote-tracking branch 'upstream/main' into main
gshtras Oct 8, 2024
3e72cae
Merge remote-tracking branch 'upstream/fix-weight-loading' into main
gshtras Oct 8, 2024
674b2a5
Merge remote-tracking branch 'origin/main' into upstream_merge_24_10_08
gshtras Oct 8, 2024
de24046
[Doc] Improve contributing and installation documentation (#9132)
rafvasq Oct 8, 2024
390efcb
Fix server
gshtras Oct 8, 2024
bd37b9f
[Bugfix] Try to handle older versions of pytorch (#9086)
bnellnm Oct 8, 2024
8fa419f
Merge remote-tracking branch 'upstream/main' into upstream_merge_24_1…
gshtras Oct 8, 2024
2a13196
mypy: check additional directories (#9162)
russellb Oct 8, 2024
9ba0bd6
Add `lm-eval` directly to requirements-test.txt (#9161)
mgoin Oct 9, 2024
2f4117c
support bitsandbytes quantization with more models (#9148)
chenqianfzh Oct 9, 2024
ffc4b27
Add classifiers in setup.py (#9171)
terrytangyuan Oct 9, 2024
acce763
Update link to KServe deployment guide (#9173)
terrytangyuan Oct 9, 2024
480b7f4
[Misc] Improve validation errors around best_of and n (#9167)
tjohnson31415 Oct 9, 2024
7627172
[Bugfix][Doc] Report neuron error in output (#9159)
joerowell Oct 9, 2024
cdc72e3
[Model] Remap FP8 kv_scale in CommandR and DBRX (#9174)
hliuca Oct 9, 2024
0b5b5d7
[Frontend] Log the maximum supported concurrency (#8831)
AlpinDale Oct 9, 2024
8bfaa4e
[Bugfix] fix composite weight loading and EAGLE weight loading (#9160)
DarkLight1337 Oct 9, 2024
c8627cd
[ci][test] use load dummy for testing (#9165)
youkaichao Oct 9, 2024
dc4aea6
[Doc] Fix VLM prompt placeholder sample bug (#9170)
ycool Oct 9, 2024
21906a6
[Bugfix] Fix lora loading for Compressed Tensors in #9120 (#9179)
fahadh4ilyas Oct 9, 2024
cfaa600
[Bugfix] Access `get_vocab` instead of `vocab` in tool parsers (#9188)
DarkLight1337 Oct 9, 2024
7dea289
Add Dependabot configuration for GitHub Actions updates (#1217)
EwoutH Oct 9, 2024
ca77dd7
[Hardware][CPU] Support AWQ for CPU backend (#7515)
bigPYJ1151 Oct 9, 2024
cdca899
[CI/Build] mypy: check vllm/entrypoints (#9194)
russellb Oct 9, 2024
d5fbb87
[CI/Build] Update Dockerfile install+deploy image to ubuntu 22.04 (#9…
mgoin Oct 9, 2024
cf25b93
[Core] Fix invalid args to _process_request (#9201)
russellb Oct 10, 2024
de895f1
[misc] improve model support check in another process (#9208)
youkaichao Oct 10, 2024
ce00231
[Bugfix] Fix Weight Loading Multiple GPU Test - Large Models (#9213)
mgoin Oct 10, 2024
a64e7b9
[Bugfix] Machete garbage results for some models (large K dim) (#9212)
LucasWilkinson Oct 10, 2024
f3a507f
[Core] Add an environment variable which needs to be set explicitly t…
sroy745 Oct 10, 2024
07c11cf
[Bugfix] Fix lm_head weights tying with lora for llama (#9227)
Isotr0py Oct 10, 2024
04de905
[Model] support input image embedding for minicpmv (#9237)
whyiug Oct 10, 2024
83ea5c7
[OpenVINO] Use torch 2.4.0 and newer optimim version (#9121)
ilya-lavrenov Oct 10, 2024
18511ae
[Bugfix] Fix Machete unittests failing with `NotImplementedError` (#9…
LucasWilkinson Oct 10, 2024
055f327
[Doc] Improve debugging documentation (#9204)
rafvasq Oct 10, 2024
21efb60
[CI/Build] Make the `Dockerfile.cpu` file's `PIP_EXTRA_INDEX_URL` Co…
jyono Oct 10, 2024
78c0b41
Suggest codeowners for the core componenets (#9210)
simon-mo Oct 10, 2024
e4d652e
[torch.compile] integration with compilation control (#9058)
youkaichao Oct 10, 2024
9cc811c
Bump actions/github-script from 6 to 7 (#9197)
dependabot[bot] Oct 10, 2024
270953b
Bump actions/checkout from 3 to 4 (#9196)
dependabot[bot] Oct 10, 2024
fb870fd
Bump actions/setup-python from 3 to 5 (#9195)
dependabot[bot] Oct 10, 2024
a78c6ba
[ci/build] Add placeholder command for custom models test (#9262)
khluu Oct 10, 2024
e00c094
[torch.compile] generic decorators (#9258)
youkaichao Oct 10, 2024
f990bab
[Doc][Neuron] add note to neuron documentation about resolving triton…
omrishiv Oct 10, 2024
94bf9ae
[Misc] Fix sampling from sonnet for long context case (#9235)
Imss27 Oct 11, 2024
cbc2ef5
[misc] hide best_of from engine (#9261)
youkaichao Oct 11, 2024
e808156
[Misc] Collect model support info in a single process per model (#9233)
DarkLight1337 Oct 11, 2024
36ea790
[Misc][LoRA] Support loading LoRA weights for target_modules in reg f…
jeejeelee Oct 11, 2024
df3dcdf
[Bugfix] Fix priority in multiprocessing engine (#9277)
schoennenbeck Oct 11, 2024
7342a7d
[Model] Support Mamba (#6484)
tlrmchlsmth Oct 11, 2024
f710090
[Kernel] adding fused moe kernel config for L40S TP4 (#9245)
bringlein Oct 11, 2024
6cf1167
[Model] Add GLM-4v support and meet vllm==0.6.2 (#9242)
sixsixcoder Oct 11, 2024
1a18238
[Doc] Remove outdated comment to avoid misunderstanding (#9287)
homeffjy Oct 11, 2024
8baf85e
[Doc] Compatibility matrix for mutual exclusive features (#8512)
wallashss Oct 11, 2024
de9fb4b
[Bugfix][CI/Build] Fix docker build where CUDA archs < 7.0 are being …
LucasWilkinson Oct 11, 2024
c6cf929
[Bugfix] Sets `is_first_step_output` for TPUModelRunner (#9202)
allenwang28 Oct 11, 2024
d11b46f
[bugfix] fix f-string for error (#9295)
prashantgupta24 Oct 12, 2024
ec10cb8
[BugFix] Fix tool call finish reason in streaming case (#9209)
maxdebayser Oct 12, 2024
89feb4c
[SpecDec] Remove Batch Expansion (2/3) (#9298)
LiuXiaoxuanPKU Oct 12, 2024
00298e0
[Bugfix] Fix bug of xformer prefill for encoder-decoder (#9026)
xiangxu-google Oct 12, 2024
2b184dd
[Misc][Installation] Improve source installation script and doc (#9309)
cermeng Oct 12, 2024
250e26a
[Bugfix]Fix MiniCPM's LoRA bug (#9286)
jeejeelee Oct 12, 2024
f519902
[CI] Fix merge conflict (#9317)
LiuXiaoxuanPKU Oct 13, 2024
16b24e7
[Bugfix] Bandaid fix for speculative decoding tests (#9327)
tlrmchlsmth Oct 13, 2024
dfe43a2
[Model] Molmo vLLM Integration (#9016)
mrsalehi Oct 14, 2024
4141608
[Hardware][intel GPU] add async output process for xpu (#8897)
jikunshang Oct 14, 2024
203ab8f
[CI/Build] setuptools-scm fixes (#8900)
dtrifiro Oct 14, 2024
fd47e57
[Docs] Remove PDF build from Readtehdocs (#9347)
simon-mo Oct 14, 2024
473e7b3
[TPU] Fix TPU SMEM OOM by Pallas paged attention kernel (#9350)
WoosukKwon Oct 14, 2024
4d31cd4
[Frontend] merge beam search implementations (#9296)
LunrEclipse Oct 14, 2024
f0fe4fe
[Model] Make llama3.2 support multiple and interleaved images (#9095)
xiangxu-google Oct 14, 2024
169b530
[Bugfix] Clean up some cruft in mamba.py (#9343)
tlrmchlsmth Oct 15, 2024
44eaa5a
[Frontend] Clarify model_type error messages (#9345)
stevegrubb Oct 15, 2024
8e836d9
[Doc] Fix code formatting in spec_decode.rst (#9348)
mgoin Oct 15, 2024
55e081f
[Bugfix] Update InternVL input mapper to support image embeds (#9351)
hhzhang16 Oct 15, 2024
e9d517f
[BugFix] Fix chat API continuous usage stats (#9357)
njhill Oct 15, 2024
5d264f4
pass ignore_eos parameter to all benchmark_serving calls (#9349)
gracehonv Oct 15, 2024
22f8a69
[Misc] Directly use compressed-tensors for checkpoint definitions (#8…
mgoin Oct 15, 2024
ba30942
[Bugfix] Fix vLLM UsageInfo and logprobs None AssertionError with emp…
CatherineSue Oct 15, 2024
717a5f8
[Bugfix][CI/Build] Fix CUDA 11.8 Build (#9386)
LucasWilkinson Oct 16, 2024
ed92013
[Bugfix] Molmo text-only input bug fix (#9397)
mrsalehi Oct 16, 2024
7e7eae3
[Misc] Standardize RoPE handling for Qwen2-VL (#9250)
DarkLight1337 Oct 16, 2024
7abba39
[Model] VLM2Vec, the first multimodal embedding model in vLLM (#9303)
DarkLight1337 Oct 16, 2024
1de76a0
[CI/Build] Test VLM embeddings (#9406)
DarkLight1337 Oct 16, 2024
cee711f
[Core] Rename input data types (#8688)
DarkLight1337 Oct 16, 2024
59230ef
[Misc] Consolidate example usage of OpenAI client for multimodal mode…
ywang96 Oct 16, 2024
cf1d62a
[Model] Support SDPA attention for Molmo vision backbone (#9410)
Isotr0py Oct 16, 2024
415f76a
Support mistral interleaved attn (#9414)
patrickvonplaten Oct 16, 2024
fb60ae9
[Kernel][Model] Improve continuous batching for Jamba and Mamba (#9189)
mzusman Oct 16, 2024
5b8a1fd
[Model][Bugfix] Add FATReLU activation and support for openbmb/MiniCP…
streaver91 Oct 16, 2024
8345045
[Performance][Spec Decode] Optimize ngram lookup performance (#9333)
LiuXiaoxuanPKU Oct 16, 2024
776dbd7
[CI/Build] mypy: Resolve some errors from checking vllm/engine (#9267)
russellb Oct 16, 2024
c3fab5f
[Bugfix][Kernel] Prevent integer overflow in fp8 dynamic per-token qu…
tlrmchlsmth Oct 16, 2024
92d86da
[BugFix] [Kernel] Fix GPU SEGV occurring in int8 kernels (#9391)
rasmith Oct 17, 2024
dbfa8d3
Add notes on the use of Slack (#9442)
terrytangyuan Oct 17, 2024
e312e52
[Kernel] Add Exllama as a backend for compressed-tensors (#9395)
LucasWilkinson Oct 17, 2024
390be74
[Misc] Print stack trace using `logger.exception` (#9461)
DarkLight1337 Oct 17, 2024
9d30a05
[misc] CUDA Time Layerwise Profiler (#8337)
LucasWilkinson Oct 17, 2024
5e443b5
[Bugfix] Allow prefill of assistant response when using `mistral_comm…
sasha0552 Oct 17, 2024
8e1cddc
[TPU] Call torch._sync(param) during weight loading (#9437)
WoosukKwon Oct 17, 2024
5eda21e
[Hardware][CPU] compressed-tensor INT8 W8A8 AZP support (#9344)
bigPYJ1151 Oct 17, 2024
81ede99
[Core] Deprecating block manager v1 and make block manager v2 default…
KuntaiDu Oct 17, 2024
a2c71c5
[CI/Build] remove .github from .dockerignore, add dirty repo check (#…
dtrifiro Oct 17, 2024
7871659
[Misc] Remove commit id file (#9470)
DarkLight1337 Oct 17, 2024
0f41fbe
[torch.compile] Fine-grained CustomOp enabling mechanism (#9300)
ProExpertProg Oct 17, 2024
eca2c5f
[Bugfix] Fix support for dimension like integers and ScalarType (#9299)
bnellnm Oct 17, 2024
d65049d
[Bugfix] Add random_seed to sample_hf_requests in benchmark_serving s…
wukaixingxp Oct 17, 2024
d615b5c
[Bugfix] Print warnings related to `mistral_common` tokenizer only on…
sasha0552 Oct 17, 2024
bb76538
[Hardwware][Neuron] Simplify model load for transformers-neuronx libr…
sssrijan-amazon Oct 17, 2024
343f8e0
Support `BERTModel` (first `encoder-only` embedding model) (#9056)
robertgshaw2-neuralmagic Oct 17, 2024
48138a8
[BugFix] Stop silent failures on compressed-tensors parsing (#9381)
dsikka Oct 18, 2024
de4008e
[Bugfix][Core] Use torch.cuda.memory_stats() to profile peak memory u…
joerunde Oct 18, 2024
154a8ae
[Qwen2.5] Support bnb quant for Qwen2.5 (#9467)
blueyo0 Oct 18, 2024
944dd8e
[CI/Build] Use commit hash references for github actions (#9430)
russellb Oct 18, 2024
1ffc8a7
[BugFix] Typing fixes to RequestOutput.prompt and beam search (#9473)
njhill Oct 18, 2024
d2b1bf5
[Frontend][Feature] Add jamba tool parser (#9154)
tomeras91 Oct 18, 2024
25aeb7d
[BugFix] Fix and simplify completion API usage streaming (#9475)
njhill Oct 18, 2024
1bbbcc0
[CI/Build] Fix lint errors in mistral tokenizer (#9504)
DarkLight1337 Oct 18, 2024
ae8b633
[Bugfix] Fix offline_inference_with_prefix.py (#9505)
tlrmchlsmth Oct 18, 2024
7dbe738
[Misc] benchmark: Add option to set max concurrency (#9390)
russellb Oct 18, 2024
051eaf6
[Model] Add user-configurable task for models that support both gener…
DarkLight1337 Oct 18, 2024
67a7e5e
[CI/Build] Add error matching config for mypy (#9512)
russellb Oct 18, 2024
3921a2f
[Model] Support Pixtral models in the HF Transformers format (#9036)
mgoin Oct 18, 2024
9bb10a7
[MISC] Add lora requests to metrics (#9477)
coolkp Oct 18, 2024
d11bf43
[MISC] Consolidate cleanup() and refactor offline_inference_with_pref…
comaniac Oct 18, 2024
0c9a525
[Kernel] Add env variable to force flashinfer backend to enable tenso…
tdoublep Oct 19, 2024
337ed76
[Bugfix] Fix offline mode when using `mistral_common` (#9457)
sasha0552 Oct 19, 2024
380e186
:bug: fix torch memory profiling (#9516)
joerunde Oct 19, 2024
1325872
[Frontend] Avoid creating guided decoding LogitsProcessor unnecessari…
njhill Oct 19, 2024
82c2515
[Doc] update gpu-memory-utilization flag docs (#9507)
joerunde Oct 19, 2024
dfd951e
[CI/Build] Add error matching for ruff output (#9513)
russellb Oct 19, 2024
85dc92f
[CI/Build] Configure matcher for actionlint workflow (#9511)
russellb Oct 19, 2024
c5eea3c
[Frontend] Support simpler image input format (#9478)
yue-anyscale Oct 19, 2024
263d8ee
[Bugfix] Fix missing task for speculative decoding (#9524)
DarkLight1337 Oct 19, 2024
8e3e7f2
[Model][Pixtral] Optimizations for input_processor_for_pixtral_hf (#9…
mgoin Oct 19, 2024
5b59fe0
[Bugfix] Pass json-schema to GuidedDecodingParams and make test stron…
heheda12345 Oct 20, 2024
962d2c6
[Model][Pixtral] Use memory_efficient_attention for PixtralHFVision (…
mgoin Oct 20, 2024
4fa3e33
[Kernel] Support sliding window in flash attention backend (#9403)
heheda12345 Oct 20, 2024
855e0e6
[Frontend][Misc] Goodput metric support (#9338)
Imss27 Oct 20, 2024
696b01a
[CI/Build] Split up decoder-only LM tests (#9488)
DarkLight1337 Oct 21, 2024
496e991
[Doc] Consistent naming of attention backends (#9498)
tdoublep Oct 21, 2024
f6b9729
[Model] FalconMamba Support (#9325)
dhiaEddineRhaiem Oct 21, 2024
8ca8954
[Bugfix][Misc]: fix graph capture for decoder (#9549)
yudian0504 Oct 21, 2024
ec6bd6c
[BugFix] Use correct python3 binary in Docker.ppc64le entrypoint (#9492)
varad-ahirwadkar Oct 21, 2024
5241aa1
[Model][Bugfix] Fix batching with multi-image in PixtralHF (#9518)
mgoin Oct 21, 2024
6e79dcf
Merge remote-tracking branch 'origin/main' into upstream_merge_24_10_21
gshtras Oct 21, 2024
b10dad1
Merge remote-tracking branch 'upstream/main' into upstream_merge_24_1…
gshtras Oct 21, 2024
634d9b0
yapf
gshtras Oct 21, 2024
af76c9d
Merge branch 'main' into upstream_merge_24_10_21
gshtras Oct 22, 2024
a594c0c
Merge branch 'main' into upstream_merge_24_10_21
gshtras Oct 22, 2024
87e3970
Merge branch 'main' into upstream_merge_24_10_21
gshtras Oct 22, 2024
be448fb
Merge branch 'main' into upstream_merge_24_10_21
gshtras Oct 23, 2024
2 changes: 1 addition & 1 deletion .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh
@@ -2,7 +2,7 @@
# We can use this script to compute baseline accuracy on GSM for transformers.
#
# Make sure you have lm-eval-harness installed:
# pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git@9516087b81a61d0e220b22cc1b75be76de23bc10
# pip install lm-eval==0.4.4

usage() {
echo``
@@ -3,7 +3,7 @@
# We use this for fp8, which HF does not support.
#
# Make sure you have lm-eval-harness installed:
# pip install lm-eval==0.4.3
# pip install lm-eval==0.4.4

usage() {
echo``
4 changes: 2 additions & 2 deletions .buildkite/release-pipeline.yaml
@@ -3,7 +3,7 @@ steps:
agents:
queue: cpu_queue
commands:
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg buildkite_commit=$BUILDKITE_COMMIT --build-arg USE_SCCACHE=1 --build-arg CUDA_VERSION=12.1.0 --tag vllm-ci:build-image --target build --progress plain ."
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.1.0 --tag vllm-ci:build-image --target build --progress plain ."
- "mkdir artifacts"
- "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'"
# rename the files to change linux -> manylinux1
@@ -22,7 +22,7 @@ steps:
agents:
queue: cpu_queue
commands:
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg buildkite_commit=$BUILDKITE_COMMIT --build-arg USE_SCCACHE=1 --build-arg CUDA_VERSION=11.8.0 --tag vllm-ci:build-image --target build --progress plain ."
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=11.8.0 --tag vllm-ci:build-image --target build --progress plain ."
- "mkdir artifacts"
- "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'"
# rename the files to change linux -> manylinux1
8 changes: 7 additions & 1 deletion .buildkite/run-cpu-test-ppc64le.sh
@@ -18,7 +18,13 @@ docker run -itd --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/hugg
# Run basic model test
docker exec cpu-test bash -c "
pip install pytest matplotlib einops transformers_stream_generator
pytest -v -s tests/models -m \"not vlm\" --ignore=tests/models/test_embedding.py --ignore=tests/models/test_oot_registration.py --ignore=tests/models/test_registry.py --ignore=tests/models/test_jamba.py --ignore=tests/models/test_danube3_4b.py" # Mamba and Danube3-4B on CPU is not supported
pytest -v -s tests/models -m \"not vlm\" \
--ignore=tests/models/test_embedding.py \
--ignore=tests/models/test_oot_registration.py \
--ignore=tests/models/test_registry.py \
--ignore=tests/models/test_jamba.py \
--ignore=tests/models/test_mamba.py \
--ignore=tests/models/test_danube3_4b.py" # Mamba kernels and Danube3-4B on CPU is not supported

# online inference
docker exec cpu-test bash -c "
9 changes: 8 additions & 1 deletion .buildkite/run-cpu-test.sh
@@ -27,13 +27,20 @@ docker exec cpu-test bash -c "
pytest -v -s tests/models/decoder_only/language \
--ignore=tests/models/test_fp8.py \
--ignore=tests/models/decoder_only/language/test_jamba.py \
--ignore=tests/models/decoder_only/language/test_mamba.py \
--ignore=tests/models/decoder_only/language/test_granitemoe.py \
--ignore=tests/models/decoder_only/language/test_danube3_4b.py" # Mamba and Danube3-4B on CPU is not supported

# Run compressed-tensor test
docker exec cpu-test bash -c "
pytest -s -v \
tests/quantization/test_compressed_tensors.py::test_compressed_tensors_w8a8_static_setup \
tests/quantization/test_compressed_tensors.py::test_compressed_tensors_w8a8_dynanmic_per_token"
tests/quantization/test_compressed_tensors.py::test_compressed_tensors_w8a8_dynamic_per_token"

# Run AWQ test
docker exec cpu-test bash -c "
pytest -s -v \
tests/quantization/test_ipex_quant.py"

# online inference
docker exec cpu-test bash -c "
30 changes: 21 additions & 9 deletions .buildkite/test-pipeline.yaml
@@ -98,7 +98,6 @@ steps:
- vllm/
commands:
- pip install -e ./plugins/vllm_add_dummy_model
- pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git@a4987bba6e9e9b3f22bd3a6c1ecf0abd04fd5622#egg=lm_eval[api]
- pytest -v -s entrypoints/llm --ignore=entrypoints/llm/test_lazy_outlines.py --ignore=entrypoints/llm/test_generate.py --ignore=entrypoints/llm/test_generate_multiple_loras.py --ignore=entrypoints/llm/test_guided_generate.py
- pytest -v -s entrypoints/llm/test_lazy_outlines.py # it needs a clean process
- pytest -v -s entrypoints/llm/test_generate.py # it needs a clean process
@@ -118,7 +117,9 @@ steps:
- vllm/core/
- tests/distributed
- tests/spec_decode/e2e/test_integration_dist_tp4
- tests/compile
commands:
- pytest -v -s compile/test_basic_correctness.py
- pytest -v -s distributed/test_pynccl.py
- pytest -v -s spec_decode/e2e/test_integration_dist_tp4.py

@@ -179,6 +180,7 @@ steps:
- python3 offline_inference_vision_language_multi_image.py
- python3 tensorize_vllm_model.py --model facebook/opt-125m serialize --serialized-directory /tmp/ --suffix v1 && python3 tensorize_vllm_model.py --model facebook/opt-125m deserialize --path-to-tensors /tmp/vllm/facebook/opt-125m/v1/model.tensors
- python3 offline_inference_encoder_decoder.py
- python3 offline_profile.py --model facebook/opt-125m

- label: Prefix Caching Test # 9min
#mirror_hardwares: [amd]
@@ -226,7 +228,7 @@ steps:
- vllm/
- tests/compile
commands:
- pytest -v -s compile/test_full_graph_smoke.py
- pytest -v -s compile/test_basic_correctness.py

- label: "PyTorch Fullgraph Test" # 18min
source_file_dependencies:
@@ -270,15 +272,14 @@
- csrc/
- vllm/model_executor/layers/quantization
- tests/quantization
command: pytest -v -s quantization
command: VLLM_TEST_FORCE_LOAD_FORMAT=auto pytest -v -s quantization

- label: LM Eval Small Models # 53min
working_dir: "/vllm-workspace/.buildkite/lm-eval-harness"
source_file_dependencies:
- csrc/
- vllm/model_executor/layers/quantization
commands:
- pip install lm-eval
- export VLLM_WORKER_MULTIPROC_METHOD=spawn
- bash ./run-tests.sh -c configs/models-small.txt -t 1

@@ -309,13 +310,22 @@
- pytest -v -s models/test_oot_registration.py # it needs a clean process
- pytest -v -s models/*.py --ignore=models/test_oot_registration.py

- label: Decoder-only Language Models Test # 1h36min
- label: Decoder-only Language Models Test (Standard) # 35min
#mirror_hardwares: [amd]
source_file_dependencies:
- vllm/
- tests/models/decoder_only/language
commands:
- pytest -v -s models/decoder_only/language
- pytest -v -s models/decoder_only/language/test_models.py
- pytest -v -s models/decoder_only/language/test_big_models.py

- label: Decoder-only Language Models Test (Extended) # 1h20min
nightly: true
source_file_dependencies:
- vllm/
- tests/models/decoder_only/language
commands:
- pytest -v -s models/decoder_only/language --ignore=models/decoder_only/language/test_models.py --ignore=models/decoder_only/language/test_big_models.py

- label: Decoder-only Multi-Modal Models Test # 1h31min
#mirror_hardwares: [amd]
@@ -332,17 +342,20 @@
source_file_dependencies:
- vllm/
- tests/models/embedding/language
- tests/models/embedding/vision_language
- tests/models/encoder_decoder/language
- tests/models/encoder_decoder/vision_language
commands:
- pytest -v -s models/embedding/language
- pytest -v -s models/embedding/vision_language
- pytest -v -s models/encoder_decoder/language
- pytest -v -s models/encoder_decoder/vision_language

# This test is used only in PR development phase to test individual models and should never run on main
- label: Custom Models Test
#mirror_hardwares: [amd]
optional: true
commands:
- echo 'Testing custom models...'
# PR authors can temporarily add commands below to test individual models
# e.g. pytest -v -s models/encoder_decoder/vision_language/test_mllama.py
# *To avoid merge conflicts, remember to REMOVE (not just comment out) them before merging the PR*
@@ -390,7 +403,7 @@
- tests/distributed/
- vllm/compilation
commands:
- pytest -v -s ./compile/test_full_graph_multi_gpu.py
- pytest -v -s ./compile/test_basic_correctness.py
- pytest -v -s ./compile/test_wrapper.py
- VLLM_TEST_SAME_HOST=1 torchrun --nproc-per-node=4 distributed/test_same_node.py | grep -q 'Same node test passed'
- TARGET_TEST_SUITE=L4 pytest basic_correctness/ -v -s -m distributed_2_gpus
@@ -492,6 +505,5 @@
- csrc/
- vllm/model_executor/layers/quantization
commands:
- pip install lm-eval
- export VLLM_WORKER_MULTIPROC_METHOD=spawn
- bash ./run-tests.sh -c configs/models-large.txt -t 4
31 changes: 29 additions & 2 deletions .dockerignore
@@ -1,6 +1,33 @@
/.github/
/.venv
/build
dist
Dockerfile*
vllm/*.so

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

.mypy_cache

# Distribution / packaging
.Python
/build/
cmake-build-*/
CMakeUserPresets.json
develop-eggs/
/dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
21 changes: 16 additions & 5 deletions .github/CODEOWNERS
@@ -1,19 +1,30 @@
# See https://help.github.com/articles/about-codeowners/
# for more info about CODEOWNERS file

# This list covers the "core" components of vLLM that require careful review
/vllm/attention/backends/abstract.py @WoosukKwon @zhuohan123 @youkaichao @alexm-neuralmagic @comaniac @njhill
/vllm/core @WoosukKwon @zhuohan123 @youkaichao @alexm-neuralmagic @comaniac @njhill
/vllm/engine/llm_engine.py @WoosukKwon @zhuohan123 @youkaichao @alexm-neuralmagic @comaniac @njhill
/vllm/executor/executor_base.py @WoosukKwon @zhuohan123 @youkaichao @alexm-neuralmagic @comaniac @njhill
/vllm/worker/worker_base.py @WoosukKwon @zhuohan123 @youkaichao @alexm-neuralmagic @comaniac @njhill
/vllm/worker/worker.py @WoosukKwon @zhuohan123 @youkaichao @alexm-neuralmagic @comaniac @njhill
/vllm/model_executor/layers/sampler.py @WoosukKwon @zhuohan123 @youkaichao @alexm-neuralmagic @comaniac @njhill
CMakeLists.txt @tlrmchlsmth @WoosukKwon

# Test ownership
/tests/async_engine @njhill @robertgshaw2-neuralmagic @simon-mo
/tests/test_inputs.py @DarkLight1337 @ywang96
/tests/entrypoints @DarkLight1337 @robertgshaw2-neuralmagic @simon-mo
/tests/models @DarkLight1337 @ywang96
/tests/multimodal @DarkLight1337 @ywang96
/tests/prefix_caching @comaniac @KuntaiDu
/tests/spec_decode @njhill @LiuXiaoxuanPKU
/tests/kernels @tlrmchlsmth @WoosukKwon
/tests/quantization @mgoin @robertgshaw2-neuralmagic
/.buildkite/lm-eval-harness @mgoin @simon-mo
/tests/distributed/test_multi_node_assignment.py @youkaichao
/tests/distributed/test_pipeline_parallel.py @youkaichao
/tests/distributed/test_same_node.py @youkaichao
/tests/multi_step @alexm-neuralmagic @SolitaryThinker @comaniac
/tests/multi_step @alexm-neuralmagic @comaniac
/tests/weight_loading @mgoin @youkaichao
/tests/basic_correctness/test_chunked_prefill @rkooo567 @comaniac
7 changes: 7 additions & 0 deletions .github/dependabot.yml
@@ -0,0 +1,7 @@
version: 2
updates:
# Maintain dependencies for GitHub Actions
- package-ecosystem: "github-actions"
directory: "/"
schedule:
interval: "weekly"
3 changes: 2 additions & 1 deletion .github/workflows/actionlint.yml
@@ -28,10 +28,11 @@ jobs:
runs-on: ubuntu-latest
steps:
- name: "Checkout"
uses: actions/checkout@692973e3d937129bcbf40652eb9f2f61becf3332 # v4.1.7
uses: actions/checkout@eef61447b9ff4aafe5dcd4e0bbf5d482be7e7871 # v4.2.1
with:
fetch-depth: 0

- name: "Run actionlint"
run: |
echo "::add-matcher::.github/workflows/matchers/actionlint.json"
tools/actionlint.sh -color
2 changes: 1 addition & 1 deletion .github/workflows/add_label_automerge.yml
@@ -8,7 +8,7 @@ jobs:
runs-on: ubuntu-latest
steps:
- name: Add label
uses: actions/github-script@v6
uses: actions/github-script@60a0d83039c74a4aee543508d2ffcb1c3799cdea # v7.0.1
with:
script: |
github.rest.issues.addLabels({
6 changes: 3 additions & 3 deletions .github/workflows/clang-format.yml
@@ -17,9 +17,9 @@ jobs:
matrix:
python-version: ["3.11"]
steps:
- uses: actions/checkout@v3
- uses: actions/checkout@eef61447b9ff4aafe5dcd4e0bbf5d482be7e7871 # v4.2.1
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v3
uses: actions/setup-python@f677139bbe7f9c59b41e40162b753c062f5d49a3 # v5.2.0
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
@@ -38,4 +38,4 @@ jobs:
)
find csrc/ \( -name '*.h' -o -name '*.cpp' -o -name '*.cu' -o -name '*.cuh' \) -print \
| grep -vFf <(printf "%s\n" "${EXCLUDES[@]}") \
| xargs clang-format --dry-run --Werror
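The `grep -vFf <(printf …)` idiom in the workflow above filters the candidate file list through a list of fixed-string exclusions before handing it to `xargs clang-format`. A minimal sketch of the same idiom in isolation — the paths below are hypothetical placeholders, not the workflow's real `EXCLUDES` entries:

```shell
#!/usr/bin/env bash
# Fixed-string exclusion filtering, as used in the clang-format workflow above.
# -F treats the patterns as literal strings, -v inverts the match, and -f reads
# the patterns from a "file" supplied here via process substitution.
EXCLUDES=(
  'csrc/excluded/legacy.h'
)
printf '%s\n' 'csrc/kept/kernel.cu' 'csrc/excluded/legacy.h' \
  | grep -vFf <(printf '%s\n' "${EXCLUDES[@]}")
# Only csrc/kept/kernel.cu survives the filter.
```

Because the excluded paths are matched literally rather than as regexes, entries with dots or brackets need no escaping.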
16 changes: 16 additions & 0 deletions .github/workflows/matchers/mypy.json
@@ -0,0 +1,16 @@
{
"problemMatcher": [
{
"owner": "mypy",
"pattern": [
{
"regexp": "^(.+):(\\d+):\\s(error|warning):\\s(.+)$",
"file": 1,
"line": 2,
"severity": 3,
"message": 4
}
]
}
]
}
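The matcher's `regexp` is what tells GitHub how to split a mypy diagnostic into the `file`/`line`/`severity`/`message` fields annotated on the PR. A quick sanity check of how it tokenizes a typical line — the ERE below substitutes `[0-9]` for `\d` and a literal space for `\s` (sed does not speak PCRE), and the sample path and message are made up, not real CI output:

```shell
#!/usr/bin/env bash
# ERE translation of the mypy matcher's regexp (\d -> [0-9], \s -> ' ').
re='^(.+):([0-9]+): (error|warning): (.+)$'
# Hypothetical mypy-style diagnostic line:
line='vllm/engine/llm_engine.py:42: error: Incompatible return value type'
# Print the captures the matcher maps to file / line / severity / message:
echo "$line" | sed -E "s/$re/file=\1 line=\2 severity=\3 message=\4/"
# -> file=vllm/engine/llm_engine.py line=42 severity=error message=Incompatible return value type
```

Lines that do not carry an `error:`/`warning:` marker (e.g. mypy's summary line) simply fail to match and produce no annotation.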
17 changes: 17 additions & 0 deletions .github/workflows/matchers/ruff.json
@@ -0,0 +1,17 @@
{
"problemMatcher": [
{
"owner": "ruff",
"pattern": [
{
"regexp": "^(.+?):(\\d+):(\\d+): (\\w+): (.+)$",
"file": 1,
"line": 2,
"column": 3,
"code": 4,
"message": 5
}
]
}
]
}
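The ruff matcher works the same way but additionally captures a column and a rule `code`. A sketch of how its regexp splits a line — note the lazy `+?` is not valid ERE, so the demo uses greedy `+`, which happens to split identically for lines like this; the sample path and message are illustrative only:

```shell
#!/usr/bin/env bash
# ERE approximation of the ruff matcher's regexp (\d -> [0-9], \w -> [A-Za-z0-9_],
# lazy .+? -> greedy .+, which yields the same split on ordinary diagnostics).
re='^(.+):([0-9]+):([0-9]+): ([A-Za-z0-9_]+): (.+)$'
# Hypothetical ruff-style diagnostic line:
line='vllm/core/scheduler.py:10:5: F401: `os` imported but unused'
# Show the file:line:column location and the rule code the matcher extracts:
echo "$line" | sed -E "s/$re/code=\4 at \1:\2:\3/"
# -> code=F401 at vllm/core/scheduler.py:10:5
```

The `code` capture is what lets GitHub's annotation UI display the ruff rule (e.g. F401) separately from the message text.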
20 changes: 5 additions & 15 deletions .github/workflows/mypy.yaml
@@ -11,15 +11,15 @@ on:
- main

jobs:
ruff:
mypy:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.8", "3.9", "3.10", "3.11", "3.12"]
steps:
- uses: actions/checkout@v3
- uses: actions/checkout@eef61447b9ff4aafe5dcd4e0bbf5d482be7e7871 # v4.2.1
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v3
uses: actions/setup-python@f677139bbe7f9c59b41e40162b753c062f5d49a3 # v5.2.0
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
@@ -32,15 +32,5 @@ jobs:
pip install types-setuptools
- name: Mypy
run: |
mypy
mypy tests --follow-imports skip
mypy vllm/attention --follow-imports skip
mypy vllm/distributed --follow-imports skip
mypy vllm/engine --follow-imports skip
mypy vllm/executor --follow-imports skip
mypy vllm/lora --follow-imports skip
mypy vllm/model_executor --follow-imports skip
mypy vllm/prompt_adapter --follow-imports skip
mypy vllm/spec_decode --follow-imports skip
mypy vllm/worker --follow-imports skip

echo "::add-matcher::.github/workflows/matchers/mypy.json"
tools/mypy.sh 1