v0.6.5
What's Changed
- xpu: refactor XPU worker & executor by @AlpinDale in #861
- build: add jinja2 to requirements file by @AlpinDale in #862
- attention: add `AttentionState` abstraction by @AlpinDale in #863
- xpu: disable punica kernels for XPU by @AlpinDale in #864
- executor: pipe `worker_class_fn` arg in executor by @AlpinDale in #865
- server: log the process occupying our port by @AlpinDale in #866
- feat: AWQ quantization for InternVL by @AlpinDale in #867
- Rewrite DRY sampler to be a lot faster by @50h100a in #868
- fix: ROCm build by @Naomiusearch in #817
- fix: temp_last warning being repeated for every output token by @AlpinDale in #869
- feat: add support for chunked prefill + prefix caching by @AlpinDale in #871
- async: avoid premature exit in the async generator by @AlpinDale in #872
- cpu: fix `mm_limits` initialization by @AlpinDale in #873
- spec decoding: set the draft model ctxlen to target model by @AlpinDale in #874
- sampler: pad dry sequence breakers tensor by @AlpinDale in #875
- fix: `add_generation_template` -> `add_generation_prompt` in llm by @AlpinDale in #877
- Update README.md by @NoahBPeterson in #876
- api: fix crashes under very high loads by @AlpinDale in #878
- build: pass `PYTHONPATH` from setup.py to cmake by @AlpinDale in #879
- async: disable multi-step scheduling for sync engine by @AlpinDale in #880
- api: better startup failure UX by @AlpinDale in #881
- chore: consolidate environment variables within one file by @AlpinDale in #882
- core: fix spec decode metrics and envs circular import by @AlpinDale in #889
- feat: add support for audio models by @AlpinDale in #891
- distributed: fix issue for when nodes have multiple network interfaces by @AlpinDale in #892
- rocm: fix compile issues with rocm 6.2 by @AlpinDale in #893
- build: fix invalid path for envs.py in setup by @AlpinDale in #894
- kernel: use `cub::BlockReduce` instead of custom impl by @AlpinDale in #895
- fix: Phi 3.5 Vision model loading by @AlpinDale in #896
- api: add client timeouts for the ZeroMQ server by @AlpinDale in #897
- feat: add torch.compile for GemmaRMSNorm by @AlpinDale in #898
- spec decode: add support for EAGLE by @AlpinDale in #899
- fix: `ShardedStateLoader` with fp8 quant by @AlpinDale in #900
- kernel: do not compile machete for cuda 11 and below by @AlpinDale in #901
- chore: add AphroditeParameter support for FP8 quant by @AlpinDale in #902
- spec decode: fix logprobs when using speculative decoding by @AlpinDale in #904
- api: error suppression cleanup + timeout suppression on aborts by @AlpinDale in #905
- ray: better error when placement group topology is incorrect by @AlpinDale in #906
- xpu: refactor the model runner for tensor parallelism by @AlpinDale in #910
- fix: empty prompt crashing the server by @AlpinDale in #912
- quantization: update marlin to use `AphroditeParameters` by @AlpinDale in #913
- core: add multi-step scheduling support for the synchronous engine by @AlpinDale in #914
- api: add json_schema to OpenAI server by @AlpinDale in #915
- fix: phi3v crash with unusual image sizes by @AlpinDale in #916
- feat: multi-image input support for Phi3V by @AlpinDale in #917
- spec decode: streamline batch expansion tensor manipulation by @AlpinDale in #918
- api: use fp32 for base64 embeddings by @AlpinDale in #919
- core: improve warmup times for prefix caching in block manager v2 by @AlpinDale in #920
- quants: update `qqq` and `gptq_marlin_24` to use AphroditeParameters by @AlpinDale in #921
- distributed: fix custom allreduce p2p cache file generation by @AlpinDale in #922
- neuron: add support for tensor parallelism by @AlpinDale in #923
- quants: update compressed tensors lifecycle to remove `prefix` from `create_weights` by @AlpinDale in #924
- feat: add async postprocessor by @AlpinDale in #925
- api: add endpoint for loading and unloading the model by @AlpinDale in #926
- feat: add single user mode by @AlpinDale in #927
- api: add inline model loading by @AlpinDale in #928
- api: support aphrodite_config.yaml with inline loading by @AlpinDale in #929
- fix: inline model loading conflicts with lora by @AlpinDale in #930
- core: do not compile for profiling by @AlpinDale in #931
- xpu: support pipeline parallel by @AlpinDale in #932
- fix: phi3v image_idx in async server by @AlpinDale in #933
- feat: add fused Marlin MoE kernel by @AlpinDale in #934
- chore: multi-image support for llava-next by @AlpinDale in #935
- model: add support for paligemma2 by @AlpinDale in #936
- vlm: stack multimodal tensors to represent multiple images within each prompt by @AlpinDale in #937
- core: do not compile ScalarType for torch < 2.4.0 by @AlpinDale in #938
- core: add virtual engine for async outproc by @AlpinDale in #939
- api: log prompt truncation by @AlpinDale in #940
- vlm: fix incompatibility between nested tensors and multi-image llava-next by @AlpinDale in #941
- vlm: fix persimmon and fuyu issues with transformers 4.45 by @AlpinDale in #942
- Fix SentencePieceTokenizer error when generating on Mistral Large 2411 with `--tokenizer-mode mistral` by @khanonnie in #943
- core: use flashinfer for FP8 KV when available by @AlpinDale in #944
- tests: update flashinfer test for #944 by @AlpinDale in #945
- quants: add triton kernels for AWQ by @AlpinDale in #946
- tests: add kernel tests for causal_conv1d and mamba_ssm by @AlpinDale in #947
- fix: do not register punica with torch if using older torch by @AlpinDale in #948
- tpu: avoid dynamo guard eval overhead by @AlpinDale in #949
- fix: issues with flashinfer fp8 kv by @AlpinDale in #950
- api: optimize zeromq frontend performance by @AlpinDale in #951
- tpu: remove torch._dynamo.reset() by @AlpinDale in #952
- vlm: fix errors on ragged NestedTensors by @AlpinDale in #953
- spec decode: match the original rank computation impl for spec decoding by @AlpinDale in #954
- core: support multi-step scheduling w/ async post-processor by @AlpinDale in #955
- Revert "fix: issues with flashinfer fp8 kv (#950)" by @AlpinDale in #956
- misc: extend cuda graph capture size for H200 by @AlpinDale in #957
- fix: gguf vocab embeddings in TP by @AlpinDale in #958
- quant: update tpu_int8 to use AphroditeParameters by @AlpinDale in #959
- neuron: support for context length and token bucketing by @AlpinDale in #960
- quant: support pre-quanted bitsandbytes checkpoints by @AlpinDale in #961
- vlm: do not allow max_model_len overflow by @AlpinDale in #962
- core: support logprobs with multi-step scheduling by @AlpinDale in #963
- ci: bump aphrodite version to 0.6.5 by @AlpinDale in #964
New Contributors
- @NoahBPeterson made their first contribution in #876
- @khanonnie made their first contribution in #943
Full Changelog: v0.6.4.post1...v0.6.5