Key changes
- General improvements and fixes
- ISQ FP8
- GPTQ Marlin
- 26% faster decoding (T/s) on Metal
- Python package wheels are available. See below and the various PyPI packages.
What's Changed
- Update docs and deps by @EricLBuehler in #804
- Support Qwen 2.5 by @EricLBuehler in #805
- Update docs with clarifications and notes by @EricLBuehler in #806
- Improved inverting for Attention Mask by @EricLBuehler in #811
- Fix repeat_interleave by @EricLBuehler in #812
- Use f32 for neg inf in cross attn mask by @EricLBuehler in #814
- Improve UQFF memory efficiency by @EricLBuehler in #813
- Update Metal, CUDA Candle impls and ISQ by @EricLBuehler in #816
- chore: update pagedattention.cu by @eltociear in #822
- MLlama - if f16, load vision model in f32 by @EricLBuehler in #820
- ci: Upgrade actions by @polarathene in #823
- docs: added a top button because of readme length by @bhargavshirin in #833
- Typo in error of model architecture enum by @nikolaydubina in #835
- Expose config for Rust api, tweak modekind by @EricLBuehler in #841
- Add ISQ FP8 by @EricLBuehler in #832
- Fix Metal F8 build errors by @EricLBuehler in #846
- Bump pyo3 from 0.22.3 to 0.22.4 by @dependabot in #854
- Generate standalone UQFF models by @EricLBuehler in #849
- Update README.MD by @kaleaditya779 in #848
- Add GPTQ Marlin support for 4 and 8 bit by @EricLBuehler in #856
- Adds wrap_help feature to clap by @DaveTJones in #858
- Patch UQFF metal generation by @EricLBuehler in #857
- Add GGUF Qwen 2 by @EricLBuehler in #860
- Avoid duplicate Metal command buffer encodings during ISQ by @EricLBuehler in #861
- Fix for isnanf by @EricLBuehler in #859
- Fix some metal warnings by @EricLBuehler in #862
- Support interactive mode markdown bold/italics via ANSI codes by @EricLBuehler in #879
- Even better V-Llama accuracy by @EricLBuehler in #881
- Trim whitespace (such as carriage returns) from nvidia-smi output. by @asaddi in #880
- MODEL_ID not "MODEL_ID" by @simonw in #863
- Sync ggml metal kernels by @EricLBuehler in #885
- Increase Metal decoding T/s by 26% by @EricLBuehler in #887
- Remove pretty-printer by @EricLBuehler in #889
- Fix typo in documentation by @msk in #888
- fix Half-Quadratic Quantization and Dequantization on CPU by @haricot in #873
- Prepare for v0.3.2 by @EricLBuehler in #891
New Contributors
- @bhargavshirin made their first contribution in #833
- @nikolaydubina made their first contribution in #835
- @kaleaditya779 made their first contribution in #848
- @DaveTJones made their first contribution in #858
- @asaddi made their first contribution in #880
- @simonw made their first contribution in #863
- @msk made their first contribution in #888
- @haricot made their first contribution in #873
Full Changelog: v0.3.1...v0.3.2