v0.2.0
What's Changed
Highlights
Custom CPU/CUDA extensions to ship CPU/CUDA binaries.
PyTorch core recently shipped a new custom op registration mechanism via `torch.library`. Its main benefit is that custom ops compose with as many PyTorch subsystems as possible, most notably without graph breaking under `torch.compile()`.
We've added documentation on how to register your own custom ops at https://github.com/pytorch/ao/tree/main/torchao/csrc, and if you learn better by example you can follow PR #135 to add your own custom ops to torchao.
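To give a flavor of the registration pattern (this is not the torchao `csrc` path itself), here is a minimal sketch using the Python-side `torch.library.custom_op` decorator available in newer PyTorch releases; the op name `mylib::scale_add` and its implementation are made up purely for illustration:

```python
import torch
from torch.library import custom_op

# Hypothetical op purely for illustration; torchao's real ops live in csrc
# and are registered from C++/CUDA.
@custom_op("mylib::scale_add", mutates_args=())
def scale_add(x: torch.Tensor, y: torch.Tensor, alpha: float) -> torch.Tensor:
    return x + alpha * y

# A fake (meta) implementation lets torch.compile trace the op's output
# shapes without running the real kernel, which is what avoids graph breaks.
@scale_add.register_fake
def _(x: torch.Tensor, y: torch.Tensor, alpha: float) -> torch.Tensor:
    return torch.empty_like(x)

compiled = torch.compile(lambda a, b: scale_add(a, b, 0.5))
print(compiled(torch.randn(4), torch.randn(4)))
```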
Most notably, these instructions were leveraged by @gau-nernst to integrate new custom ops for fp6 support #223
One key benefit of integrating your kernels directly in torchao is that, thanks to our manylinux GPU support, we can ensure the CPU/CUDA kernels you add work on as many devices and CUDA versions as possible #176
A lot of prototype and community contributions
@jeromeku was our community champion, merging support for:
- GaLore, our first pretraining kernel, which lets you fine-tune Llama 7B on a single 4090 card with up to 70% speedups relative to eager PyTorch
- DoRA, which has been shown to yield better fine-tuning accuracy than QLoRA. This is an area where the community can help us benchmark more thoroughly: https://github.com/pytorch/ao/tree/main/torchao/prototype/dora
- Fused int4/fp16 quantized matmul, which is particularly useful for compute-bound workloads, showing 4x speedups over tinygemm at larger batch sizes such as 512 (a simplified, unfused reference of the computation is sketched after this list): https://github.com/pytorch/ao/tree/main/torchao/prototype/hqq
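For readers unfamiliar with weight-only int4 quantization, here is a rough, unfused pure-PyTorch reference of what an int4/fp16 grouped matmul computes. The function name, shapes, and group size are made up for illustration; the prototype kernel fuses the dequantize and matmul steps on GPU in fp16, while this reference runs in fp32 just to show the math:

```python
import torch

# Unfused reference: weights are stored as 4-bit integers (held in an int8
# container here) plus per-group scales and zero points, dequantized, then
# multiplied by the activations. group_size=128 is an example choice.
def int4_matmul_reference(x, w_int4, scales, zeros, group_size=128):
    out_features, in_features = w_int4.shape
    w = w_int4.float().reshape(out_features, in_features // group_size, group_size)
    w = (w - zeros.unsqueeze(-1)) * scales.unsqueeze(-1)  # per-group dequantize
    return x @ w.reshape(out_features, in_features).t()

x = torch.randn(512, 4096)
w_int4 = torch.randint(0, 16, (4096, 4096), dtype=torch.int8)
scales = torch.rand(4096, 4096 // 128)
zeros = torch.full((4096, 4096 // 128), 8.0)
print(int4_matmul_reference(x, w_int4, scales, zeros).shape)  # torch.Size([512, 4096])
```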
@gau-nernst merged fp6 support, showing up to 8x speedups over an fp16 baseline for small-batch-size inference #223
NF4 support for upcoming FSDP2
@weifengpy merged support for composing FSDP2 with NF4, which makes it easy to implement algorithms like QLoRA + FSDP without writing any CUDA or C++ code. This work also provides a blueprint for how to compose smaller dtypes with FSDP #150, most notably by implementing `torch.chunk()`. We hope the broader community uses this work to experiment more heavily at the intersection of distributed and quantization research, and that it inspires many more studies such as the one done by Answer.ai https://www.answer.ai/posts/2024-03-06-fsdp-qlora.html
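To give a flavor of what "implementing `torch.chunk()`" means for a tensor subclass, here is a heavily simplified sketch, not the actual NF4Tensor code: a generic wrapper subclass intercepts the op and returns re-wrapped shards, which is the kind of behavior a sharding system in the spirit of FSDP2 relies on. The `WrapperTensor` name and its plain-tensor payload are assumptions for illustration only.

```python
import torch

# Illustrative wrapper subclass (not NF4Tensor): it handles torch.chunk so
# shards stay in the subclass instead of falling back to plain tensors.
class WrapperTensor(torch.Tensor):
    @staticmethod
    def __new__(cls, data):
        return torch.Tensor._make_wrapper_subclass(
            cls, data.shape, dtype=data.dtype, device=data.device
        )

    def __init__(self, data):
        self._data = data  # in NF4 this would be packed quantized state

    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        if func is torch.chunk:
            # Chunk the inner data and re-wrap each shard in the subclass.
            pieces = torch.chunk(args[0]._data, *args[1:], **kwargs)
            return [cls(p) for p in pieces]
        raise NotImplementedError(f"{func} is not supported in this sketch")

w = WrapperTensor(torch.randn(8, 4))
shards = torch.chunk(w, 2, dim=0)
print([s.shape for s in shards])  # [torch.Size([4, 4]), torch.Size([4, 4])]
```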
BC breaking
Deprecations
New Features
- Match autoquant API with torch.compile (#109, #162, #175)
- [Prototype] 8da4w QAT (#138, #199, #198, #211, #154, #157, #229)
- [Prototype] GaLore (#95)
- [Prototype] DoRA (#216)
- [Prototype] HQQ (#153, #185)
- [Prototype] 2:4 sparse + int8 sparse subclass (#36)
- [Prototype] Unified quantization primitives (#159, #201, #193, #220, #227, #173, #210)
- [Prototype] Pruning primitives (#148, #194)
- [Prototype] AffineQuantizedTensor subclass (#214, #230, #243, #247, #251)
- [Prototype] Add `Int4WeightOnlyQuantizer` (#119)
- Custom CUDA extensions (#135, #186, #232)
- [Prototype] Add FP6 Linear (#223)
Improvements
- FSDP2 support for NF4Tensor (#118, #150, #207)
- Add save/load of int8 weight only quantized model (#122)
- Add int_scaled_mm on CPU (#121)
- Add CPU and GPU support to the int4wo and int4wo-gptq quantizers (#131)
- Add torch.export support to int8_dq, int8_wo, int4_wo subclasses (#146, #226, #213)
- Remove `is_gpt_fast` specialization from GPTQ (#172)
- Common benchmark and profile utils (#238)
Bug fixes
- Fix padding in GPTQ (#119, #120)
- Fix `Int8DynActInt4WeightLinear` module swap (#151)
- Fix `NF4Tensor.to` to use device kwarg (#158)
- Fix `quantize_activation_per_token_absmax` perf regression (#253)
Performance
Docs
- Update READMEs (#140, #142, #169, #155, #179, #187, #188, #200, #217, #245)
- Add https://pytorch.org/ao (#136, #145, #163, #164, #165, #168, #177, #195, #224)
CI
- Add A10G support in CI (#176)
- General CI improvements (#161, #171, #178, #180, #183, #107, #215, #244, #257, #235, #242)
- Add expecttest to requirements.txt (#225)
- Push button binary support (#241, #240, #250)
Not user facing
Security
Untopiced
New Contributors
- @Xia-Weiwen made their first contribution in #121
- @jeromeku made their first contribution in #95
- @weifengpy made their first contribution in #118
- @aakashapoorv made their first contribution in #179
- @UsingtcNower made their first contribution in #194
- @Jokeren made their first contribution in #217
- @gau-nernst made their first contribution in #223
- @janeyx99 made their first contribution in #245
- @huydhn made their first contribution in #250
- @lancerts made their first contribution in #238
Full Changelog: v0.2.0...v0.2.1
We were able to close about half of the tasks planned for 0.2.0; the rest will spill over into upcoming releases. We will post the task list for 0.3.0 next, which we aim to release at the end of May 2024, and we plan to follow a monthly release cadence until further notice.