08 Sep 17:18

ae8384b

v0.5.0 Latest

Highlights

We are excited to announce the 0.5 release of torchao! This release adds support for memory efficient inference, float8 training and inference, int8 quantized training, HQQ, automatic mixed-precision quantization through bayesian optimization, sparse marlin, and integrations with HuggingFace, SGLang, and diffusers.

Memory Efficient Inference Support #738

We've added support for Llama 3.1 to the llama benchmarks in TorchAO and added new features and improvements as a proof of concept for memory efficient inference. These additions allow us to do 130k context length inference with Llama 3.1-8B using only 18.91 GB of memory when we combine KV cache quantization, int4 weight-only quantization, and a linear causal mask.
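To give a rough sense of why KV cache quantization matters at this context length, here is a back-of-the-envelope sketch that estimates KV cache size from the model shape. The function name is hypothetical, and the default shape values (32 layers, 8 KV heads, head dimension 128) are assumptions based on the public Llama 3.1-8B configuration. The estimate ignores quantization scales, weights, and activations, so it is not expected to reproduce the 18.91 GB figure.

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Estimate KV cache size in bytes for a single sequence.

    Defaults are assumed Llama 3.1-8B shapes; bytes_per_elem=2 models bf16.
    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


context = 130_000
bf16_gb = kv_cache_bytes(context) / 1e9
int8_gb = kv_cache_bytes(context, bytes_per_elem=1) / 1e9
print(f"bf16 KV cache: {bf16_gb:.2f} GB, int8 KV cache: {int8_gb:.2f} GB")
```

Under these assumptions the cache alone is comparable in size to the bf16 weights, which is why shrinking it is worthwhile for long contexts.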

General savings depend on technique and context length as can be seen in the following graph:

Float8 Training #551

torchao.float8 implements training recipes with the scaled float8 dtypes, as laid out in https://arxiv.org/abs/2209.05433.

With torch.compile on, current results show throughput speedups of up to 1.5x on 128 H100 GPU LLaMa 3 70B pretraining jobs (details)

from torchao.float8 import convert_to_float8_training
convert_to_float8_training(m, module_filter_fn=...)

For a minimal end-to-end training recipe for pretraining with float8, check out torchtitan.

Float8 Inference #740 #819

We have introduced two new quantization APIs for Float8 inference:

Float8 Weight-Only Quantization: A new quant_api float8_weight_only() has been added to apply float8 weight-only symmetric per-channel quantization to linear layers.
Float8 Dynamic Activation and Weight Quantization: A new quant_api float8_dynamic_activation_float8_weight() has been introduced to apply float8 dynamic symmetric quantization to both activations and weights of linear layers. PerTensor scaling is used by default. We have also added an option to do PerRow scaling of both activations and weights. Computing scales at a finer granularity can potentially reduce the overall quantization error and increase performance by reducing dynamic quantization overhead.
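The difference between the two granularities can be illustrated with a small pure-Python sketch. It only computes symmetric scales and does not perform real float8 casting. The helper names are hypothetical and this is not torchao's implementation; the only outside fact used is that the largest finite float8 e4m3fn value is 448.

```python
FLOAT8_E4M3_MAX = 448.0  # largest finite value of the float8 e4m3fn format


def per_tensor_scale(matrix):
    """One scale for the whole matrix: the global abs-max maps to the float8 max."""
    amax = max(abs(v) for row in matrix for v in row)
    return amax / FLOAT8_E4M3_MAX


def per_row_scales(matrix):
    """One scale per row: each row's abs-max maps to the float8 max."""
    return [max(abs(v) for v in row) / FLOAT8_E4M3_MAX for row in matrix]


weights = [
    [0.01, -0.02, 0.015],  # small-magnitude row
    [4.0, -8.0, 2.0],      # large-magnitude row dominates the global max
]
print("per-tensor scale:", per_tensor_scale(weights))
print("per-row scales:  ", per_row_scales(weights))
```

With a single per-tensor scale, the small-magnitude row is quantized on a grid sized for the large row. Per-row scales give each row its own grid, which is where the potential reduction in quantization error comes from.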

Example usage:

import torch
from torchao.quantization import quantize_, float8_weight_only, float8_dynamic_activation_float8_weight, PerRow

# Create a model
model = YourModel()

# Apply float8 weight-only quantization
quantize_(model, float8_weight_only())

# Apply float8 dynamic activation and weight quantization
quantize_(model, float8_dynamic_activation_float8_weight())

# Apply PerRow scaling to weight and activations
quantize_(model, float8_dynamic_activation_float8_weight(granularity=PerRow()))

Notes:

These new APIs are designed to work with PyTorch 2.5 and later versions.
float8_dynamic_activation_float8_weight requires CUDA devices with compute capability 8.9 or higher for hardware acceleration.

Int8 quantized training #644 #748

@gau-nernst introduced 2 experimental works on training using INT8.

INT8 quantized training (#644): weight is quantized to INT8 during the whole duration of training to save memory. Compute remains in high precision. To train the model effectively with only quantized weights, we use stochastic rounding for weight update. Right now, memory saving is not too competitive compared to compiled BF16 baseline.
INT8 mixed-precision training (#748): weight is kept in the original high precision, but weight and activation are dynamically quantized to INT8 during training to utilize INT8 tensor cores. We observe up to 70% speedup for Llama2 pre-training on 4090, and 20% speedup for Llama3 pre-training on 8x A100 with FSDP2.
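Stochastic rounding, mentioned above for the weight update, can be sketched in a few lines of pure Python. Round-to-nearest would silently drop updates smaller than half a quantization step, whereas stochastic rounding preserves them in expectation. This is an illustrative sketch with a hypothetical function name, not the code from #644.

```python
import math
import random


def stochastic_round(x, rng=random):
    """Round x down or up at random so that the expected result equals x."""
    low = math.floor(x)
    frac = x - low  # probability of rounding up
    return low + (1 if rng.random() < frac else 0)


rng = random.Random(0)
samples = [stochastic_round(2.3, rng) for _ in range(10_000)]
print("mean of rounded samples:", sum(samples) / len(samples))
```

Each individual sample is either 2 or 3, but the average stays close to 2.3, so many small updates accumulate correctly over training steps.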

from torchao.quantization import quantize_
from torchao.prototype.quantized_training import int8_weight_only_quantized_training, int8_mixed_precision_training

model = YourModel()

# apply INT8 quantized training
quantize_(model, int8_weight_only_quantized_training())

# apply INT8 mixed-precision training
quantize_(model, int8_mixed_precision_training())

For more information and benchmark results, see README and the respective PR (#644 and #748)

HQQ Integration in torchao #605 #786

HQQ has been added to existing torchao APIs. It improves model accuracy and leverages the existing efficient kernels in torchao. We enabled HQQ for the int4_weight_only API:

quantize_(model, int4_weight_only(group_size, use_hqq=True))

We also added this to the uintx api for accuracy experiments (current uintx kernels are slow):

quantize_(model, uintx_weight_only(torch.uint2, group_size, use_hqq=True))

Automatic Mixed-Precision Quantization through Bayesian Optimization #592, #694

We provided a Bayesian Optimization (BO) tool leveraging Ax to auto search mixed-precision weight-only quantization configuration, i.e., bit width and group size of intN_weight_only(bit_width, group_size) for each layer. It also includes a sensitivity analysis tool to calculate layer-wise average Hessian trace and average fisher information matrix trace, which is an optional step to customize and improve BO search.
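To make the search space concrete, the sketch below estimates the storage cost of one candidate per-layer configuration. The function name, the layer shapes, and the metadata layout (a 2-byte scale plus a 2-byte zero point per group) are all hypothetical assumptions for illustration; the actual tool may account for size differently.

```python
def quantized_size_bytes(n_params, bit_width, group_size, meta_bytes_per_group=4):
    """Estimate storage for one weight-only quantized layer.

    Assumes (hypothetically) packed integer weights plus a 2-byte scale and a
    2-byte zero point per group, i.e. 4 metadata bytes per group.
    """
    n_groups = -(-n_params // group_size)  # ceiling division
    return n_params * bit_width / 8 + n_groups * meta_bytes_per_group


# Hypothetical per-layer candidate: (parameter count, bit width, group size).
config = [
    (4096 * 4096, 8, 128),
    (4096 * 4096, 4, 64),
    (4096 * 14336, 2, 32),
]
total_gb = sum(quantized_size_bytes(*layer) for layer in config) / 1e9
print(f"estimated size: {total_gb:.4f} GB")
```

A search over such configurations trades lower bit widths and larger groups, which shrink the model, against the accuracy lost in the more sensitive layers.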

To optimize for model accuracy under a model size constraint (GB):

python BO_acc_modelsize.py --checkpoint=/tmp/Meta-Llama-3-8B --model_size_constraint=6.0

To optimize for inference throughput under a model perplexity constraint:

python BO_acc_throughput.py --checkpoint=/tmp/Meta-Llama-3-8B --ppl_constraint=7.5

For more detailed usage, please refer to this README. Compared to int8 uniform quantization on the Llama3-8B model, the mixed-precision quantization found by this tool reduces model size by 20.1% with a 2.8% perplexity reduction, and improves inference throughput by 15.1% with a 3.2% perplexity reduction.

Sparse Marlin #621, #733

@Diogo-V added support for sparse-marlin, a W4AFP16 2:4 sparse kernel, to TorchAO.
On Meta Llama3, we observe a 25% tok/s increase (180 -> 226) compared to our existing int4-wo implementation.

from torchao.quantization.quant_api import quantize_, int4_weight_only
from torchao.dtypes import MarlinSparseLayoutType
quantize_(model, int4_weight_only(layout_type=MarlinSparseLayoutType()))

Model	Technique	Tokens/Second	Memory Bandwidth (GB/s)	Peak Memory (GB)	Model Size (GB)
Llama-3-8B	Base (bfloat16)	95.64	1435.54	16.43	15.01
	int8dq	8.61	64.75	9.24	7.52
	int8wo	153.03	1150.80	10.42	7.52
	int4wo-64	180.80	763.33	6.88	4.22
	int4wo-64-sparse-marlin	226.02	689.20	5.32	3.05

HuggingFace Integration

torchao is integrated into HuggingFace (https://huggingface.co/docs/transformers/main/en/quantization/torchao). You can now use int4_weight_only, int8_weight_only and int8_dynamic_activation_int8_weight through TorchAoConfig in HuggingFace. This is currently available in the HuggingFace main branch only.

SGLang Integration

torchao is also integrated into SGLang (sgl-project/sglang#1341) for Llama3 models. You can try it out with:

python3 -m sglang.bench_latency --model meta-llama/Meta-Llama-3-8B --batch-size 1 --input 128 --output 8 --torchao-config int4wo-128

Supported configurations are ["int4wo-<group_size>", "int8wo", "int8dq", "fp8wo" (only available in torchao 0.5+)]
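The configuration strings follow a simple pattern, which the sketch below parses into a technique name and an optional group size. This is a hypothetical helper written only from the list above; it is not SGLang's actual parsing code.

```python
def parse_torchao_config(config):
    """Split a config string into (technique, group_size or None)."""
    fixed = {"int8wo", "int8dq", "fp8wo"}
    if config in fixed:
        return config, None
    prefix = "int4wo-"
    if config.startswith(prefix) and config[len(prefix):].isdigit():
        return "int4wo", int(config[len(prefix):])
    raise ValueError(f"unsupported torchao config: {config!r}")


print(parse_torchao_config("int4wo-128"))
print(parse_torchao_config("int8dq"))
```

Only int4wo carries a group size suffix; the other techniques are passed as bare names.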

diffusers Integration

diffusers-torchao provides end-to-end inference and experimental training recipes to use torchao with diffusers in this repo. We demonstrate 53.88% speedup on Flux.1-Dev* and 27.33% speedup on CogVideoX-5b when comparing compiled quantized models against their standard bf16 counterparts.

BC Breaking

Add layout option to woq int4 api #670

# torchao 0.4.0
from torchao.quantization import quantize_, int4_weight_only
quantize_(my_model, int4_weight_only(inner_k_tiles=8))

# torchao 0.5.0
from torchao.quantization import quantize_, int4_weight_only
quant...

Contributors

raziel, crcrpar, and 9 other contributors

07 Aug 16:48

jcaip

v0.4.0

245ab4e

Highlights

We are excited to announce the 0.4 release of torchao! This release adds support for KV cache quantization, quantization aware training (QAT), low bit optimizer support, composing quantization and sparsity, and more!

KV cache quantization (#532)

We've added support for KV cache quantization, showing a peak memory reduction from 19.7 -> 19.2 GB on Llama3-8B at an 8192 context length. We plan to investigate Llama3.1 next.

Quantization-Aware Training (QAT) (#383, #555)

We now support two QAT schemes for linear layers: Int8 per token dynamic activations + int4 per group weights, and int4 per group weights (using the efficient tinygemm int4 kernel after training). Users can access this feature by transforming their models before and after training using the appropriate quantizer, for example:

from torchao.quantization.prototype.qat import Int8DynActInt4WeightQATQuantizer

# Quantizer for int8 dynamic per token activations +
# int4 grouped per channel weights, only for linear layers
qat_quantizer = Int8DynActInt4WeightQATQuantizer()

# Insert "fake quantize" operations into linear layers.
# These operations simulate quantization numerics during
# training without performing any dtype casting
model = qat_quantizer.prepare(model)

# Convert fake quantize to actual quantize operations
model = qat_quantizer.convert(model)
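What a "fake quantize" operation does numerically can be sketched in pure Python: values are snapped to an integer grid and immediately mapped back to floating point, so no dtype casting takes place. The symmetric per-group scheme and the function name below are assumptions for illustration, not the exact numerics used by the quantizer above.

```python
def fake_quantize_group(values, n_bits=4):
    """Quantize a group to signed n-bit integers and immediately dequantize.

    The output stays in floating point but only takes values on the integer
    grid, which mimics quantization error during training.
    """
    qmin, qmax = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1
    amax = max(abs(v) for v in values)
    scale = amax / qmax if amax > 0 else 1.0
    quantized = [min(max(round(v / scale), qmin), qmax) for v in values]
    return [q * scale for q in quantized]


group = [0.7, -0.35, 0.1, 0.0]
print(fake_quantize_group(group))
```

Training against these rounded values lets the model adapt to the quantization error before the real conversion step.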

Initial evaluation results indicate that QAT in torchao can recover up to 96% of quantized accuracy degradation on hellaswag and up to 68% of quantized perplexity degradation on wikitext for Llama3 compared to post-training quantization (PTQ). For more details, please refer to the README and this blog post.

Composing quantization and sparsity (#457, #473)

We've added support for composing int8 dynamic quantization with 2:4 sparsity, using the quantize_ API. We also added SAM benchmarks that show a 7% speedup over standalone sparsity / int8 dynamic quantization here.

from torchao.quantization import quantize_, int8_dynamic_activation_int8_semi_sparse_weight
quantize_(model, int8_dynamic_activation_int8_semi_sparse_weight())
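For readers unfamiliar with the 2:4 pattern, the sketch below prunes a row so that every group of four values keeps only two non-zeros. Selecting by magnitude is an assumption (it is a common choice), and the function name is hypothetical; the accelerated kernels operate on compressed tensors rather than Python lists.

```python
def prune_2_of_4(row):
    """Zero the two smallest-magnitude values in every group of four."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    pruned = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


print(prune_2_of_4([0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.1]))
```

Because exactly half of each group is zero, the hardware can store only the kept values plus small position metadata, and the remaining values can still be quantized to int8.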

Community Contributions

low-bit optimizer support (#478, #463, #482, #484, #538)

@gau-nernst added implementations for 4-bit, 8-bit, and FP8 Adam with FSDP2/FSDP support. Our API is a drop-in replacement for torch.optim.Adam and can be used as follows:

from torchao.prototype.low_bit_optim import Adam8bit, Adam4bit, AdamFp8
from torchao.prototype.low_bit_optim import AdamW8bit, AdamW4bit, AdamWFp8


model = ...
optim = Adam8bit(model.parameters()) # replace with Adam4bit and AdamFp8 for the 4 / fp8 versions

For more information about low bit optimizer support please refer to our README.

Improvements to 4-bit quantization (#517, #552, #544, #479 )

@bdhirsh @jeromeku @yanbing-j @manuelcandales @larryliu0820 added torch.compile support for NF4 Tensor, custom CUDA int4 tinygemm unpacking ops, and several bugfixes to torchao

BC breaking

quantize has been renamed to quantize_ #467

# for torchao 0.4
from torchao.quantization import quantize_, int8_weight_only
quantize_(model, int8_weight_only())

# for torchao 0.3
from torchao.quantization import quantize, int8_weight_only
quantize(model, int8_weight_only())

apply_sparse_semi_structured has been deprecated in favor of sparsify_ which matches the quantize_ API #473

# for torchao 0.4
from torchao.sparsity import sparsify_, semi_sparse_weight
sparsify_(model, semi_sparse_weight())

# for torchao 0.3
from torchao.sparsity import apply_sparse_semi_structured
apply_sparse_semi_structured(model)

Deprecations

New Features

Added kv_cache quantization #532
Migrated float8_experimental to torchao.float8, enabling float8 training support #551 #529
Added FP5 E2M2 #399
Added 4-bit, 8-bit, and FP8 ADAM support #478 #463 #482
Added FSDP2 support for low-bit optimizers #484
[prototype] mixed-precision quantization and eval framework #531
Added int4 weight-only QAT support #555, #383
Added custom CUDA tinygemm unpacking ops #415

Improvements

Composing quantization and sparsity now uses the unified AQT Layout #498
Added default inductor config settings #423
Better dtype and device handling for Int8DynActInt4WeightQuantizer and Int4WeightOnlyQuantizer #475 #479
Enable model.to for int4/int8 weight only quantized models #486 #522
Added more logging to TensorCoreTiledAQTLayout #520
Added general fake_quantize_affine op with mask support #492 #500
QAT now uses the shared fake_quantize_affine primitive #527
Improve FSDP support for low-bit optimizers #538
Custom op and inductor decomp registration now uses a decorator #434
Updated torch version to no longer require unwrap_tensor_subclass #595

Bug fixes

Fixed import for TORCH_VERSION_AFTER_* #433
Fixed crash when PYTORCH_VERSION is not defined #455
Added torch.compile support for NF4Tensor #544
Added fbcode check to fix torchtune in Genie #480
Fixed int4pack_mm error #517
Fixed cuda device check #536
Weight shuffling now runs on CPU for int4 quantization due to a MPS memory issue #552
Scale and input now are the same dtype for int8 weight only quantization #534
Fixed FP6-LLM API #595

Performance

Added segment-anything-fast benchmarks for composed quantization + sparsity #457
Updated low-bit Adam benchmark #481

Docs

Updated README.md #583 #438 #445 #460
Updated installation instructions #447 #459
Added more docs for int4_weight_only API #469
Added developer guide notebook #588
Added optimized model serialization/deserialization doc #524 #525
Added new float8 feature tracker #557
Added static quantization tutorial for calibration-based techniques #487

Devs

Fix numpy version in CI #537
trymerge now uploads merge records to s3 #448
Updated python version to 3.9 #488
torchao no longer depends on torch #449
benchmark_model now accepts args and kwargs and supports cpu and mps backends #586 #406
Add git version suffix to package name #547
Added validations to torchao #453 #454
Parallel test support with pytest-xdist #518
Quantizer now uses logging instead of print #472

Not user facing

Refactored _replace_linear_8da4w #451
Remove unused code from AQT implementation #476 #440 #441 #471
Improved error message for lm_eval script #444
Updated HF_TOKEN env variable #427
Fixed typo in Quant-LLM in #450
Add a test for map_location="cpu" in #497
Removed sparse test collection warning #489
Refactored layout imple...

Contributors

jeromeku, larryliu0820, and 9 other contributors

26 Jun 20:36

supriyar

v0.3.0

a2ba345

v0.3.1

Highlights

We are excited to announce the 0.3 release of torchao! This release adds support for a new quantize API, MX format, FP6 dtype and bitpacking, 2:4 sparse accelerated training and benchmarking infra for llama2/llama3 models.

`quantize` API (#256)

We added a tensor subclass based quantization API; see the docs and README for details on usage. It is planned to replace all existing quantization APIs in torchao for torch 2.4 and later.

Accelerated training with 2:4 sparsity (#184)

You can now accelerate training with 2:4 sparsity, using the runtime pruning + compression kernels written by xFormers. These kernels process a 4x4 sub-tile to be 2:4 sparse in both directions, to handle both the forward and backward pass when training. We see a 1.3x speedup for the MLP layers of ViT-L across a forward and backwards pass.

MX support (#264)

We added prototype support for the MX format for training and inference, with a reference native PyTorch implementation of the training and inference primitives for MX accelerated matrix multiplications. The MX numerical formats are new low-precision formats that were recently accepted into the OCP spec:
https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf
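The core idea of the MX formats, as we read the spec, is that a small block of elements shares one power-of-two scale. The sketch below computes such a scale for one block. It is a simplified illustration, not the torchao prototype: the block size of 32, the float8 e4m3 element format, and the function name are assumptions made for this example.

```python
import math

E4M3_EMAX = 8  # exponent of the largest normal float8 e4m3 value (448 = 1.75 * 2**8)


def mx_shared_scale(block, elem_emax=E4M3_EMAX):
    """Return a power-of-two scale shared by every element in the block."""
    amax = max(abs(v) for v in block)
    if amax == 0:
        return 1.0
    shared_exp = math.floor(math.log2(amax)) - elem_emax
    return 2.0 ** shared_exp


block = [0.5 * (i - 16) for i in range(32)]  # one block of 32 values
scale = mx_shared_scale(block)
scaled = [v / scale for v in block]
print("shared scale:", scale, "largest scaled magnitude:", max(abs(v) for v in scaled))
```

After dividing by the shared scale, the largest element lands in the top binade of the element format. Values that still exceed the element maximum would need to be clamped when cast.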

Benchmarking (#276, #374)

We added a stable way to benchmark llama2 and llama3 models that includes perf/accuracy comparisons. See torchao/_models/llama/benchmarks.sh for more details.

🌟 💥 Community Contributions 🌟 💥

FP6 support (#279, #283, #358)

@gau-nernst Added support for FP6 dtype and mixed matmul FP16 x FP6 kernel with support for torch.compile. Benchmark results show a 2.3x speedup over BF16 baseline for meta-llama/Llama-2-7b-chat-hf

Bitpacking (#307, #282)

@vayuda, @melvinebenezer @CoffeeVampir3 @andreaskoepf Added support for packing/unpacking lower bit dtypes leveraging torch.compile to generate the kernels for this and added UInt2 and Bitnet tensor based on this approach.
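The packing idea itself is independent of torch.compile and can be shown in pure Python: four 2-bit values fit in one byte. The bit order (first value in the low bits) and the function names are assumptions for this sketch and may differ from the layout used in torchao.

```python
def pack_uint2(values):
    """Pack 2-bit values (0..3) four per byte, first value in the low bits."""
    if len(values) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    packed = bytearray()
    for start in range(0, len(values), 4):
        byte = 0
        for offset, v in enumerate(values[start:start + 4]):
            if not 0 <= v <= 3:
                raise ValueError("values must fit in 2 bits")
            byte |= v << (2 * offset)
        packed.append(byte)
    return bytes(packed)


def unpack_uint2(packed):
    """Inverse of pack_uint2: recover four 2-bit values from each byte."""
    return [(byte >> (2 * offset)) & 0b11 for byte in packed for offset in range(4)]


data = [3, 0, 1, 2, 2, 2, 0, 1]
assert unpack_uint2(pack_uint2(data)) == data
print(pack_uint2(data).hex())
```

Eight values occupy two bytes here instead of eight, which is the 4x storage saving that makes sub-byte dtypes attractive.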

FP8 split-gemm kernel #263

Added the kernel written by @AdnanHoque to torchao with speedups compared to the cuBLAS kernel for batch size <=16

BC Breaking

Deprecations

Deprecate top level quantization APIs #344

1. int8 weight only quantization

apply_weight_only_int8_quant(model) or change_linear_weights_to_int8_woqtensors(model)

-->

# for torch 2.4+
from torchao.quantization import quantize, int8_weight_only
quantize(model, int8_weight_only())

# for torch 2.2.2 and 2.3
from torchao.quantization.quant_api import change_linear_weights_to_int8_woqtensors
change_linear_weights_to_int8_woqtensors(model)

2. int8 dynamic quantization

apply_dynamic_quant(model) or change_linear_weights_to_int8_dqtensors(model)

-->

# Fuse the int8*int8 -> int32 matmul and subsequent mul op avoiding materialization of the int32 intermediary tensor
torch._inductor.config.force_fuse_int_mm_with_mul = True

# for torch 2.4+
from torchao.quantization import quantize, int8_dynamic_activation_int8_weight
quantize(model, int8_dynamic_activation_int8_weight())

# for torch 2.2.2 and 2.3
from torchao.quantization.quant_api import change_linear_weights_to_int8_dqtensors
change_linear_weights_to_int8_dqtensors(model)

3. int4 weight only quantization

change_linear_weights_to_int4_woqtensors(model)

-->

# for torch 2.4+
from torchao.quantization import quantize, int4_weight_only
quantize(model, int4_weight_only())

# for torch 2.2.2 and 2.3
from torchao.quantization.quant_api import change_linear_weights_to_int4_woqtensors
change_linear_weights_to_int4_woqtensors(model)

New Features

Add quantize #256
Add a prototype of MX format training and inference #264
[FP6-LLM] Port splitK map from DeepSpeed #283
Improve FP6-LLM 2+4bit weight splitting + user API #279
Bitpacking #291
training acceleration via runtime semi-structured sparsity #184
Bitpackingv2 #307
Add FP6-LLM doc and move FP6-LLM to prototype #358
Added first bits of Uint2Tensor and BitnetTensor #282

Improvements

Improve primitives for FP6 quant #248
Extract eval code from GPTQ for more general usage #275
Factor out the specific configurations to helper functions #286
Add support for AQTLayout, PlainAQTLayout and TensorCoreTiledAQTLayout #278
Graceful handling of cpp extensions #296
Refactor int8 dynamic quantization with call to quantize #294
[NF4][FSDP] return contiguous quantization_factor #298
Refactor int4 and int8 weight only quantization to use quantize #301
Adding a quick way for users to test model eval for hf models #328
Wrap torch.ops.quantized_decomposed to improve import errors #310
[NF4Tensor] Switch to save for backward since are now a tensor input #323
Refactor rest of tinygemm quant primitive ops #321
Move some util functions from quantization.utils to torchao.utils #337
Clean up FP6-LLM #304
Move quant ops to utils.py #331
FP6-LLM clean up (again) #339
Improving hf_eval.py #342
Generalize Model Size Code #364
Minor upgrades to bit pack #347
Factor out dispatch and layout registration table #360
Add register_apply_tensor_subclass #366
Refactor custom FPx cast #363
Remove all dependencies except torch #369
Enable a test for loading state_dict with tensor subclasses #389
073 scripts for benchmarks #372
Add WOQ int8 test with Inductor Freeze #362
Benchmarking updates for semi-structured sparse training #398
add FSDP QLoRA test and revert failing PR #403
Refactor the API for quant method argument for quantize function #400
eval script fixes #414

Bug Fixes

Fixed the HQQ import skip #262
fixing autoquant bug #265
Fix eval import after #275 #290
Fixed f-string printing of NF4Tensors #297
Check and fix dequantize_affine is idempotent #309
Update old pretrained TorchVision API in ao tutorials (#313) #314
Fix dimension issues for int4 weight only quant path #330
Fix compile in hf_eval.py #341
task_list to tasks in hf_eval #343
fixing peak memory stats for benchmark #353
Fix inductor config BC change #382
fixing scripts #395

Performance

FP8 splitgemm user defined triton kernel #263
sparse benchmarking numbers #303
Fix FP6-LLM benchmark #312
Adding Llama to TorchAO #276
Generalize Model Size Code #364
eval script for llama #374
077 autoquant gpt fast #361

Docs

add static folder for images + fix links #271
Fix Readme and remove unused kernel #270
Kernel docs #274
Quantization Docstrings #273
Add AffineQuantizedTensor based workflow doc and examples #277
Add AUTOQUANT_CACHE docs for reusing the same quantization plan #329
Update nightly build instructions #334
add link to benchmarking script #355
New README #392
Minor README updates #401
Add quantize to ...

Contributors

kit1980, vkuzo, and 12 other contributors

20 May 20:52

cpuhrsch

v0.2.0

f0f00ce

What's Changed

Highlights

Custom CPU/CUDA extension to ship CPU/CUDA binaries.

PyTorch core has recently shipped a new custom op registration mechanism with torch.library. The benefit is that custom ops will compose with as many PyTorch subsystems as possible, most notably NOT graph breaking with torch.compile().

We've added some documentation on how to register your own custom ops at https://github.com/pytorch/ao/tree/main/torchao/csrc, and if you learn better by example you can follow PR #135 to add your own custom ops to torchao.

Most notably, these instructions were leveraged by @gau-nernst to integrate some new custom ops for fp6 support #223

One key benefit of integrating your kernels in torchao directly is that, thanks to our manylinux GPU support, we can ensure that the CPU/CUDA kernels you've added will work on as many devices and CUDA versions as possible #176

A lot of prototype and community contributions

@jeromeku was our community champion merging support for

GaLore, our first pretraining kernel, which allows you to finetune llama 7b on a single 4090 card with up to 70% speedups relative to eager PyTorch
DoRA, which has been shown to yield better fine-tuning accuracy than QLoRA. This is an area where the community can help us benchmark more thoroughly https://github.com/pytorch/ao/tree/main/torchao/prototype/dora
Fused int4/fp16 quantized matmul, which is particularly useful for compute-bound kernels, showing 4x speedups over tinygemm for larger batch sizes such as 512 https://github.com/pytorch/ao/tree/main/torchao/prototype/hqq

@gau-nernst merged fp6 support showing up to 8x speedups on an fp16 baseline for small batch size inference #223

NF4 support for upcoming FSDP2

@weifengpy merged support for composing FSDP2 with NF4 which makes it easy to implement algorithms like QLoRA + FSDP without writing any CUDA or C++ code. This work also provides a blueprint for how to compose smaller dtypes with FSDP #150 most notably by implementing torch.chunk(). We hope the broader community uses this work to experiment more heavily at the intersection of distributed and quantization research and inspires many more studies such as the ones done by Answer.ai https://www.answer.ai/posts/2024-03-06-fsdp-qlora.html

BC breaking

Deprecations

New Features

Match autoquant API with torch.compile (#109, #162, #175)
[Prototype] 8da4w QAT (#138, #199, #198, #211, #154, #157, #229)
[Prototype] GaLore (#95)
[Prototype] DoRA (#216)
[Prototype] HQQ (#153, #185)
[Prototype] 2:4 sparse + int8 sparse subclass (#36)
[Prototype] Unified quantization primitives (#159, #201, #193, #220, #227, #173, #210)
[Prototype] Pruning primitives (#148, #194)
[Prototype] AffineQuantizedTensor subclass (#214, #230, #243, #247, #251)
[Prototype] Add Int4WeightOnlyQuantizer (#119)
Custom CUDA extensions (#135, #186, #232)
[Prototype] Add FP6 Linear (#223)

Improvements

FSDP2 support for NF4Tensor (#118, #150, #207)
Add save/load of int8 weight only quantized model (#122)
Add int_scaled_mm on CPU (#121)
Add cpu and gpu in int4wo and int4wo-gptq quantizer (#131)
Add torch.export support to int8_dq, int8_wo, int4_wo subclasses (#146, #226, #213)
Remove is_gpt_fast specialization from GPTQ (#172)
Common benchmark and profile utils (#238)

Bug fixes

Fix padding in GPTQ (#119, #120)
Fix Int8DynActInt4WeightLinear module swap (#151)
Fix NF4Tensor.to to use device kwarg (#158)
Fix quantize_activation_per_token_absmax perf regression (#253)

Performance

Chunk NF4Tensor construction to reduce memory spike (#196)
Fix intmm benchmark script (#141)

Docs

Update READMEs (#140, #142, #169, #155, #179, #187, #188, #200, #217, #245)
Add https://pytorch.org/ao (#136, #145, #163, #164, #165, #168, #177, #195, #224)

CI

Add A10G support in CI (#176)
General CI improvements (#161, #171, #178, #180, #183, #107, #215, #244, #257, #235, #242)
Add expecttest to requirements.txt (#225)
Push button binary support (#241, #240, #250)

Not user facing

Security

Untopiced

Version bumps (#125, #234)
Don't import _C in fbcode (#218)

New Contributors

@Xia-Weiwen made their first contribution in #121
@jeromeku made their first contribution in #95
@weifengpy made their first contribution in #118
@aakashapoorv made their first contribution in #179
@UsingtcNower made their first contribution in #194
@Jokeren made their first contribution in #217
@gau-nernst made their first contribution in #223
@janeyx99 made their first contribution in #245
@huydhn made their first contribution in #250
@lancerts made their first contribution in #238

Full Changelog: v0.2.0...v0.2.1

We were able to close about half of tasks for 0.2.0, which will now spill over into upcoming releases. We will post a list for 0.3.0 next, which we aim to release at the end of May 2024. We want to follow a monthly release cadence until further notice.

Contributors

huydhn, Jokeren, and 8 other contributors

04 Apr 23:18

jerryzh168

v0.1

ff28895

TorchAO 0.1.0: First Release

Highlights

We’re excited to announce the release of TorchAO v0.1.0! TorchAO is a repository to host architecture optimization techniques such as quantization and sparsity, and performance kernels on different backends such as CUDA and CPU. In this release, we added support for a few quantization techniques such as int4 weight-only GPTQ quantization, added nf4 dtype support for QLoRA, and added sparsity features such as WandaSparsifier. We also added an autotuner that can tune Triton integer matrix multiplication kernels on CUDA.

Note: TorchAO is currently in a pre-release state and under extensive development. The public APIs should not be considered stable. But we welcome you to try out our APIs and offerings and provide any feedback on your experience.

torchao 0.1.0 will be compatible with PyTorch 2.2.2 and 2.3.0, ExecuTorch 0.2.0 and TorchTune 0.1.0.

New Features

Quantization

Added tensor subclass based quantization APIs: change_linear_weights_to_int8_dqtensors, change_linear_weights_to_int8_woqtensors and change_linear_weights_to_int4_woqtensors (#1)
Added module based quantization APIs for int8 dynamic and weight only quantization apply_weight_only_int8_quant and apply_dynamic_quant (#1)
Added module swap version of int4 weight only quantization Int4WeightOnlyQuantizer and Int4WeightOnlyGPTQQuantizer used in TorchTune (#119, #116)
Added int8 dynamic activation and int4 weight quantization Int8DynActInt4WeightQuantizer and Int8DynActInt4WeightGPTQQuantizer, used in ExecuTorch (#74) (available in torch 2.3.0 and later)

Sparsity

Added WandaSparsifier that prunes both weights and activations (#22)

Kernels

Added autotuner for int mm Triton kernels (#41)

dtypes

nf4 tensor subclass and nf4 linear (#37, #40, #62)
Added uint4 dtype tensor subclass (#13)

Improvements

Setup github workflow for regression testing (#50)
Setup github workflow for torchao-nightly release (#54)

Documentation

Added tutorials for quantizing vision transformer model (#60)
Added tutorials for how to add an op for nf4 tensor (#54)

Notes

We are still debugging the accuracy problem for Int8DynActInt4WeightGPTQQuantizer
Save and load does not work well for tensor subclass based APIs yet
We will consolidate tensor subclass and module swap based quantization APIs later
uint4 tensor subclass is going to be merged into pytorch core in the future
Quantization ops in quant_primitives.py will be deduplicated with similar quantize/dequantize ops in PyTorch later

Releases: pytorch/ao
