Releases: facebookresearch/xformers
`v0.0.28.post3` - build for PyTorch 2.5.1
[0.0.28.post3] - 2024-10-30
Pre-built binary wheels require PyTorch 2.5.1
`v0.0.28.post2` - build for PyTorch 2.5.0
[0.0.28.post2] - 2024-10-18
Pre-built binary wheels require PyTorch 2.5.0
`0.0.28.post1` - fixing upload for cuda 12.4 wheels
[0.0.28.post1] - 2024-09-13
Properly upload wheels for cuda 12.4
FAv3, profiler update & AMD
Pre-built binary wheels require PyTorch 2.4.1
Added
- Added wheels for cuda 12.4
- Added conda builds for python 3.11
- Added wheels for rocm 6.1
Improved
- Profiler: Fix computation of FLOPS for the attention when using xFormers
- Profiler: Fix MFU/HFU calculation when multiple dtypes are used
- Profiler: Trace analysis to compute MFU & HFU is now much faster
- fMHA/splitK: Fixed `nan` in the output when using a `torch.Tensor` bias where a lot of consecutive keys are masked with `-inf`
- Update Flash-Attention version to `v2.6.3` when building from scratch
- When using the most recent version of Flash-Attention, it is no longer possible to mix it with the cutlass backend. In other words, it is no longer possible to use the cutlass Fw with the flash Bw.
Removed
- fMHA: Removed `decoder` and `small_k` backends
- profiler: Removed `DetectSlowOpsProfiler` profiler
- Removed compatibility with PyTorch < 2.4
- Removed conda builds for python 3.9
- Removed windows pip wheels for cuda 12.1 and 11.8
torch.compile support, bug fixes & more
Pre-built binary wheels require PyTorch 2.4.0
Added
- fMHA: PagedBlockDiagonalGappyKeysMask
- fMHA: heterogeneous queries in triton_splitk
- fMHA: support for paged attention in flash
- fMHA: Added backwards pass for merge_attentions
- fMHA: Added torch.compile support for 3 biases (LowerTriangularMask, LowerTriangularMaskWithTensorBias and BlockDiagonalMask) - some might require PyTorch 2.4
- fMHA: Added torch.compile support in memory_efficient_attention when passing the flash operator explicitly (e.g. memory_efficient_attention(..., op=(flash.FwOp, flash.BwOp))); see the sketch after this list
- fMHA: memory_efficient_attention now expects its attn_bias argument to be on the same device as the other input tensors. Previously, it would convert the bias to the right device.
- fMHA: AttentionBias subclasses are now constructed by default on the cuda device if available - they used to be created on the CPU device
- 2:4 sparsity: Added xformers.ops.sp24.sparsify24_ste for Straight Through Estimator (STE) with options to rescale the gradient differently for masked out/kept values
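A minimal sketch (not part of the release notes) of the torch.compile usage described above, assuming a CUDA GPU and half-precision inputs; shapes are illustrative. It combines the explicitly passed flash operator with one of the three biases that support torch.compile.

```python
# Hedged sketch: compiling memory_efficient_attention with the flash operator
# passed explicitly. Shapes and dtypes are illustrative assumptions.
import torch
import xformers.ops as xops
from xformers.ops import fmha

B, M, H, K = 2, 1024, 8, 64
q = torch.randn(B, M, H, K, device="cuda", dtype=torch.float16)
k = torch.randn(B, M, H, K, device="cuda", dtype=torch.float16)
v = torch.randn(B, M, H, K, device="cuda", dtype=torch.float16)

# LowerTriangularMask is one of the three biases with torch.compile support.
# As of this release it is constructed on the cuda device when available.
bias = fmha.attn_bias.LowerTriangularMask()

def attention(q, k, v):
    # Passing the flash operator explicitly enables torch.compile support.
    return xops.memory_efficient_attention(
        q, k, v, attn_bias=bias, op=(fmha.flash.FwOp, fmha.flash.BwOp)
    )

compiled_attention = torch.compile(attention)
out = compiled_attention(q, k, v)
```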
Improved
- fMHA: Fixed out-of-bounds reading for Split-K triton implementation
- Profiler: fix bug with modules that take a single tuple as argument
- Profiler: Added manual trigger for a profiling step, by creating a trigger file in the profiling directory
Removed
- Removed support for PyTorch version older than 2.2.0
[v0.0.27] torch.compile support, bug fixes & more
Added
- fMHA: `PagedBlockDiagonalGappyKeysMask`
- fMHA: heterogeneous queries in `triton_splitk`
- fMHA: support for paged attention in flash
- fMHA: Added backwards pass for `merge_attentions`
- fMHA: Added `torch.compile` support for 3 biases (`LowerTriangularMask`, `LowerTriangularMaskWithTensorBias` and `BlockDiagonalMask`) - some might require PyTorch 2.4
- fMHA: Added `torch.compile` support in `memory_efficient_attention` when passing the flash operator explicitly (e.g. `memory_efficient_attention(..., op=(flash.FwOp, flash.BwOp))`)
- fMHA: `memory_efficient_attention` now expects its `attn_bias` argument to be on the same device as the other input tensors. Previously, it would convert the bias to the right device.
- fMHA: `AttentionBias` subclasses are now constructed by default on the `cuda` device if available - they used to be created on the CPU device
- 2:4 sparsity: Added `xformers.ops.sp24.sparsify24_ste` for Straight Through Estimator (STE) with options to rescale the gradient differently for masked out/kept values
Improved
- fMHA: Fixed out-of-bounds reading for Split-K triton implementation
- Profiler: fix bug with modules that take a single tuple as argument
- Profiler: Added manual trigger for a profiling step, by creating a `trigger` file in the profiling directory
Removed
- Removed support for PyTorch version older than 2.2.0
2:4 sparsity & `torch.compile`-ing memory_efficient_attention
Pre-built binary wheels require PyTorch 2.3.0
Added
- [2:4 sparsity] Added support for Straight-Through Estimator for `sparsify24` gradient (`GRADIENT_STE`)
- [2:4 sparsity] `sparsify24_like` now supports the cuSparseLt backend, and the STE gradient
- Basic support for `torch.compile` for the `memory_efficient_attention` operator. Currently only supports Flash-Attention, and without any bias provided. We want to expand this coverage progressively.
Improved
- merge_attentions no longer needs inputs to be stacked.
- fMHA: triton_splitk now supports additive bias
- fMHA: benchmark cleanup
`v0.0.25.post1`: Building binaries for PyTorch 2.2.2
Pre-built binary wheels require PyTorch 2.2.2
2:4 sparsity, fused sequence parallel, torch compile & more
Pre-built binary wheels require PyTorch 2.2.0
Added
- Added components for model/sequence parallelism, as near-drop-in replacements for FairScale/Megatron ColumnParallelLinear and RowParallelLinear modules. They support fusing communication and computation for sequence parallelism, thus making the communication effectively free.
- Added kernels for training models with 2:4-sparsity. We introduced a very fast kernel for converting a matrix A into 2:4-sparse format, which can be used during training to dynamically sparsify weights, activations, etc. xFormers also provides an API that is compatible with torch.compile, see `xformers.ops.sparsify24` and the sketch below.
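A minimal sketch of dynamic weight sparsification with `xformers.ops.sparsify24`, assuming a CUDA GPU, half-precision weights, and that `sparsify24` accepts a dense 2D tensor and returns a 2:4-sparse tensor usable directly in a matmul; the shapes, the `Sparse24Linear` helper, and the single-argument call are illustrative assumptions, not taken from the release notes.

```python
# Hedged sketch: re-sparsifying a weight to 2:4 format at every forward pass,
# so the sparsity pattern tracks the weight as it is updated during training.
import torch
import torch.nn.functional as F
import xformers.ops as xops

class Sparse24Linear(torch.nn.Module):  # hypothetical helper, for illustration
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = torch.nn.Parameter(
            torch.empty(out_features, in_features, device="cuda", dtype=torch.float16)
        )
        torch.nn.init.kaiming_uniform_(self.weight)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Convert the dense weight to 2:4-sparse format on the fly; the result
        # is then used like a regular tensor in the matmul below.
        w_sparse = xops.sparsify24(self.weight)
        return F.linear(x, w_sparse)

x = torch.randn(128, 4096, device="cuda", dtype=torch.float16)
out = Sparse24Linear(4096, 4096)(x)
```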
Improved
- Make selective activation checkpointing compatible with torch.compile.
Removed
- Triton kernels now require a GPU with compute capability 8.0 at least (A100 or newer). This is due to newer versions of triton not supporting older GPUs correctly
- Removed support for PyTorch version older than 2.1.0