Releases: NeedsMoar/flash-attention-2-builds
Flash Attention 2 - Torch 2.1.1+cu121 (Ampere and Ada consumer and workstation precompiled only)
Just in case someone is mixing Ampere and Ada cards and doesn't care about PTX (with Torch and Flash Attention versions moving so fast, it's probably not worth caring about, but being able to run one of the other builds on a new GPU might still be useful for somebody).
This is just the compiled sm_86 and sm_89 code, built with --extra-device-vectorization plus the same -O3 and --fast-math flags Flash Attention normally uses, so it might be a little faster than the other sm_89 build.
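For reference, here's a rough sketch of what the arch and optimization flags for a dual-arch build like this look like, written as the kind of flag list flash-attn's setup.py hands to nvcc (my reconstruction, not copied from the actual build; --use_fast_math is nvcc's spelling of the fast-math flag mentioned above):

```python
# Hedged sketch of nvcc flags for a dual-arch (sm_86 + sm_89) flash-attn build.
nvcc_flags = [
    "-O3",
    "--use_fast_math",               # nvcc's fast-math switch
    "--extra-device-vectorization",
    "-gencode", "arch=compute_86,code=sm_86",  # native SASS for Ampere consumer/workstation
    "-gencode", "arch=compute_89,code=sm_89",  # native SASS for Ada Lovelace
    # Adding "-gencode", "arch=compute_89,code=compute_89" would also embed PTX so future
    # GPUs could JIT the kernels, at the cost of the size/build-time blowup described
    # in the Notice further down the page.
]
```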
Flash Attention 2 - Torch 2.1.1+cu121 (Ampere: sm_80 A100, and sm_86 3000 series, A4000-A6000, etc.) Python 3.10 and 3.11 wheels for Win64
Flash Attention 2.3.6 - Torch 2.1.1+cu121 Ampere (sm_80, sm_86) Wheels for Python 3.10 and 3.11
Initial uploads. A100 (compute_80 / sm_80: PTX + compiled for Python 3.10, just PTX for 3.11 after I saw how big that got) and other Ampere (sm_86, compiled, with heavier vectorization optimizations enabled on the Python 3.11 version).
Note: Although this should be compatible with anything newer than the A100 because of the PTX portion of the build, those with Ada Lovelace cards should grab the single-arch sm_89 / compute_89 wheels; aside from being under half the size, they contain prebuilt code optimized for those GPUs.
It's recommended to enable lazy loading via the environment variable mentioned in the main README; that should keep the JIT compile times from being nasty if you have something odd like an sm_87 device or NVIDIA comes out with sm_92 tomorrow.
As usual, remember to install xformers so this will actually be used.
Notice
I don't have either an A100 or an Ampere card to test these builds on. If they crash on you, file a bug and I'll try to figure out why the build is broken; I'm not too keen on it, though, since adding a single PTX arch to the build triples the file size and build time. The larger of the two wheels was built without MAX_JOBS set or a limit on nvcc threads (the threads limit is the more important one: nvcc spawns them like crazy for each file, and I think there were 260 subprocesses under the cmd window I launched it in) because that wasn't an issue with the sm_89 builds, and I hit 270GB of used memory at peak. The best way to build this library still seems to be single-arch if you don't want it to take forever. The Python 3.11 wheel topped out just under 100GB because I set MAX_JOBS to 48 and --threads 2; at least that's consistent, at about 1GB per compilation process. I suspect the number is inflated because I have large pages enabled for my account rather than reflecting actual usage by their compiler, but I don't know for sure.
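If you're doing your own build and want to keep memory under control, capping the parallelism before kicking off the build is the main lever. A minimal sketch (MAX_JOBS is read by PyTorch's extension builder; NVCC_THREADS is what recent flash-attn setup.py revisions map to nvcc --threads, so check your version before relying on the name):

```python
# Hedged sketch: cap build parallelism before building flash-attn from source.
# Launch the build (e.g. "pip install flash-attn --no-build-isolation") from this
# same process/shell so the variables are inherited.
import os

os.environ["MAX_JOBS"] = "48"     # parallel compile jobs (PyTorch's cpp_extension builder)
os.environ["NVCC_THREADS"] = "2"  # assumption: forwarded as nvcc --threads by flash-attn's setup.py
```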
Flash Attention 2 - Torch 2.1.1+cu121 Ada Lovelace and Hopper Wheels for Python 3.10 and 3.11
Initial uploads. Ada Lovelace (native) and Hopper (via PTX) only; Volta and Ampere compatible binaries will be uploaded shortly. Just make sure you've got a recent xformers installed or install it afterwards (xformers doesn't have a hard dependency on this, but all the better CUDA attention algorithms are here), and "pip install" the correct wheel for your Python version.
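If you want to sanity-check the install before pointing a UI at it, calling the kernel directly works. A minimal sketch, with shapes and dtype as my assumptions (flash-attn wants fp16/bf16 tensors shaped (batch, seqlen, heads, headdim) on a CUDA device):

```python
# Hedged smoke test: call the flash-attn kernel directly after installing the wheel.
import torch
from flash_attn import flash_attn_func

q = torch.randn(1, 1024, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

out = flash_attn_func(q, k, v, causal=False)  # returns (batch, seqlen, heads, headdim)
print(out.shape)  # expected: torch.Size([1, 1024, 8, 64])
```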
That's it. I see a ~2 it/s improvement at 1024x1024 on regular SD1.5 models, and that jumps to a 50-60% speed improvement at 2048x2048. At normal generation sizes like 512x512 it'll do a little for you, but not as much. More importantly, it seems much more memory efficient than the PyTorch 2 implementation.
Edit: Fixed the descriptive naming; I'd renamed the wheels after building, and there's a good chance pip will reject them with a nonsensical "bad filename" error if the names don't match what they were when built. Stupid, but whatever.
xformers + integrated flash attention v2.3.6 + sputnik / torch 2.1.2-cu121 / Windows
Now there's only one thing to install. It should install over your existing xformers automagically with pip install. You might want to remove flash_attn first, but I don't think xformers will try to pull in an external copy when the internal one exists.
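To double-check that xformers is actually dispatching to the bundled flash-attention kernels rather than silently falling back to cutlass, you can request the flash op explicitly. A sketch (op and function names are the ones current xformers exports; it raises if the kernel can't run with your GPU/dtype/shapes):

```python
# Hedged check: force xformers onto its flash-attention backend.
import torch
import xformers.ops as xops

q = torch.randn(1, 1024, 8, 64, device="cuda", dtype=torch.float16)

out = xops.memory_efficient_attention(
    q, q, q,
    op=xops.MemoryEfficientAttentionFlashAttentionOp,  # raises if flash-attn isn't usable here
)
print(out.shape)
```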
## Updated:
Added a Python 3.10 version. As with the standalone flash attention builds, this was built with CUDA 12.3 and supports the CUDA_MODULE_LOADING=lazy environment variable to enable lazy loading of CUDA kernels. Windows normally defaults to eager loading and has to load and/or build everything in a given module up front, which introduces more delay than necessary since some kernels may never be used.
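If you'd rather set it from Python than the shell, the variable just needs to be in the environment before torch is imported. A minimal sketch:

```python
# Hedged sketch: turn on lazy CUDA module loading before torch/xformers are imported.
import os
os.environ.setdefault("CUDA_MODULE_LOADING", "LAZY")

import torch          # must be imported after the variable is set for it to take effect
import xformers.ops   # kernels are now loaded lazily on first use
```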
python -m xformers.info should look roughly like this:
xFormers 0.0.24+042abc8.d20231218
memory_efficient_attention.cutlassF: available
memory_efficient_attention.cutlassB: available
memory_efficient_attention.decoderF: available
memory_efficient_attention.flshattF@v2.3.6: available
memory_efficient_attention.flshattB@v2.3.6: available
memory_efficient_attention.smallkF: available
memory_efficient_attention.smallkB: available
memory_efficient_attention.tritonflashattF: unavailable
memory_efficient_attention.tritonflashattB: unavailable
memory_efficient_attention.triton_splitKF: unavailable
indexing.scaled_index_addF: unavailable
indexing.scaled_index_addB: unavailable
indexing.index_select: unavailable
swiglu.dual_gemm_silu: available
swiglu.gemm_fused_operand_sum: available
swiglu.fused.p.cpp: available
Ampere consumer (sm_86) and Ada (sm_89) device code only. If you have A100s or Hoppers you have enough CPU time to do custom builds for those yourself. :-)
While I was trying to build against the nightly v2.3.0 PyTorch to compare its supposed internal flash attention (spoiler: they don't build it into the nightlies, I can't imagine why), I noticed that xformers now tries to build with flash attention included in its wheel by default... or maybe it has for a while. Initially this build failed with thousands of template errors in cutlass::cute because of a bunch of code that's not quite compliant with strict C++17, but then I noticed they'd explicitly disabled permissive mode in MSVC, which breaks the build. Why did they do this? I suspect either a typo or PCP abuse. While doing this I noticed that torch had been updated to 2.1.2, so I retargeted that + Python 3.11 + sm_86 & sm_89, and now you don't have to install xformers and flash attention separately.
At high resolutions I didn't notice any speed improvement over xformers from PyPI plus the other flash attention wheels I uploaded, but when I dropped to 768x768 with 2x upscale it seemed as though some optimization had been done for smaller images: inference flew through 50 iterations of euler_a karras at base resolution and 12 of dpmpp_3m_sde_gpu / simple in just over 10 seconds.
Anyway, here's a Python 3.11 wheel. I'll put a 3.10 wheel up here later.
Also, TL;DR: if you build NVIDIA's CUTLASS-related code that pulls in cute, -Xcompiler /permissive -Xcompiler /Zc:preprocessor seems to do the trick of unbreaking it in C++17 mode. Another trick is programming to the C++ standard you're targeting in the first place, but that's too much work. ;-)
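If you'd rather patch a build than wait for upstream, the fix amounts to appending those flags wherever the extension's compiler arguments are assembled. A rough sketch of the two flag lists involved (variable names are illustrative, not xformers' actual ones):

```python
# Hedged sketch: host-compiler flags that un-break the CUTLASS/cute build on MSVC in C++17 mode.
msvc_cxx_flags = [
    "/permissive",       # re-enable permissive mode, overriding an earlier /permissive-
    "/Zc:preprocessor",  # use the standards-conforming preprocessor
]

# nvcc forwards host-compiler options via -Xcompiler, one option per -Xcompiler:
nvcc_host_flags = [
    "-Xcompiler", "/permissive",
    "-Xcompiler", "/Zc:preprocessor",
]
```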