Now there's only one thing to install. It should install over your existing xformers automagically with pip install. You might want to remove flash_attn first, but I don't think xformers will try to pull in an external flash attention build when the internal one exists.
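For example, from the Python environment you actually run inference in, something like the following should do it. The wheel filename here is illustrative only; use the actual filename of the wheel you downloaded.

```
:: optional: remove any separately installed flash attention first
pip uninstall flash-attn

:: filename below is illustrative -- install whichever wheel you downloaded
pip install xformers-0.0.24+042abc8.d20231218-cp311-cp311-win_amd64.whl
```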
## Updated:
Added a Python 3.10 version. As with the standalone flash attention builds, this was built with CUDA 12.3 and supports the CUDA_MODULE_LOADING=lazy environment variable for lazy loading of CUDA libraries. Windows normally defaults to eager loading, which has to load and/or build every kernel in a given file up front; that introduces more delay than needed, since some kernels may never be used.
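If you want to try it, here's a minimal way to turn lazy loading on for a single session. This is cmd syntax; NVIDIA's docs spell the value LAZY, and in PowerShell you'd use $env:CUDA_MODULE_LOADING = "LAZY" instead.

```
:: enable lazy CUDA module loading for this cmd session only
set CUDA_MODULE_LOADING=LAZY

:: then launch whatever uses xformers, e.g. check the build info
python -m xformers.info
```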
python -m xformers.info should look roughly like this:
xFormers 0.0.24+042abc8.d20231218
memory_efficient_attention.cutlassF: available
memory_efficient_attention.cutlassB: available
memory_efficient_attention.decoderF: available
memory_efficient_attention.flshattF@v2.x.x: available
memory_efficient_attention.flshattB@v2.x.x: available
memory_efficient_attention.smallkF: available
memory_efficient_attention.smallkB: available
memory_efficient_attention.tritonflashattF: unavailable
memory_efficient_attention.tritonflashattB: unavailable
memory_efficient_attention.triton_splitKF: unavailable
indexing.scaled_index_addF: unavailable
indexing.scaled_index_addB: unavailable
indexing.index_select: unavailable
swiglu.dual_gemm_silu: available
swiglu.gemm_fused_operand_sum: available
swiglu.fused.p.cpp: available
Ampere consumer (sm_86) and Ada (sm_89) device code only. If you have A100s or Hoppers you have enough CPU time to do custom builds for those yourself. :-)
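If you do want to roll your own for datacenter parts, this is a rough sketch of the usual route, assuming a working MSVC + CUDA 12.x toolchain and the matching torch already installed. A100 is compute capability 8.0 and Hopper is 9.0; adjust TORCH_CUDA_ARCH_LIST to taste.

```
:: illustrative source build targeting A100 (sm_80) and Hopper (sm_90)
git clone --recursive https://github.com/facebookresearch/xformers.git
cd xformers

:: PyTorch's extension builder picks the device code targets up from this variable
set TORCH_CUDA_ARCH_LIST=8.0;9.0

:: build against the torch you already have installed rather than an isolated one
pip install -v --no-build-isolation .
```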
While I was trying to build against the nightly v2.3.0 PyTorch to compare its supposed internal flash attention (spoiler: they don't build it into the nightlies, I can't imagine why), I noticed that xformers now tries to include flash attention in its wheel by default... or maybe it has for a while. Initially this build failed with thousands of template errors in CUTLASS's CuTe headers because of a bunch of code that isn't quite compliant with strict C++17, but then I noticed they'd explicitly disabled MSVC's permissive mode, which is exactly what breaks the build. Why did they do this? I suspect either a typo or PCP abuse. While I was at it I noticed that torch had been updated to 2.1.2, so I retargeted that + Python 3.11 + sm_86 & sm_89, and now you don't have to install xformers and flash attention separately.
At high resolutions I didn't notice any speed improvement over xformers from PyPI plus the flash attention wheels I uploaded earlier, but when I dropped to 768x768 with a 2x upscale it seemed as though some optimization for smaller images had been done: inference flew through 50 iterations of euler_a / karras at base resolution and 12 of dpmpp_3m_sde_gpu / simple in just over 10 seconds.
Anyway, here's a Python 3.11 wheel. I'll put a 3.10 wheel up here later.
Also, TL;DR: if you're building NVIDIA's CUTLASS-related code that pulls in CuTe, passing -Xcompiler /permissive -Xcompiler /Zc:preprocessor to nvcc would seem to do the trick of unbreaking it in C++17 mode. Another trick is programming to the C++ standard you're targeting in the first place, but that's too much work. ;-)
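For reference, here's roughly what that looks like on a bare nvcc compile of a CUTLASS/CuTe translation unit. The .cu filename and include path are placeholders; and if a build system is adding /permissive- for you, make sure these flags come later on the command line, since later MSVC options generally override earlier ones.

```
:: illustrative nvcc invocation; my_cutlass_kernel.cu and the include path are placeholders
:: keep MSVC's host compiler in permissive mode but use the conforming preprocessor
nvcc -std=c++17 -I cutlass\include -Xcompiler /permissive -Xcompiler /Zc:preprocessor -c my_cutlass_kernel.cu -o my_cutlass_kernel.obj
```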