Now there's only one thing to install. It should install over your existing xformers automagically with pip install. You might want to remove flash_attn first, but I don't think xformers will try to pull in an external flash attention build when the internal one exists.
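For example, from the Python environment you actually run inference in, something like the following should do it. The wheel filename here is illustrative only; use the actual filename of the wheel you downloaded.

```
:: optional: remove any separately installed flash attention first
pip uninstall flash-attn

:: filename below is illustrative -- install whichever wheel you downloaded
pip install xformers-0.0.24+042abc8.d20231218-cp311-cp311-win_amd64.whl
```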
## Updated:
Added a Python 3.10 version. As with the standalone flash attention builds, this was built with CUDA 12.3 and supports the CUDA_MODULE_LOADING=lazy environment variable for lazy loading of CUDA libraries. Windows normally defaults to eager loading, which has to load and/or build every kernel in a given file up front; that introduces more delay than needed, since some kernels may never be used.
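If you want to try it, here's a minimal way to turn lazy loading on for a single session. This is cmd syntax; NVIDIA's docs spell the value LAZY, and in PowerShell you'd use $env:CUDA_MODULE_LOADING = "LAZY" instead.

```
:: enable lazy CUDA module loading for this cmd session only
set CUDA_MODULE_LOADING=LAZY

:: then launch whatever uses xformers, e.g. check the build info
python -m xformers.info
```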
python -m xformers.info should look roughly like this:
xFormers 0.0.24+042abc8.d20231218
memory_efficient_attention.cutlassF: available
memory_efficient_attention.cutlassB: available
memory_efficient_attention.decoderF: available
memory_efficient_attention.flshattF@v2.x.x: available
memory_efficient_attention.flshattB@v2.x.x: available
memory_efficient_attention.smallkF: available
memory_efficient_attention.smallkB: available
memory_efficient_attention.tritonflashattF: unavailable
memory_efficient_attention.tritonflashattB: unavailable
memory_efficient_attention.triton_splitKF: unavailable
indexing.scaled_index_addF: unavailable
indexing.scaled_index_addB: unavailable
indexing.index_select: unavailable
swiglu.dual_gemm_silu: available
swiglu.gemm_fused_operand_sum: available
swiglu.fused.p.cpp: available
Ampere consumer (sm_86) and Ada (sm_89) device code only. If you have A100s or Hoppers you have enough CPU time to do custom builds for those yourself. :-)
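If you do want to roll your own for datacenter parts, this is a rough sketch of the usual route, assuming a working MSVC + CUDA 12.x toolchain and the matching torch already installed. A100 is compute capability 8.0 and Hopper is 9.0; adjust TORCH_CUDA_ARCH_LIST to taste.

```
:: illustrative source build targeting A100 (sm_80) and Hopper (sm_90)
git clone --recursive https://github.com/facebookresearch/xformers.git
cd xformers

:: PyTorch's extension builder picks the device code targets up from this variable
set TORCH_CUDA_ARCH_LIST=8.0;9.0

:: build against the torch you already have installed rather than an isolated one
pip install -v --no-build-isolation .
```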
While I was trying to build against the nightly v2.3.0 PyTorch to compare its supposed internal flash attention (spoiler: they don't build it into the nightlies, I can't imagine why), I noticed that xformers now tries to include flash attention in its wheel by default... or maybe it has for a while. Initially this build failed with thousands of template errors in CUTLASS's CuTe headers because of a bunch of code that isn't quite compliant with strict C++17, but then I noticed they'd explicitly disabled MSVC's permissive mode, which is exactly what breaks the build. Why did they do this? I suspect either a typo or PCP abuse. While I was at it I noticed that torch had been updated to 2.1.2, so I retargeted that + Python 3.11 + sm_86 & sm_89, and now you don't have to install xformers and flash attention separately.
At high resolutions I didn't notice any speed improvement over xformers from PyPI plus the flash attention wheels I uploaded earlier, but when I dropped to 768x768 with a 2x upscale it seemed as though some optimization for smaller images had been done: inference flew through 50 iterations of euler_a / karras at base resolution and 12 of dpmpp_3m_sde_gpu / simple in just over 10 seconds.
Anyway, here's a Python 3.11 wheel. I'll put a 3.10 wheel up here later.
Also, TL;DR: if you're building NVIDIA's CUTLASS-related code that pulls in CuTe, passing -Xcompiler /permissive -Xcompiler /Zc:preprocessor to nvcc would seem to do the trick of unbreaking it in C++17 mode. Another trick is programming to the C++ standard you're targeting in the first place, but that's too much work. ;-)
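For reference, here's roughly what that looks like on a bare nvcc compile of a CUTLASS/CuTe translation unit. The .cu filename and include path are placeholders; and if a build system is adding /permissive- for you, make sure these flags come later on the command line, since later MSVC options generally override earlier ones.

```
:: illustrative nvcc invocation; my_cutlass_kernel.cu and the include path are placeholders
:: keep MSVC's host compiler in permissive mode but use the conforming preprocessor
nvcc -std=c++17 -I cutlass\include -Xcompiler /permissive -Xcompiler /Zc:preprocessor -c my_cutlass_kernel.cu -o my_cutlass_kernel.obj
```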