A library of CUTLASS kernels targeting Large Language Models (LLMs).
(07-11-24) The official version of FlashAttention-3 will be maintained at https://github.com/Dao-AILab/flash-attention.
We may occasionally upload variants of the FA3 kernels to this repo for experimentation, but we don't promise the same level of support here.
- Download CUTLASS following the instructions at https://github.com/NVIDIA/cutlass.
- Modify the (hardcoded) path in the sample compile.sh to point to your CUTLASS directory (see the sketch after this list).
- Run the modified script: `./compile.sh`.
- When running the executable, set NVIDIA_TF32_OVERRIDE=1 to enable TF32 mode in cuBLAS for SGEMM; otherwise cuBLAS computes in full float32 (see the example after this list).
- See the README.md in each sub-directory for more specific instructions.
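A minimal sketch of what the edited compile.sh might contain, assuming a Hopper (sm_90a) target; the `CUTLASS_DIR` value, source file, flags, and output name are placeholders, not the actual contents of the sample script:

```bash
#!/bin/bash
# Hypothetical compile.sh sketch; match the flags and file names to the sample script in this repo.
CUTLASS_DIR=/path/to/cutlass   # <-- edit this hardcoded path to your CUTLASS checkout

nvcc -O3 -std=c++17 \
  -arch=sm_90a \
  --expt-relaxed-constexpr \
  -I"${CUTLASS_DIR}/include" \
  -I"${CUTLASS_DIR}/tools/util/include" \
  -lcublas \
  main.cu -o main
```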
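And a usage example for the TF32 override (the binary name `main` is a placeholder); setting the variable per-invocation enables TF32 for that run without changing your environment globally:

```bash
# Enable TF32 tensor-core math in cuBLAS for this run only;
# without the override, cuBLAS computes SGEMM in full float32.
NVIDIA_TF32_OVERRIDE=1 ./main
```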