
CUTLASS Kernels

A library of CUTLASS kernels targeting Large Language Models (LLMs).

(07-11-24) The official version of FlashAttention-3 will be maintained at https://github.com/Dao-AILab/flash-attention.

We may upload some variants of the FA3 kernels to this repo from time to time for experimentation purposes, but we don't promise the same level of support here.

Building

  1. Download CUTLASS following the instructions at https://github.com/NVIDIA/cutlass.
  2. Modify the hardcoded CUTLASS path in the sample compile.sh to point to your CUTLASS directory (see the sketch after this list).
  3. Run the modified script as ./compile.sh.
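
As a rough illustration, here is a minimal sketch of the kind of invocation such a compile.sh might contain. The path, the file names, and the -arch flag below are assumptions (the FA3 kernels target Hopper, hence sm_90a), not the actual contents of the script in this repo:

    # Hypothetical compile.sh sketch -- not the script shipped in this repo.
    # CUTLASS_DIR is the hardcoded path you must edit (step 2 above).
    CUTLASS_DIR=/path/to/cutlass

    # "main.cu" and "kernel_test" are placeholder names; -arch=sm_90a assumes
    # a Hopper (SM90) GPU.
    nvcc -std=c++17 --expt-relaxed-constexpr \
         -I"${CUTLASS_DIR}/include" \
         -I"${CUTLASS_DIR}/tools/util/include" \
         -arch=sm_90a \
         main.cu -o kernel_test -lcublas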

Running

  1. When running the executable, make sure to set NVIDIA_TF32_OVERRIDE=1 to enable TF32 mode for cuBLAS SGEMM; otherwise, cuBLAS runs SGEMM in full float32. An example invocation follows below.
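
For example (the executable name kernel_test is a placeholder; use whatever binary your compile.sh actually produces):

    # Enable TF32 for cuBLAS SGEMM for this run only.
    # "kernel_test" is a hypothetical name, not necessarily the real binary.
    NVIDIA_TF32_OVERRIDE=1 ./kernel_test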

Notes

  1. See the README.md in each sub-directory for more specific instructions.