Flash-Decoding / Split-KV Attention

Link: https://github.com/facebookresearch/xformers/blob/main/xformers/ops/fmha/triton_splitk.py

Author: Daniel Haziza, others?

Tags: Attention, Decoding

Description:
A Triton kernel that computes attention while also parallelizing over the sequence dimension of the keys and values. Useful for fast, low-batch, long-context decoding. Feature-rich: includes support for paged and/or quantized KV caches.
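
To make the split-KV idea concrete, here is a minimal pure-Python sketch for a single query vector (the decoding case). It is not the xformers Triton kernel; the function names `attention_reference` and `attention_split_kv` and the `num_splits` parameter are hypothetical and chosen only for illustration. It assumes the usual formulation: each KV chunk yields a partial result together with its running max and sum of exponentials, and a final reduction rescales and merges the partials.

```python
import math


def attention_reference(q, keys, values):
    """Plain softmax attention for one query vector, used as a baseline."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def attention_split_kv(q, keys, values, num_splits):
    """Split-KV attention: process KV chunks independently, then merge."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    chunk = math.ceil(len(keys) / num_splits)

    # Step 1: each chunk could run in parallel (e.g. one GPU program each).
    partials = []  # (chunk max, chunk sum of exp, unnormalized output)
    for start in range(0, len(keys), chunk):
        k_chunk = keys[start:start + chunk]
        v_chunk = values[start:start + chunk]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_chunk]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        out = [sum(w * v[d] for w, v in zip(weights, v_chunk))
               for d in range(dim)]
        partials.append((m, sum(weights), out))

    # Step 2: reduction. Rescale every partial to a shared global max.
    m_global = max(m for m, _, _ in partials)
    total = 0.0
    merged = [0.0] * dim
    for m, s, out in partials:
        factor = math.exp(m - m_global)
        total += factor * s
        for d in range(dim):
            merged[d] += factor * out[d]
    return [x / total for x in merged]
```

Up to floating-point rounding, `attention_split_kv` returns the same vector as `attention_reference` for any `num_splits`. The benefit of splitting is that with a small batch and a long context, the chunks give the GPU enough independent work to stay busy.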

Triton Version: Triton v2.1.0+

Other Notes:
Released alongside the Flash-Decoding blog post; see that post for more details and an explanation of the algorithm.

Id in triton index: 0002
