Skip to content

Latest commit

 

History

History
15 lines (8 loc) · 712 Bytes

0002_Flash-Decoding_Split-KV_Attention.md

File metadata and controls

15 lines (8 loc) · 712 Bytes

Flash-Decoding / Split-KV Attention

Link: https://github.com/facebookresearch/xformers/blob/main/xformers/ops/fmha/triton_splitk.py

Author: Daniel Haziza, others?

Tags: Attention, Decoding

Description:
A Triton kernel for performing attention while additionally parallelizing over the sequence dimension of the keys and values. Useful for fast, low-batch, long-context decoding. Very feature-rich--includes support for paged and/or quantized KV caches.

Triton Version: Triton v2.1.0+

Other Notes:
Accompanied the release of the Flash-Decoding blogpost. See it for more details and an explanation.

Id in triton index: 0002