[RFC] Increase computation intensity for certain kernels #190

Open
sustcsonglin opened this issue Feb 17, 2025 · 1 comment
Labels
enhancement New feature or request

Comments

@sustcsonglin (Collaborator)

Proposal

In the current chunk mode, kernels normally load a 64x64 block, perform the computation, and then store the resulting hidden state, which incurs a non-trivial I/O burden. In Tri Dao's Mamba2 implementation and in xLSTM's chunkwise implementation, several 64x64 blocks are loaded and the hidden state is stored only once every 128 or 256 tokens, which reduces the I/O cost of saving hidden states and increases the arithmetic intensity. In my previous preliminary experiments this gave a non-trivial speedup. We'd want to switch some matmul-rich kernels to this strategy, e.g. the chunk kernels of simple-gla and deltanet. A sketch of the idea follows.
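Below is a minimal PyTorch-level sketch of the proposal, not the actual Triton kernel. It assumes a plain (ungated) linear-attention recurrence S_t = S_{t-1} + k_t^T v_t for simplicity; the name `subchunks_per_store` and the function itself are illustrative, not part of the codebase.

```python
import torch

def chunkwise_states(k, v, chunk_size=64, subchunks_per_store=2):
    """Sketch: accumulate several 64x64 block updates before storing the state.

    Baseline chunk kernels materialize the hidden state S after every
    `chunk_size` tokens. Here, `subchunks_per_store` blocks are accumulated
    locally (modeling registers/SRAM in a real kernel) and S is written out
    only every `chunk_size * subchunks_per_store` tokens, cutting the number
    of hidden-state stores by that factor.
    """
    B, T, K = k.shape
    stride = chunk_size * subchunks_per_store
    assert T % stride == 0

    S = torch.zeros(B, K, v.shape[-1], dtype=k.dtype, device=k.device)
    stored_states = []  # states that would be written back to HBM

    for start in range(0, T, stride):
        for sub in range(subchunks_per_store):
            s, e = start + sub * chunk_size, start + (sub + 1) * chunk_size
            # one 64x64-style block update, kept on-chip in the real kernel
            S = S + k[:, s:e].transpose(1, 2) @ v[:, s:e]
        stored_states.append(S.clone())  # single store per `stride` tokens
    return stored_states
```

The trade-off is that keeping the running state on-chip across multiple sub-chunks consumes more registers/shared memory per program, which constrains tile sizes and occupancy, so the best `subchunks_per_store` would need to be tuned per kernel.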

Rationale

@sustcsonglin added the enhancement label on Feb 17, 2025
@Triang-jyed-driung (Contributor)

RWKV7 would benefit from that too :) Wind's CUDA kernel may already do something similar.
