Proposal
The current `chunk` mode normally loads 64x64 blocks, performs the computation, and then saves the resulting hidden state, which can create an I/O burden. In Tri Dao's Mamba2 implementation and in xLSTM's chunkwise implementation, several 64x64 blocks are processed before a hidden state is saved, i.e. only once every 128 or 256 positions, which reduces the I/O cost of saving hidden states and increases arithmetic intensity. In my previous preliminary experiments, this yielded a non-trivial improvement. We'd want to switch some matmul-heavy kernels to this strategy, such as simple-gla's and deltanet's chunk kernels.

Rationale
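To make the I/O saving concrete, here is a minimal NumPy sketch of the idea, using a plain linear-attention recurrence S ← S + K_b^T V_b as a stand-in (no decay terms; function name, shapes, and the `save_every` parameter are illustrative assumptions, not the actual kernel interface). The inner loop processes 64-row blocks while keeping the running state in "registers"; the state is only written out once per outer step, so the number of state writes drops by a factor of `save_every // block`.

```python
import numpy as np

def chunkwise_states(k, v, block=64, save_every=256):
    """Illustrative sketch: accumulate the recurrent state over several
    64-row blocks and only 'save to HBM' every `save_every` positions,
    instead of after every single block."""
    T, d = k.shape
    S = np.zeros((d, v.shape[1]))
    saved = []  # states written out (stands in for HBM stores)
    for start in range(0, T, save_every):
        # inner blocks: state stays resident, no intermediate store
        for b in range(start, min(start + save_every, T), block):
            kb, vb = k[b:b + block], v[b:b + block]
            S = S + kb.T @ vb
        saved.append(S.copy())  # one state write per outer step
    return saved
```

For T = 512 with `block=64`, `save_every=256` performs 2 state writes instead of 8, while the saved states at the shared positions (and the final state) are identical to the per-block variant; only the store frequency changes, not the recurrence itself.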