TileLang achieves exceptional performance across a variety of computational patterns. Below are selected results showcasing its capabilities:
- Flash Attention Performance on H100
- Matmul Performance on GPUs (RTX 4090, A100, H100, MI300X)
- Dequantize Matmul Performance on A100

| | V0 | V1 | V2 | V3 | V4 | V5 | V6 | V7 |
|---|---|---|---|---|---|---|---|---|
| m | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| n | 16384 | 43008 | 14336 | 57344 | 14336 | 9216 | 36864 | 9216 |
| k | 16384 | 14336 | 14336 | 14336 | 57344 | 9216 | 9216 | 36864 |

| | M0 | M1 | M2 | M3 | M4 | M5 | M6 | M7 |
|---|---|---|---|---|---|---|---|---|
| m | 4096 | 4096 | 4096 | 4096 | 8192 | 8192 | 8192 | 8192 |
| n | 1024 | 8192 | 28672 | 8192 | 1024 | 8192 | 28672 | 8192 |
| k | 8192 | 8192 | 8192 | 28672 | 8192 | 8192 | 8192 | 28672 |
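
To relate the M0..M7 shapes to throughput numbers, it can help to compute the work each one represents. The sketch below uses the standard count of 2·m·n·k floating-point operations for a dense matmul. The helper names and the 1.0 ms latency are hypothetical placeholders, not measured results.

```python
# (m, n, k) copied from the M0..M7 table above.
MATMUL_SHAPES = {
    "M0": (4096, 1024, 8192),
    "M1": (4096, 8192, 8192),
    "M2": (4096, 28672, 8192),
    "M3": (4096, 8192, 28672),
    "M4": (8192, 1024, 8192),
    "M5": (8192, 8192, 8192),
    "M6": (8192, 28672, 8192),
    "M7": (8192, 8192, 28672),
}


def matmul_flops(m, n, k):
    """One multiply and one add per (m, n, k) triple."""
    return 2 * m * n * k


def tflops(flops, latency_ms):
    """Convert an operation count and a latency in milliseconds to TFLOP/s."""
    return flops / (latency_ms * 1e-3) / 1e12


if __name__ == "__main__":
    placeholder_latency_ms = 1.0  # hypothetical; substitute a measured value
    for name, shape in MATMUL_SHAPES.items():
        flops = matmul_flops(*shape)
        rate = tflops(flops, placeholder_latency_ms)
        print(f"{name}: {flops / 1e9:.1f} GFLOP, {rate:.1f} TFLOP/s at 1 ms")
```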

| | FA0 | FA1 | FA2 | FA3 | FA4 |
|---|---|---|---|---|---|
| batch | 1 | 1 | 1 | 1 | 1 |
| nheads | 32 | 32 | 32 | 32 | 32 |
| seq_len | 512 | 512 | 1024 | 1024 | 4096 |
| head_dim | 128 | 128 | 128 | 128 | 128 |
| causal | true | false | true | false | true |
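
The attention configurations FA0..FA4 can be turned into forward-pass operation counts in a similar way. The sketch below assumes the common benchmarking convention of 4·batch·nheads·seq_len²·head_dim operations (two matmuls, QKᵀ and PV), halved when the causal mask is enabled. `AttnConfig` and `attention_fwd_flops` are hypothetical names used only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttnConfig:
    batch: int
    nheads: int
    seq_len: int
    head_dim: int
    causal: bool


# Values copied from the FA0..FA4 table above.
FA_CONFIGS = {
    "FA0": AttnConfig(1, 32, 512, 128, True),
    "FA1": AttnConfig(1, 32, 512, 128, False),
    "FA2": AttnConfig(1, 32, 1024, 128, True),
    "FA3": AttnConfig(1, 32, 1024, 128, False),
    "FA4": AttnConfig(1, 32, 4096, 128, True),
}


def attention_fwd_flops(cfg):
    """Forward-pass operation count under the assumed convention."""
    flops = 4 * cfg.batch * cfg.nheads * cfg.seq_len ** 2 * cfg.head_dim
    # A causal mask skips roughly half of the score matrix.
    return flops // 2 if cfg.causal else flops


if __name__ == "__main__":
    for name, cfg in FA_CONFIGS.items():
        print(f"{name}: {attention_fwd_flops(cfg) / 1e9:.2f} GFLOP")
```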

| | CC0 | CC1 | CC2 | CC3 | CC4 | CC5 |
|---|---|---|---|---|---|---|
| batch | 1 | 1 | 1 | 64 | 64 | 64 |
| nheads | 64 | 64 | 64 | 64 | 64 | 64 |
| seq_len | 1024 | 2048 | 8192 | 1024 | 2048 | 8192 |
| head_dim | 64 | 64 | 64 | 64 | 64 | 64 |
| d_state | 128 | 128 | 128 | 128 | 128 | 128 |

| | CT0 | CT1 | CT2 | CT3 | CT4 | CT5 |
|---|---|---|---|---|---|---|
| batch | 1 | 1 | 1 | 64 | 64 | 64 |
| nheads | 64 | 64 | 64 | 64 | 64 | 64 |
| seq_len | 1024 | 2048 | 8192 | 1024 | 2048 | 8192 |
| head_dim | 64 | 64 | 64 | 64 | 64 | 64 |
| d_state | 128 | 128 | 128 | 128 | 128 | 128 |