[HGEMM] Add MMA 16816 swizzle, Up to 115 TFLOPS (#98)
* Update hgemm_mma.cu

* Update README.md

* Update hgemm.py

* Update hgemm.cu

* Update hgemm_mma.cu

* Update hgemm.cu

* Update hgemm.py

* Update README.md

* Update hgemm_mma.cu

* Update hgemm.py

* Update hgemm.cu

* Update hgemm_mma.cu

* Update README.md

* Update hgemm.py

* Update README.md

* Update README.md

* Update hgemm_mma_stage.cu

* Update hgemm.py

* Update hgemm.cu

* Update README.md

* Update README.md

* Update hgemm_mma_stage.cu

* Update hgemm_mma_stage.cu
DefTruth authored Oct 21, 2024
1 parent 0aeb450 commit a2934b9
Showing 6 changed files with 1,247 additions and 314 deletions.
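The "swizzle" in the commit title refers to permuting shared-memory addresses so that `ldmatrix` loads for the MMA fragments hit distinct banks. The sketch below is hypothetical (it is not the code from this commit; the function name, 64-half row width, and XOR pattern are illustrative assumptions), but it shows the XOR-based index math such kernels typically use:

```cuda
#include <cstdint>

// Hypothetical XOR swizzle for a shared-memory tile of __half, laid out as
// rows of 64 halves, accessed in 8-half (16-byte) chunks. XOR-ing the chunk
// column with low bits of the row spreads the chunks of consecutive rows
// across different banks, so ldmatrix reads avoid bank conflicts.
// NOT the actual kernel code from this commit.
__device__ __forceinline__ int swizzle_offset(int row, int col8) {
  int swizzled_col8 = col8 ^ (row & 7);  // example permutation
  return row * 64 + swizzled_col8 * 8;   // offset in half elements
}
```

With this mapping, the same logical chunk column lands in eight different physical chunk positions across eight consecutive rows, which is the bank-conflict-avoidance property the swizzle exists to provide.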
6 changes: 5 additions & 1 deletion README.md
@@ -147,6 +147,10 @@
 | ✔️ [hgemm_wmma_m32n8k16....dbuf*](./hgemm/hgemm_wmma.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
 | ✔️ [hgemm_wmma_m16n16k16...stages*](./hgemm/hgemm_wmma_stage.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
 | ✔️ [hgemm_wmma_m16n16k16...swizzle*](./hgemm/hgemm_wmma_stage.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
+| ✔️ [hgemm_mma_m16n8k16...naive*](./hgemm/hgemm_mma.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
+| ✔️ [hgemm_mma_m16n8k16...mma2x4*](./hgemm/hgemm_mma.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
+| ✔️ [hgemm_mma_m16n8k16...stages*](./hgemm/hgemm_mma_stage.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
+| ✔️ [hgemm_mma_m16n8k16...swizzle*](./hgemm/hgemm_mma_stage.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
 | ✔️ [sgemv_k32_f32](./sgemv/sgemv.cu)|f32|f32|[link](./sgemv/)|⭐️⭐️⭐️|
 | ✔️ [sgemv_k128_f32x4](./sgemv/sgemv.cu)|f32|f32|[link](./sgemv/)|⭐️⭐️⭐️|
 | ✔️ [sgemv_k16_f32](./sgemv/sgemv.cu)|f32|f32|[link](./sgemv/)|⭐️⭐️⭐️|
@@ -158,7 +162,7 @@
 | ✔️ [hard_nms cpp only](./nms/nms.cc)|f32|/|/|⭐️|
 | ✔️ [notes v1(deprecated)](./notes-v1.cu)|f32|f32|/|⭐️|
 
-👉TIPS: * means using **Tensor Cores(MMA PTX)**, otherwise, using CUDA Cores by default.
+👉TIPS: * means using **Tensor Cores(MMA/WMMA)**, otherwise, using CUDA Cores by default.
 
 ## 0x01 📖 博客目录

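The table entries marked `*` use the Tensor Core MMA PTX path. A minimal sketch of the single `mma.m16n8k16` instruction those kernels are built around (assumptions: sm_80 or newer, f16 accumulate, row-major A and column-major B fragments; this wrapper is illustrative, not this repository's actual code):

```cuda
#include <cstdint>

// One warp-wide m16n8k16 Tensor Core MMA: D = A * B + C, all in f16.
// Each thread holds its fragment slices packed into 32-bit registers:
// A is 4 regs, B is 2 regs, C and D are 2 regs per thread.
// Illustrative sketch, NOT the repository's actual kernel code.
__device__ __forceinline__ void mma_m16n8k16_f16(
    uint32_t *RD, const uint32_t *RA, const uint32_t *RB, const uint32_t *RC) {
  asm volatile(
      "mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16 "
      "{%0, %1}, {%2, %3, %4, %5}, {%6, %7}, {%8, %9};\n"
      : "=r"(RD[0]), "=r"(RD[1])
      : "r"(RA[0]), "r"(RA[1]), "r"(RA[2]), "r"(RA[3]),
        "r"(RB[0]), "r"(RB[1]),
        "r"(RC[0]), "r"(RC[1]));
}
```

The `naive`, `mma2x4`, `stages`, and `swizzle` variants in the table differ in how many of these MMAs each warp issues per iteration and how the fragments are staged through shared memory, not in the core instruction itself.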