HGEMM Up to 113 TFLOPS: L20

@DefTruth released this 21 Oct 01:56
· 30 commits to main since this release
0aeb450

What's Changed

  • [Mat][Trans] Add f32/f32x4 row/col first kernel by @bear-zd in #89
  • [Docs][Contribute] Add How to contribute Notes by @DefTruth in #90
  • [HGEMM] optimize SMEM padding, up to 113 TFLOPS by @DefTruth in #92
  • [Mat][Trans] Add f32x4_shared/bcf row/col first kernel by @bear-zd in #91
  • [Docs] rename mat_transpose -> mat-transpose by @DefTruth in #93
  • [HGEMM] Add GeForce RTX 3080 Laptop benchmark by @DefTruth in #94
  • [HGEMM] update HGEMM benchmark option by @DefTruth in #95
  • [HGEMM] Refactor HGEMM WMMA 161616 kernels by @DefTruth in #96
  • [HGEMM] Update HGEMM WMMA Benchmark by @DefTruth in #97

Full Changelog: v2.4.12...v2.4.13