Releases: DefTruth/CUDA-Learn-Notes
v2.5
What's Changed
- [HGEMM] Update HGEMM README.md by @DefTruth in #120
- [HGEMM] Add plot tflops function by @DefTruth in #121 (see the sketch below)
- [HGEMM] Add NVIDIA RTX 3090 Laptop perf plot by @DefTruth in #122
- [PERF] Update HGEMM benchmark scripts by @DefTruth in #123
- [HGEMM] Add HGEMM L20/4090 benchmark figures by @DefTruth in #124
- Bump up to v2.5 by @DefTruth in #125
Full Changelog: v2.4.18...v2.5
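For context on the TFLOPS figures these releases quote, here is a minimal sketch of how an HGEMM throughput number is typically derived from a timed run. The function name and example sizes are illustrative only, not the repo's actual plotting/benchmark script.

```cuda
#include <cstdio>

// One MxNxK GEMM performs 2*M*N*K floating-point operations (a multiply and an
// add per MAC); dividing by the measured runtime gives achieved TFLOPS.
double hgemm_tflops(int M, int N, int K, double elapsed_ms) {
  double flops = 2.0 * (double)M * (double)N * (double)K;
  return flops / (elapsed_ms * 1e-3) / 1e12;
}

int main() {
  // e.g. a 4096x4096x4096 HGEMM measured at 1.2 ms reports ~114.5 TFLOPS,
  // which is the scale of the L20 numbers quoted in these release titles.
  printf("%.1f TFLOPS\n", hgemm_tflops(4096, 4096, 4096, 1.2));
  return 0;
}
```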
v2.4.18
What's Changed
- Update README.md by @DefTruth in #115
- [HGEMM] Update HGEMM Supported Matrix by @DefTruth in #116
- [HGEMM] Update HGEMM/SGEMM Supported Matrix by @DefTruth in #117
- [README] Update HGEMM/SGEMM Supported Matrix by @DefTruth in #118
- [HGEMM] Add NVIDIA RTX 4090 benchmark by @DefTruth in #119
Full Changelog: v2.4.17...v2.4.18
v2.4.17
What's Changed
- [NMS] Add nms f32 cuda kernel. by @bear-zd in #102
- [HGEMM] Add some note to collective store by @DefTruth in #103
- [HGEMM] Add HGEMM MMA Col Major Kernel by @DefTruth in #104
- [HGEMM] Update HGEMM benchmark scripts by @DefTruth in #105
- [HGEMM] Add Warp Swizzle as template param by @DefTruth in #106
- [HGEMM] add -Xptxas -v compile flag by @DefTruth in #107 (see the sketch below)
- [HGEMM] Try to reduce register usage by @DefTruth in #108
- [HGEMM] Update HGEMM MMA/WMMA Usage by @DefTruth in #109
- [HGEMM][Docs] Add HGEMM Supported Matrix by @DefTruth in #110
- [HGEMM] Add M=N=K option for benchmark by @DefTruth in #111
- [HGEMM] Update HGEMM/SGEMM Supported Matrix by @DefTruth in #112
- [README] Update HGEMM/SGEMM Supported matrix by @DefTruth in #113
- [Docs] Update HGEMM/SGEMM Supported Matrix by @DefTruth in #114
Full Changelog: v2.4.16...v2.4.17
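#107 and #108 are both about register pressure. As a hedged illustration of the workflow: compile any kernel with `-Xptxas -v` and ptxas prints per-kernel register and shared-memory usage, which is the number #108 tries to bring down. The kernel below is a stand-in, not the repo's HGEMM kernel.

```cuda
// Compile with, e.g.:  nvcc -O3 -arch=sm_89 -Xptxas -v reg_demo.cu -c
// ptxas then reports a line per kernel such as "Used N registers, ... smem".
#include <cuda_fp16.h>

__global__ void copy_f16x8(const half* __restrict__ x, half* __restrict__ y, int n) {
  int i = (blockIdx.x * blockDim.x + threadIdx.x) * 8;
  if (i + 7 < n) {
    // 128-bit vectorized copy: fewer memory instructions, but the wide value
    // keeps more registers live at once, which -Xptxas -v makes visible.
    float4 v = reinterpret_cast<const float4*>(x + i)[0];
    reinterpret_cast<float4*>(y + i)[0] = v;
  }
}
```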
HGEMM Warp Swizzle/Reg Buffers
HGEMM Up to 115 TFLOPS: L20
What's Changed
Full Changelog: v2.4.13...v2.4.15
HGEMM Up to 113 TFLOPS: L20
What's Changed
- [Mat][Trans] Add f32/f32x4 row/col first kernel by @bear-zd in #89
- [Docs][Contribute] Add How to contribute Notes by @DefTruth in #90
- [HGEMM] optimize SMEM padding, up to 113 TFLOPS by @DefTruth in #92 (see the sketch below)
- [Mat][Trans] Add f32x4_shared/bcf row/col first kernel. by @bear-zd in #91
- [Docs] rename mat_transpose -> mat-transpose by @DefTruth in #93
- [HGEMM] Add GeForce RTX 3080 Laptop benchmark by @DefTruth in #94
- [HGEMM] update HGEMM benchmark option by @DefTruth in #95
- [HGEMM] Refactor HGEMM WMMA 161616 kernels by @DefTruth in #96
- [HGEMM] Update HGEMM WMMA Benchmark by @DefTruth in #97
Full Changelog: v2.4.12...v2.4.13
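The bank-conflict-free transpose kernels (#91) and the SMEM padding optimization (#92) rest on the same idea: pad each shared-memory row so that column-wise accesses land on distinct banks. A minimal f32 transpose sketch of that idea, not the repo's exact kernels:

```cuda
// Launched with dim3 block(32, 32); padding the row to 33 floats makes a
// column-wise read hit 32 different banks instead of serializing 32-way.
__global__ void transpose32x32_padded(const float* __restrict__ in,
                                      float* __restrict__ out, int n) {
  __shared__ float tile[32][33];  // [32][32] would cause 32-way bank conflicts below
  int x = blockIdx.x * 32 + threadIdx.x;
  int y = blockIdx.y * 32 + threadIdx.y;
  if (x < n && y < n) tile[threadIdx.y][threadIdx.x] = in[y * n + x];
  __syncthreads();
  x = blockIdx.y * 32 + threadIdx.x;  // swap block coordinates for the write
  y = blockIdx.x * 32 + threadIdx.y;
  if (x < n && y < n) out[y * n + x] = tile[threadIdx.x][threadIdx.y];
}
```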
v2.4.12 SGEMM TF32 Swizzle
What's Changed
- [SGEMM] SGEMM TF32 Thread Block Swizzle by @DefTruth in #84
- [HGEMM] mma4x4_warp4x4_stages with swizzle by @DefTruth in #86
- [SWISH] support Swish F32/F16 kernel by @wangzijian1010 in #85 (see the sketch below)
- [SGEMM] Update SGEMM TF32 Benchmark by @DefTruth in #87
New Contributors
- @wangzijian1010 made their first contribution in #85
Full Changelog: v2.4.11...v2.4.12
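A plain f32 sketch of the Swish activation added in #85 (kernel name and signature here are illustrative, not the repo's exact API): swish(x) = x * sigmoid(x).

```cuda
#include <math.h>

__global__ void swish_f32(const float* __restrict__ x, float* __restrict__ y, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    float v = x[i];
    y[i] = v / (1.0f + expf(-v));  // x * sigmoid(x), folded into one division
  }
}
```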
v2.4.11 HGEMM Block Swizzle
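This release's title names the HGEMM thread block swizzle. As a hedged illustration of the general grouped-tile-ordering technique (blocks launched close together in time work on C tiles that share A/B data, improving L2 reuse); the repo's actual index mapping may differ:

```cuda
// Maps a 1-D linear block index over all C tiles to a swizzled (tile_m, tile_n):
// tiles are walked in narrow groups of group_m rows rather than a full row at a time.
__device__ void swizzled_tile(int grid_m, int grid_n, int group_m,
                              int* tile_m, int* tile_n) {
  int pid   = blockIdx.x;            // 1-D launch over grid_m * grid_n tiles
  int width = group_m * grid_n;      // number of blocks in one group of rows
  int group = pid / width;
  int first_m = group * group_m;
  int rows = min(grid_m - first_m, group_m);   // last group may be shorter
  *tile_m = first_m + (pid % width) % rows;
  *tile_n = (pid % width) / rows;
}
```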
v2.4.10 SGEMM TF32 Stage 2/3
What's Changed
- [HGEMM] HGEMM WMMA Stage mma4x2+warp4x4 by @DefTruth in #76
- [SGEMM] Add SGEMM WMMA TF32 Stage2/3 by @DefTruth in #77
- [SGEMM] Add cuBLAS SGEMM F32/TF32 baseline by @DefTruth in #78
- [SGEMM] Add Kernel cudaFuncSetAttribute hint by @DefTruth in #79 (see the sketch below)
- [RoPE] Add minimal RoPE f32/f32x4 pack impl by @bear-zd in #80
Full Changelog: v2.4.9...v2.4.10
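The cudaFuncSetAttribute hint from #79 matters for the stage-2/3 kernels: with multiple pipeline stages the dynamic shared-memory footprint can exceed the default 48 KB per-block limit, so the limit has to be raised before launch. A sketch with illustrative tile sizes and names (not the repo's exact configuration; error checking omitted):

```cuda
#include <cuda_runtime.h>

__global__ void sgemm_tf32_stage3(const float* A, const float* B, float* C,
                                  int M, int N, int K) {
  extern __shared__ float smem[];  // 3 stages of A/B tiles live here
  // main loop elided in this sketch
}

void launch(const float* A, const float* B, float* C, int M, int N, int K) {
  // 3 stages of 128x32 A tiles and 32x128 B tiles: 96 KB, above the 48 KB default.
  size_t smem_bytes = 3 * (128 * 32 + 32 * 128) * sizeof(float);
  cudaFuncSetAttribute(sgemm_tf32_stage3,
                       cudaFuncAttributeMaxDynamicSharedMemorySize,
                       (int)smem_bytes);
  dim3 grid((N + 127) / 128, (M + 127) / 128), block(256);
  sgemm_tf32_stage3<<<grid, block, smem_bytes>>>(A, B, C, M, N, K);
}
```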
v2.4.9 HGEMM WMMA Stage
What's Changed
- [HGEMM] Add HGEMM WMMA Double Buffers by @DefTruth in #69
- [Embedding] Add embedding kernel f32/x4/x4_pack, f16/x8/x8_pack by @bear-zd in #68
- [HGEMM] Add HGEMM mma4x2, warp2x4x2 kernel by @DefTruth in #70
- [HGEMM] HGEMM WMMA with Reg double buffers by @DefTruth in #71
- [HGEMM] Add HGEMM WMMA Stage 3/4 Kernel by @DefTruth in #74
- [Softmax] Add online softmax f32x4 pack kernel by @bear-zd in #73 (see the sketch below)
- [HGEMM][Bugfix] fix HGEMM Stage cp.async error by @DefTruth in #75
Full Changelog: v2.4.8...v2.4.9
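The online softmax from #73 computes the row max and the normalizer in a single pass. A deliberately simple one-thread-per-row sketch of the recurrence (the repo's kernel is a packed f32x4 block-level version):

```cuda
#include <math.h>
#include <float.h>

__global__ void online_softmax_rowwise(const float* __restrict__ x,
                                       float* __restrict__ y,
                                       int rows, int cols) {
  int row = blockIdx.x * blockDim.x + threadIdx.x;
  if (row >= rows) return;
  const float* xr = x + row * cols;
  float m = -FLT_MAX, d = 0.0f;
  for (int j = 0; j < cols; ++j) {
    float v = xr[j];
    float m_new = fmaxf(m, v);
    d = d * expf(m - m_new) + expf(v - m_new);  // rescale old sum, add new term
    m = m_new;
  }
  for (int j = 0; j < cols; ++j) {
    y[row * cols + j] = expf(xr[j] - m) / d;
  }
}
```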