
[FA fwd D=128] Reduce LDS usage in epilogue #340

Merged · 9 commits into triton-mlir · Oct 25, 2023

Conversation

@oplavsic (Collaborator) commented on Sep 26, 2023

This PR reduces LDS usage in the epilogue by breaking the convert_layout #mfma --> #blocked to multiple convert_layouts, each of which uses less LDS than the original one. The issue with the original convert_layout is that padding is used in LDS to avoid bank conflicts. It's simpler than swizzling but extra LDS space is required.

Note:

  • This PR only reduces LDS usage if the convert_layout in the epilogue uses more than 64 KB of LDS, which is the case for FA fwd D=128.
  • For smaller block sizes, as in the case of FA fwd D=64, the convert_layout in the epilogue does not exceed 64 KB of LDS, but it still uses more LDS than the block size warrants, which can hurt occupancy. We'll fix this issue for D=64 in a future PR. A rough cost model for the split is sketched below.
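
As a rough illustration of why splitting the convert helps (a hypothetical sketch using the cost model from the in-code comments quoted later in this review, not the PR's actual helper):

```cpp
#include <cstdio>

// Rough LDS estimate for one convert_layout, following the in-code comment:
//   LDS_USAGE = getShapePerCTA(mfma)[0] * getShapePerCTA(blocked)[1] * sizeof(T)
// The function name and tile sizes here are illustrative only.
static int convertLdsBytes(int mfmaShapeM, int blockedShapeN, int elemBytes) {
  return mfmaShapeM * blockedShapeN * elemBytes;
}

int main() {
  // A hypothetical 256 x 128 fp32 epilogue tile: 256 * 128 * 4 = 128 KB,
  // double the 64 KB LDS budget, so a single convert cannot be allocated.
  std::printf("single convert: %d KB\n", convertLdsBytes(256, 128, 4) / 1024);
  // Splitting into two converts that each touch half the tile halves the
  // scratch requirement of each step: 64 KB live at a time.
  std::printf("split convert:  %d KB per step\n", convertLdsBytes(128, 128, 4) / 1024);
  return 0;
}
```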

@zhanglx13 commented on Sep 26, 2023

**Benchmarks**

| nheads | bs | seqlen | d64-False | d64-True | d128-False | d128-True | d64-bwd |
|---|---|---|---|---|---|---|---|
| 48 | 4 | 1024 | 101 | 58 | 92 | 48 | 18 |
| 48 | 4 | 2048 | 109 | 79 | 102 | 55 | 23 |
| 48 | 4 | 4096 | 108 | 92 | 107 | 78 | 27 |
| 48 | 4 | 8192 | 108 | 98 | 110 | 96 | 29 |
| 48 | 4 | 16384 | 109 | 101 | 112 | 107 | 30 |

@zhanglx13 commented

**Benchmarks (TFLOPS)**

**fwd causal=False**

This PR:

| N_CTX | d64-waves2 | d64-waves3 | d128-waves2 | d128-waves3 |
|---|---|---|---|---|
| 1024 | 82.841357 | 99.843707 | 80.969017 | 38.079945 |
| 2048 | 88.786731 | 108.1741 | 88.17704 | 40.056298 |
| 4096 | 91.643274 | 110.840493 | 92.150436 | 40.94952 |
| 8192 | 92.731951 | 108.502042 | 95.371288 | 41.567236 |
| 16384 | 93.154725 | 109.13215 | 96.29653 | 41.749272 |

triton-mlir:

| N_CTX | d64-waves2 | d64-waves3 | d128-waves2 | d128-waves3 |
|---|---|---|---|---|
| 1024 | 82.62418 | 99.66007 | 70.760866 | 13.822745 |
| 2048 | 88.981469 | 108.095599 | 74.312649 | 14.039361 |
| 4096 | 91.777661 | 110.77687 | 76.916985 | 14.165306 |
| 8192 | 92.766691 | 108.49514 | 78.215578 | 14.202539 |
| 16384 | 93.005973 | 109.051868 | 78.136207 | 14.230571 |

**fwd causal=True**

This PR:

| N_CTX | d64-waves2 | d64-waves3 | d128-waves2 | d128-waves3 |
|---|---|---|---|---|
| 1024 | 27.837838 | 6.637805 | 33.545301 | 3.285597 |
| 2048 | 39.584447 | 7.950494 | 37.155564 | 3.715085 |
| 4096 | 48.823643 | 9.170838 | 52.601148 | 4.312236 |
| 8192 | 54.126445 | 9.959126 | 63.314607 | 4.54692 |
| 16384 | 56.297693 | 10.381401 | 68.738105 | 4.680015 |

triton-mlir:

| N_CTX | d64-waves2 | d64-waves3 | d128-waves2 | d128-waves3 |
|---|---|---|---|---|
| 1024 | 27.821601 | 6.644341 | 34.41814 | 5.357317 |
| 2048 | 39.642029 | 7.943067 | 48.589815 | 6.658589 |
| 4096 | 48.824999 | 9.121136 | 60.285348 | 7.383126 |
| 8192 | 54.124562 | 9.959886 | 66.399662 | 7.724 |
| 16384 | 56.184433 | 10.382775 | 68.18188 | 7.868713 |

**bwd kernel**

I used the tutorial benchmark; this PR and triton-mlir show the same perf numbers. For the bwd kernel there is another benchmark repo, which we can try later.

**Conclusions**

  • This PR improves the performance of the fwd kernel with causal=False and d=128 from 78 to 96 TFLOPS.
  • To get the best perf for d=128, just use the default value of waves-per-eu.
  • To get the best perf for d=64, we still need to set waves-per-eu=3 explicitly.

cc @oplavsic @alefimov-amd @scxiao @jayfurmanek @sunway513

@oplavsic (Collaborator, Author) replied

@zhanglx13 Thanks for running benchmarks!
A bit of context that explains some of these results:

The d=128 case uses more registers. With BLOCK_M=128, BLOCK_N=64, which we use for d=64, the kernel uses a bit more than half of the registers available on one SIMD. So it allows only 1 wave per SIMD, yet it doesn't utilize all the registers available. That's why I wanted to increase the block sizes: since we already don't have great occupancy, we can at least launch fewer workgroups and utilize all 512 registers available. It turned out that BLOCK_M=256, BLOCK_N=32 uses around 512 registers, so we still have no spills and can still run only one wave per SIMD, but each wave does more computation. To run this config, I needed to solve some LDS issues that I encountered (detailed explanation in the code).
That is also why you can't gain any performance by setting waves_per_eu to anything larger than 1: since a wave needs about 512 registers, trying to fit more than 1 wave on a SIMD results in a lot of spills.

Another way to go about this problem would be the opposite. Instead of increasing register pressure to better utilize each SIMD, we could decrease it as much as we can so that a wave needs fewer than 256 registers. That way we could naturally fit 2 waves per SIMD, thus increasing occupancy.

TLDR: This PR improves performance by making the best of the already low CU occupancy, increasing register pressure as much as possible without introducing any spills. The register arithmetic is sketched below.
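
For concreteness, a back-of-the-envelope check of the occupancy arithmetic above (a hypothetical helper, assuming the 512-VGPR-per-SIMD budget described in this comment):

```cpp
#include <cstdio>

// How many waves fit on one SIMD given per-wave VGPR usage, assuming a
// 512-VGPR register file per SIMD as discussed above.
static int wavesPerSimd(int vgprsPerWave, int vgprsPerSimd = 512) {
  return vgprsPerWave > 0 ? vgprsPerSimd / vgprsPerWave : 0;
}

int main() {
  // BLOCK_M=256, BLOCK_N=32: ~512 VGPRs per wave -> 1 wave/SIMD, no spills.
  std::printf("~512 VGPRs/wave: %d wave(s)/SIMD\n", wavesPerSimd(512));
  // The opposite direction: get below 256 VGPRs per wave to fit 2 waves.
  std::printf("~255 VGPRs/wave: %d wave(s)/SIMD\n", wavesPerSimd(255));
  return 0;
}
```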

@zhanglx13 commented

I remember that the default setting of waves-per-eu gives us 2 waves per SIMD for d=64 with BLOCK_M=128 and BLOCK_N=64. Have you checked with a thread trace that we only get 1 wave per SIMD?


```cpp
assert(minIdx >= 0 && minIdx < factorizedNumWarps.size());
auto warpsPerCTAPair = factorizedNumWarps[minIdx];
std::tie(tmpCvt, newEpliogueCvt) =
```
typo: epilogue

@oplavsic (Collaborator, Author) replied
Good catch, thanks!
What a coincidence: a few days ago I watched your talk https://www.youtube.com/watch?v=VbFqA9rvxPs. Very interesting presentation! :)

```cpp
// clang-format off
//
// LDS usage of this op is roughly calculated as:
// LDS_USAGE = getShapePerCTA(mfma_layout)[0] * getShapePerCTA(blocked_layoput)[1] * sizeof(data_type)
```


Nitpick:

Suggested change:

```diff
- // LDS_USAGE = getShapePerCTA(mfma_layout)[0] * getShapePerCTA(blocked_layoput)[1] * sizeof(data_type)
+ // LDS_USAGE = getShapePerCTA(mfma_layout)[0] * getShapePerCTA(blocked_layout)[1] * sizeof(data_type)
```

Comment on lines +609 to +788:

```cpp
// LDS_USAGE = warpsPerCTA(mfma_layout)[0] * warpsPerCta(blocked_layout)[1] * C,
// where C = 32 * sizePerWarp(blocked_layout)[1] * threadsPerWarp(blocked_layout)[1] * sizeof(data_type)
```


Are these lines redundant?

```cpp
int tmpCvtLDS = getCvtOpLDSUsage(tmpCvt);
int newCvtLDS = getCvtOpLDSUsage(newEpliogueCvt);
if (tmpCvtLDS <= LDSSize && newCvtLDS <= LDSSize) {
  int LDSUsage = tmpCvtLDS + newCvtLDS;
```


Do the lifetimes of the scratch buffers of tmpCvt and newEpilogueCvt overlap?
If not, I think it is better to take the maximum of these two values instead of summing them. A sketch of that idea follows.
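
A minimal sketch of that suggestion, assuming the two scratch buffers are indeed never live at the same time (whether they are depends on the LDS allocator; peakLdsUsage is a hypothetical name, not code from the PR):

```cpp
#include <algorithm>

// If tmpCvt's scratch buffer is dead before newEpilogueCvt's is allocated,
// the peak LDS footprint is the larger of the two converts, not their sum.
int peakLdsUsage(int tmpCvtLDS, int newCvtLDS) {
  return std::max(tmpCvtLDS, newCvtLDS); // instead of tmpCvtLDS + newCvtLDS
}
```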

@zhanglx13 zhanglx13 changed the base branch from triton-mlir to improve_fa_fwd October 11, 2023 04:25
@zhanglx13 zhanglx13 changed the title Reduce implicit LDS usage of convert ops [DO NOT MERGE] Reduce implicit LDS usage of convert ops Oct 11, 2023
@zhanglx13 zhanglx13 changed the base branch from improve_fa_fwd to triton-mlir October 12, 2023 17:42
@zhanglx13 zhanglx13 changed the title [DO NOT MERGE] Reduce implicit LDS usage of convert ops Reduce LDS usage in epilogue Oct 24, 2023
@zhanglx13 zhanglx13 changed the title Reduce LDS usage in epilogue [FA fwd D=128] Reduce LDS usage in epilogue Oct 24, 2023
@zhanglx13 zhanglx13 merged commit 715a589 into triton-mlir Oct 25, 2023
1 check passed