Remove kpack from the decision of how many elements to copy per thread #1714
Conversation
Codecov Report: all modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@ Coverage Diff @@
## develop #1714 +/- ##
===========================================
- Coverage 78.88% 78.60% -0.28%
===========================================
Files 100 100
Lines 28346 28347 +1
Branches 4130 4130
===========================================
- Hits 22361 22283 -78
- Misses 4368 4402 +34
- Partials 1617 1662 +45
@@ -127,8 +127,7 @@ computeCopyPerThread(Type elementType, int64_t copyPerThread, int64_t kPerBlock,
     copyDPerThread = math_util::gcd(maxVlen, copyPerThread);
     copyKPerThread = copyPerThread / copyDPerThread;
   } else {
-    copyKPerThread =
-        math_util::gcd(maxVlen, math_util::gcd(kpack, copyPerThread));
+    copyKPerThread = math_util::gcd(maxVlen, copyPerThread);
copyPerThread = (KPerBlock * MPerBlock) / blockSize
where KPerBlock = kPack * kPacksPerBlock
and blockSize = ((MPerBlock * NPerBlock) / (MPerWave * NPerWave)) * waveSize
therefore
copyPerThread = (KPerBlock * (MPerWave * NPerWave)) / (NPerBlock * waveSize)
So it seems the GCD should always be kPack:
gcd(kPack, copyPerThread) = kPack
So the previous logic was indeed limiting copyKPerThread to kPack.
I think we should run some more experiments on larger-bitwidth dtypes. I am not sure how it affects register pressure.
You can change the rocMLIR SHA similarly to this PR on the MIGraphX side (ROCm/AMDMIGraphX#3743), and MIGraphX CI would then run more models with this change.
Ran some tests with BERT Large and ResNet50. I didn't see any performance drop; performance improvements were very small. It would be good to merge, though, since MIGraphX needs to support int8 as a return type to take full advantage of this PR.
kpack is used to decide how many K elements to copy per thread. Currently we do:

copyKPerThread = math_util::gcd(maxVlen, math_util::gcd(kpack, copyPerThread));

For example, if kpack=8, copyPerThread=16, and maxVlen=16, copyKPerThread would be limited to 8. I think kpack should not be used to limit copyKPerThread. This PR changes the code to:

copyKPerThread = math_util::gcd(maxVlen, copyPerThread);
I've run some experiments on gfx942 for int8 (which is affected by this change, because maxVlen = 16 in that case), and they show it improves performance. Note these are the unfused int8 kernels from ResNet50 (int8 input, int8 output, batch size 32):
I've tested "sdxl-gemm-configs", "attention-configs", and "sdxl-conv-configs". All configs but one have the same performance (-5% to +5%); the following sdxl-conv config gets a 1.25x speed-up: