Exploring Global Reduce Optimization: Could Reducing Memory Roundtrips Improve Performance? #39
I'd like to raise a small optimization idea for the global reduce stage. I understand the existing approach and its rationale: cp.async brings the previously stored C values back into shared memory so they can be accumulated in FP32 before being written back as half. However, I wonder whether this step could be simplified to cut unnecessary memory roundtrips.
Instead of going Shared → Global → Shared in the pipeline, we could load directly from global memory into registers and perform the reduction there, using either a warp shuffle or a plain __ldg global load. This would eliminate the intermediate cp.async operation and the extra memory traffic that comes with it.
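To make the idea concrete, here is a minimal sketch of a reduce step that loads the previous partial sums straight from global memory into registers, accumulates in FP32, and stores the result back as half. It is not the kernel from this repository: `reduce_direct` and `reduce_direct_max_error` are hypothetical names, and the FP32 accumulators are read from a buffer only so the example is self-contained (in the real kernel they would already be in registers).

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

// Hypothetical reduce step (illustration only, not the repository's kernel).
// C   : partial sums from the previous block, stored as half
// acc : this block's FP32 partial sums (stand-in for register accumulators)
__global__ void reduce_direct(half* C, const float* acc, int n_pairs) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_pairs) return;
    half2* C2 = reinterpret_cast<half2*>(C);
    const float2* acc2 = reinterpret_cast<const float2*>(acc);
    half2 prev = C2[i];                  // global -> register, no shared-memory staging
    float2 sum = __half22float2(prev);   // widen to FP32 before accumulating
    float2 a = acc2[i];
    sum.x += a.x;
    sum.y += a.y;
    C2[i] = __float22half2_rn(sum);      // round once and store back as half
}

// Host helper: runs one reduce step and returns the maximum absolute error
// against an FP32 reference. n must be even.
float reduce_direct_max_error(int n) {
    std::vector<half> hC(n);
    std::vector<float> hAcc(n), ref(n);
    for (int i = 0; i < n; ++i) {
        float prev = 0.5f * (i % 8);     // values exactly representable in half
        hAcc[i] = 0.25f;
        hC[i] = __float2half(prev);
        ref[i] = prev + hAcc[i];
    }
    half* dC = nullptr;
    float* dAcc = nullptr;
    cudaMalloc(&dC, n * sizeof(half));
    cudaMalloc(&dAcc, n * sizeof(float));
    cudaMemcpy(dC, hC.data(), n * sizeof(half), cudaMemcpyHostToDevice);
    cudaMemcpy(dAcc, hAcc.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int pairs = n / 2;
    reduce_direct<<<(pairs + 255) / 256, 256>>>(dC, dAcc, pairs);
    // Blocking copy on the default stream; it waits for the kernel to finish.
    cudaMemcpy(hC.data(), dC, n * sizeof(half), cudaMemcpyDeviceToHost);
    cudaFree(dC);
    cudaFree(dAcc);

    float max_err = 0.0f;
    for (int i = 0; i < n; ++i)
        max_err = fmaxf(max_err, fabsf(__half2float(hC[i]) - ref[i]));
    return max_err;
}
```

The sketch uses a plain global load rather than __ldg. __ldg goes through the read-only data cache, which the CUDA documentation describes as intended for data that stays read-only for the lifetime of the kernel; whether that holds for C values written by other thread blocks in the same launch would need to be checked before relying on it.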
To validate this, I ran some performance measurements. The change did not always help, and some cases were slightly slower, but in other cases performance doubled as the problem size grew. This suggests that avoiding redundant memory copies can have a positive impact, particularly for smaller kernels where every cycle counts.
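For anyone who wants to reproduce this kind of comparison outside the unit test, one possible approach is to time each variant with CUDA events. The sketch below is an assumption about how such a harness could look, not the method used for the numbers in this PR; `time_kernel_ms`, `scale_kernel`, and `time_scale_kernel_ms` are hypothetical names, and `scale_kernel` is only a placeholder workload.

```cuda
#include <cuda_runtime.h>

// Placeholder workload standing in for the kernel under test.
__global__ void scale_kernel(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f + 1.0f;
}

// Times a launch callable with CUDA events and returns the average
// milliseconds per launch, after a few untimed warm-up launches.
template <typename Launch>
float time_kernel_ms(Launch launch, int warmup = 3, int iters = 20) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    for (int i = 0; i < warmup; ++i) launch();
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i) launch();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);          // wait until all timed launches finish
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / iters;
}

// Example: average time of the placeholder kernel over n elements.
float time_scale_kernel_ms(int n) {
    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));
    int block = 256;
    int grid = (n + block - 1) / block;
    float ms = time_kernel_ms([=] { scale_kernel<<<grid, block>>>(d, n); });
    cudaFree(d);
    return ms;
}
```

Running the same harness over both the as-is and the changed kernel, across several problem sizes, would show where the crossover between the two variants lies.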
I’d love to hear your thoughts on this! Do you think reducing memory roundtrips in this way could be beneficial in some cases?
As-Is Code
Changed Code
Elapsed Time (Unit Test): As-Is Code
Elapsed Time (Unit Test): Changed Code