Fix the vectorized loading of BlockLoad #3517
Conversation
Thank you @ChristinaZ for looking into this. It seems that the root cause is that we have a superfluous template parameter that prevents the compiler from choosing the overload implementing the vectorized load:
#431 describes this limitation. However, as Georgii stated in a comment on that issue, the actual shortcoming is that we need to safeguard against unaligned pointers. I think the right path forward is what Georgii suggested in that comment:
Is this something you could take on?
Yes, I think so. We can use a similar check within that function.
I can reproduce the issue on godbolt: https://godbolt.org/z/dfTa9oe1v
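A minimal host-side sketch of the kind of alignment guard being discussed (the helper name is illustrative, not CUB's actual internals):

```cpp
#include <cstdint>

// Illustrative helper: a vectorized load is only safe when the pointer
// is aligned to the vector type (e.g. a 16-byte float4 for four floats).
template <typename VectorT, typename T>
bool is_aligned_for_vectorized_load(const T* ptr)
{
  return reinterpret_cast<std::uintptr_t>(ptr) % alignof(VectorT) == 0;
}
```

A load path could then branch on this check and fall back to element-wise loads for unaligned pointers.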
Just to summarize the findings from the offline discussion with @miscco and @ChristinaZ: we have this overload that is supposed to be chosen when the iterator type qualifies for vectorization: `cccl/cub/cub/block/block_load.cuh`, lines 882 to 886 (at 7b19a43).
There seem to be two root causes why that overload isn't chosen:
Since the implementation of the overload is a one-liner, my suggestion was to add another overload taking a pointer-to-non-const. As mentioned before, we should address the scenario of a pointer not aligned to the vectorized type (see #431 (comment)).
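The suggested fix can be sketched roughly like this, with free functions and made-up names standing in for the actual member overloads in `block_load.cuh`:

```cpp
// Sketch only: the const-pointer overload does the real work; the
// non-const-pointer overload is a one-liner that delegates to it, so
// that a plain T* also selects the vectorized path during overload
// resolution. Names and signatures are illustrative, not CUB's.
template <int ITEMS_PER_THREAD, typename T>
void load_vectorized(const T* block_ptr, T (&items)[ITEMS_PER_THREAD])
{
  // stand-in for the real vectorized load
  for (int i = 0; i < ITEMS_PER_THREAD; ++i)
  {
    items[i] = block_ptr[i];
  }
}

template <int ITEMS_PER_THREAD, typename T>
void load_vectorized(T* block_ptr, T (&items)[ITEMS_PER_THREAD])
{
  load_vectorized(const_cast<const T*>(block_ptr), items);
}
```

With both overloads present, a `T*` argument picks the non-const overload exactly instead of being diverted to a generic iterator path.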
Could you please also add a small test to test/catch2_test_block_load.cu that loads from an {aligned, unaligned} × {ptr-to-const, ptr-to-non-const} combination?
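A plain-assert sketch of that test matrix (using a hypothetical stand-in loader rather than the real Catch2 test in test/catch2_test_block_load.cu):

```cpp
#include <cassert>

// `load4` is a hypothetical stand-in for the BlockLoad path under test;
// the point here is the four combinations, not the loader itself.
template <typename T>
void load4(const T* in, T (&out)[4])
{
  for (int i = 0; i < 4; ++i)
  {
    out[i] = in[i];
  }
}

template <typename T>
void load4(T* in, T (&out)[4])
{
  load4(const_cast<const T*>(in), out);
}

template <typename Ptr>
void check(Ptr p)
{
  float out[4] = {};
  load4(p, out);
  for (int i = 0; i < 4; ++i)
  {
    // same result regardless of alignment or constness of the source
    assert(out[i] == p[i]);
  }
}
```

The real test would instantiate the BlockLoad specialization instead of `load4` and compare against a reference load.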
Regarding performance concerns: I also checked our tuning policies for the use of BLOCK_LOAD_VECTORIZE, and it seems it is only considered for DeviceSpmv (which is about to be dropped) and in DeviceRadixSort for SM 35 (which is also about to be dropped).
Given that it's not used in any Device* algorithm today, I'd be happy to have it merged. It finally makes the BLOCK_LOAD_VECTORIZE algorithm do what it promises: it provides upside for the vectorized case (i.e., we are seeing considerable improvements in our top-k algorithm) and only minor downside for the case where we now have the extra alignment checks but need to fall back to non-vectorized loading.
No problem. Let me add this test.
Yes, I just re-ran the benchmark with the latest updates in BlockLoad(). I find that adding the address alignment check doesn't hurt the performance of our top-k, and the vectorized data loading really does improve performance compared with the default direct loading.
/ok to test
Note, we have the following line in the block load test:
This means the source file will be compiled twice: (1) once with
🟨 CI finished in 1h 47m: Pass: 95%/90 | Total: 2d 14h | Avg: 41m 20s | Max: 1h 08m | Hits: 216%/12772

|     | Project |
|-----|---------|
|     | CCCL Infrastructure |
|     | libcu++ |
| +/- | CUB |
|     | Thrust |
|     | CUDA Experimental |
|     | python |
|     | CCCL C Parallel Library |
|     | Catch2Helper |

Modifications in project or dependencies?

|     | Project |
|-----|---------|
|     | CCCL Infrastructure |
|     | libcu++ |
| +/- | CUB |
| +/- | Thrust |
|     | CUDA Experimental |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |

🏃 Runner counts (total jobs: 90)

| #  | Runner |
|----|--------|
| 65 | linux-amd64-cpu16 |
| 11 | linux-amd64-gpu-v100-latest-1 |
| 9  | windows-amd64-cpu16 |
| 4  | linux-arm64-cpu16 |
| 1  | linux-amd64-gpu-h100-latest-1-testing |
Force-pushed from 0f36910 to 18d3da9.
/ok to test

1 similar comment

/ok to test
That's great. Thanks a lot for your contribution!
@elstehle Do we need a before/after benchmark? AFAIK, we don't have one for block load. I expect the SASS to change (for good!).
My take on the performance would be this:
Christina has some very positive performance data from her top-k work that confirms the performance upside we get from her fix.
I should learn how to read.
🟩 CI finished in 1h 45m: Pass: 100%/90 | Total: 2d 15h | Avg: 42m 31s | Max: 1h 14m | Hits: 216%/12730

|     | Project |
|-----|---------|
|     | CCCL Infrastructure |
|     | libcu++ |
| +/- | CUB |
|     | Thrust |
|     | CUDA Experimental |
|     | python |
|     | CCCL C Parallel Library |
|     | Catch2Helper |

Modifications in project or dependencies?

|     | Project |
|-----|---------|
|     | CCCL Infrastructure |
|     | libcu++ |
| +/- | CUB |
| +/- | Thrust |
|     | CUDA Experimental |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |

🏃 Runner counts (total jobs: 90)

| #  | Runner |
|----|--------|
| 65 | linux-amd64-cpu16 |
| 9  | windows-amd64-cpu16 |
| 6  | linux-amd64-gpu-rtxa6000-latest-1 |
| 4  | linux-arm64-cpu16 |
| 3  | linux-amd64-gpu-rtx4090-latest-1 |
| 2  | linux-amd64-gpu-rtx2080-latest-1 |
| 1  | linux-amd64-gpu-h100-latest-1 |
Force-pushed from 5cab148 to 4ae725b.
/ok to test
🟩 CI finished in 2h 49m: Pass: 100%/90 | Total: 2d 17h | Avg: 43m 35s | Max: 1h 45m | Hits: 216%/12730

|     | Project |
|-----|---------|
|     | CCCL Infrastructure |
|     | libcu++ |
| +/- | CUB |
|     | Thrust |
|     | CUDA Experimental |
|     | python |
|     | CCCL C Parallel Library |
|     | Catch2Helper |

Modifications in project or dependencies?

|     | Project |
|-----|---------|
|     | CCCL Infrastructure |
|     | libcu++ |
| +/- | CUB |
| +/- | Thrust |
|     | CUDA Experimental |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |

🏃 Runner counts (total jobs: 90)

| #  | Runner |
|----|--------|
| 65 | linux-amd64-cpu16 |
| 9  | windows-amd64-cpu16 |
| 6  | linux-amd64-gpu-rtxa6000-latest-1 |
| 4  | linux-arm64-cpu16 |
| 3  | linux-amd64-gpu-rtx4090-latest-1 |
| 2  | linux-amd64-gpu-rtx2080-latest-1 |
| 1  | linux-amd64-gpu-h100-latest-1 |
Thanks a lot for your contribution, @ChristinaZ
Thanks Elias! It's my honor to contribute to CUB!
Description

Partially addresses #431.

What remains from #431 after this PR is applying the same logic to WARP_LOAD_VECTORIZE.

Hi @elstehle and @bernhardmgruber,

Background:

We (Elias and I) are working on adding a parallel top-k algorithm to CUB. In this work, we are trying to load data using BlockLoad (BLOCK_LOAD_VECTORIZE). However, we found that the data loading used the instruction LDG instead of LDG.128. This PR is intended to fix this issue.

Description:

To show this bug, I wrote a standalone unit test in this repo. To run it, you first need to set the cccl directory CCCL_DIR in the Makefile, then run make to compile. Then we can get the related PTX instructions with

cuobjdump --dump-ptx ./benchmark | grep ld

We can find that it failed to perform vectorized loading. In this repo, we use the template parameter InputIterator for the input. Its value is actually float*, so it should be able to perform vectorized loading.

Running cuobjdump --dump-ptx ./benchmark | grep ld again after the modification, we can get the following result: we can see that after the modification, the vectorized loading is used successfully.
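For reference, the usage pattern at issue can be sketched with a minimal kernel (illustrative only, not the actual top-k code; block size and items-per-thread are made up here):

```cuda
#include <cub/block/block_load.cuh>

// Minimal sketch: with BLOCK_LOAD_VECTORIZE and an aligned float*, the
// four loads per thread should compile to one 128-bit load (LDG.128)
// after this fix, instead of four scalar LDGs.
__global__ void vectorized_load_kernel(float* in, float* out)
{
  using BlockLoadT = cub::BlockLoad<float, 128, 4, cub::BLOCK_LOAD_VECTORIZE>;
  __shared__ typename BlockLoadT::TempStorage temp_storage;

  float items[4];
  BlockLoadT(temp_storage).Load(in + blockIdx.x * 128 * 4, items);

  // keep the loads observable
  out[blockIdx.x * 128 * 4 + threadIdx.x * 4] = items[0];
}
```

Dumping the PTX of the compiled binary with `cuobjdump --dump-ptx` should then show `ld.global.v4.f32` rather than four scalar `ld.global.f32` for these loads.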
Checklist
Related issue
#431