Add alpaka::getPreferredWarpSize(dev)
#2216
Conversation
alpaka::getPreferredWarpSize(dev) returns one of the warp sizes supported by the device. On devices that support a single warp size (CPU, CUDA GPU, ROCm GPU), getPreferredWarpSize(dev) avoids the overhead of wrapping that value in an std::vector. On devices that support multiple warp sizes, the value returned by getPreferredWarpSize(dev) is unspecified. Currently it returns the largest supported value -- but this could change in a future version of alpaka. Signed-off-by: Andrea Bocci <[email protected]>
5e2db10 to 2b368fc
I am just curious about the purpose of this API. Is the main goal to avoid the heap allocation of the std::vector? Because we could just change the API to either return e.g. a …
@@ -181,10 +182,22 @@ namespace alpaka::trait
    auto find64 = std::find(warp_sizes.begin(), warp_sizes.end(), 64);
    if(find64 != warp_sizes.end())
        warp_sizes.erase(find64);
    // Sort the warp sizes in decreasing order
    std::sort(warp_sizes.begin(), warp_sizes.end(), std::greater<>{});
[Nit] If the vector can somehow be large, sorting first and then searching is faster than finding, deleting, and then sorting: after the one-time O(n log n) sort, each lookup with std::lower_bound takes O(log n) time, whereas std::find is linear.
The largest vector I encountered had 5 elements: { 4, 8, 16, 32, 64 }.
@mehmetyusufoglu your analysis is correct, but I am with @fwyzard: the supported warp sizes are probably a small set here :) Also, binary search is slower than linear search for small sizes due to its data-dependent access pattern. So @fwyzard's version is probably faster for our use case :D
auto find64 = std::find(warp_sizes.begin(), warp_sizes.end(), 64);
if(find64 != warp_sizes.end())
    warp_sizes.erase(find64);
Btw, in C++20, this should be just std::erase(warp_sizes, 64). Looking forward to the upgrade :)
@fwyzard if you want the PR merged, please mark the PR as Ready for review, thx!
Thanks for the review. I've marked it as a draft because I want to figure out first how it interacts with caching the device information.
Add a test for alpaka::getPreferredWarpSize(dev).