Directed rounding #2576
I think it would be better to use RoundingMode arguments instead of different functions. Base does this as well: https://github.com/JuliaLang/julia/blob/2590e675885b97579a7531c343a546f6f5bbcbe5/base/rounding.jl#L469-L472.
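To illustrate the suggestion, here is a minimal sketch of how a RoundingMode argument can select an intrinsic through dispatch. The helper names `rounding_suffix` and `intrinsic_name` are hypothetical and are not taken from the PR; only the `rn`/`rz`/`ru`/`rd` suffixes of libdevice's double-precision intrinsics (for example `__nv_dadd_rn`) are assumed to match the real naming.

```julia
# Hypothetical helper: map a Base rounding mode to the suffix used by
# libdevice's double-precision intrinsics (__nv_dadd_rn, __nv_dadd_rz, ...).
rounding_suffix(::RoundingMode{:Nearest}) = "rn"
rounding_suffix(::RoundingMode{:ToZero})  = "rz"
rounding_suffix(::RoundingMode{:Up})      = "ru"
rounding_suffix(::RoundingMode{:Down})    = "rd"

# Hypothetical helper: build the intrinsic name for an operation such as "dadd".
intrinsic_name(op::AbstractString, r::RoundingMode) =
    "__nv_$(op)_$(rounding_suffix(r))"

# Each mode is a singleton type, so the method is chosen by dispatch.
intrinsic_name("dadd", RoundUp)
```

Because `RoundUp`, `RoundDown`, and the other modes are instances of distinct `RoundingMode{...}` types, a single generic function with one method per mode can replace four separately named functions.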
What do you think about defining aliases with RoundingMode? Something like:
I tried to pass a vector of RoundingMode to a CUDA kernel, but it did not work.
Why wouldn't you unify the
You can always inspect the SASS code using
I implemented the calls using a RoundingMode argument. I kept the function names and aliased them, but I can remove the function names if you prefer. I was also able to expose the MMA interface, and tested it against the test kernel in
function kernel_wmma_f64_lowlevel(a_dev, b_dev, c_dev, d_dev)
    a_frag = WMMA.llvm_wmma_load_a_col_m8n8k4_global_stride_f64(pointer(a_dev), 8)
    b_frag = WMMA.llvm_wmma_load_b_col_m8n8k4_global_stride_f64(pointer(b_dev), 4)
    c_frag = WMMA.llvm_wmma_load_c_col_m8n8k4_global_stride_f64(pointer(c_dev), 8)
    #d_frag = WMMA.llvm_wmma_mma_col_col_m8n8k4_f64(a_frag, b_frag, c_frag)
    #d_frag = WMMA.llvm_wmma_mma_col_col_m8n8k4_f64(a_frag, b_frag, c_frag, RoundToZero)
    #d_frag = WMMA.llvm_wmma_mma_col_col_m8n8k4_f64(a_frag, b_frag, c_frag, RoundUp)
    d_frag = WMMA.llvm_wmma_mma_col_col_m8n8k4_f64(a_frag, b_frag, c_frag, RoundDown)
    WMMA.llvm_wmma_store_d_col_m8n8k4_global_stride_f64(pointer(d_dev), d_frag, 8)
    return nothing
end
function call_kernel()
    m = n = 8
    k = 4
    dtype_a = dtype_b = Float64
    dtype_c = dtype_d = Float64
    d_a = CUDA.rand(dtype_a, m, k)
    d_b = CUDA.rand(dtype_b, k, n)
    d_c = CUDA.rand(dtype_c, m, n)
    d_d = CUDA.zeros(dtype_d, m, n)
    CUDA.@sync @cuda kernel_wmma_f64_lowlevel(d_a, d_b, d_c, d_d)
    return nothing
end
Everything seems to work fine! I also checked the numerical results of the operations, and they are correct.
Thanks. Keeping the _rn etc. names seems fine to me, seeing how CUDA defines them as well.
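The aliasing arrangement discussed here could look roughly like the sketch below. It is CPU-only and illustrative: `emulated_add`, `nudge`, and the generated `add_rn`/`add_rz`/`add_ru`/`add_rd` names are assumptions, not the PR's code, and the body emulates directed rounding with an error-free TwoSum instead of calling a GPU intrinsic. It ignores overflow, NaN, and infinities.

```julia
# Shift the round-to-nearest result `s` by one ulp when the exact error `e`
# shows that the true sum lies on the other side of `s` for the requested mode.
nudge(s, e, ::RoundingMode{:Nearest}) = s
nudge(s, e, ::RoundingMode{:Up})      = e > 0 ? nextfloat(s) : s
nudge(s, e, ::RoundingMode{:Down})    = e < 0 ? prevfloat(s) : s
nudge(s, e, ::RoundingMode{:ToZero})  =
    (s > 0 && e < 0) ? prevfloat(s) :
    (s < 0 && e > 0) ? nextfloat(s) : s

# Hypothetical CPU stand-in for a directed-rounding add.
function emulated_add(x::Float64, y::Float64, r::RoundingMode)
    s = x + y                      # round-to-nearest sum
    t = s - x
    e = (x - (s - t)) + (y - t)    # TwoSum: x + y == s + e exactly
    return nudge(s, e, r)
end

# Generate the suffixed aliases (add_rn, add_rz, add_ru, add_rd).
for (suffix, mode) in ((:rn, :RoundNearest), (:rz, :RoundToZero),
                       (:ru, :RoundUp), (:rd, :RoundDown))
    @eval $(Symbol(:add_, suffix))(x, y) = emulated_add(x, y, $mode)
end

add_ru(1.0, 2.0^-54)   # same as emulated_add(1.0, 2.0^-54, RoundUp)
```

The suffixed names stay available for readers who know them from CUDA, while the RoundingMode method remains the single place where behaviour is defined.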
# The binary operations such as add, sub, mul, div have been implemented through a macro

function test_add!(out, x, y)
    I = threadIdx().x
    if I == 1
        out[I] = CUDA.add(x, y, RoundNearest)
    elseif I == 2
        out[I] = CUDA.add(x, y, RoundToZero)
    elseif I == 3
        out[I] = CUDA.add(x, y, RoundUp)
    elseif I == 4
        out[I] = CUDA.add(x, y, RoundDown)
    end
    return
end

out_d = CuArray(zeros(4))
@cuda threads = 4 test_add!(out_d, 1.0, 2^(-54))
out_h = Array(out_d)

function test_sub!(out, x, y)
    I = threadIdx().x
    if I == 1
        out[I] = CUDA.sub(x, y, RoundNearest)
    elseif I == 2
        out[I] = CUDA.sub(x, y, RoundToZero)
    elseif I == 3
        out[I] = CUDA.sub(x, y, RoundUp)
    elseif I == 4
        out[I] = CUDA.sub(x, y, RoundDown)
    end
    return
end

out_d = CuArray(zeros(4))
@cuda threads = 4 test_sub!(out_d, 1.0, 2^(-53))
out_h = Array(out_d)

function test_mul!(out, x, y)
    I = threadIdx().x
    if I == 1
        out[I] = CUDA.mul(x, y, RoundNearest)
    elseif I == 2
        out[I] = CUDA.mul(x, y, RoundToZero)
    elseif I == 3
        out[I] = CUDA.mul(x, y, RoundUp)
    elseif I == 4
        out[I] = CUDA.mul(x, y, RoundDown)
    end
    return
end

out_d = CuArray(zeros(4))
@cuda threads = 4 test_mul!(out_d, 1.0 - 2^(-52), 1.0 + 2^(-52))
out_h = Array(out_d)
Not sure how this part is still relevant to the 'defining an intrinsic' tutorial?
I left only one example.
Unrelated change.
src/device/intrinsics/wmma.jl
Outdated
    "f32" => Float32,
    "f64" => Float64
Unrelated changes?
I added intrinsic calls for WMMA with directed rounding modes.
Can you keep that to a separate PR? We also currently don't support Float64 WMMA, see #1426.
This pull request adds calls to the intrinsics for directed-rounding add, sub, mul, div, and fma, along with a small tutorial on how to expose intrinsics. The generated PTX code is as expected.
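For checking numerical results of the directed operations on the host, one option is a BigFloat reference, sketched below. The helper name `reference` is hypothetical and not part of the PR. The sketch assumes the BigFloat result is exact at the default 256-bit precision, which holds for the product of two Float64 values and for sums whose operands are not too far apart in magnitude, so that the final conversion to Float64 is the only rounding step.

```julia
# Hypothetical host-side reference: do the operation exactly in BigFloat,
# then round once to Float64 in the requested direction.
reference(op, x::Float64, y::Float64, r::RoundingMode) =
    Float64(op(big(x), big(y)), r)

modes = (RoundNearest, RoundToZero, RoundUp, RoundDown)

# Inputs mirroring the add and mul test kernels in this thread.
add_ref = [reference(+, 1.0, 2.0^-54, r) for r in modes]
mul_ref = [reference(*, 1.0 - 2.0^-52, 1.0 + 2.0^-52, r) for r in modes]
```

For the add inputs, only RoundUp moves away from 1.0, to `nextfloat(1.0)`. For the mul inputs, the exact product is 1 - 2^-104, so RoundToZero and RoundDown give `prevfloat(1.0)` while the other two modes give 1.0.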