Simple gpu loop on CUDA does not return #137

Open
pagnani opened this issue Feb 14, 2022 · 1 comment

pagnani commented Feb 14, 2022

On Julia 1.7.2, in a new environment containing only the packages listed below, the following call:

using Tullio, CUDA, LoopVectorization, CUDAKernels, KernelAbstractions
function gpr(N, L)
    Jseq = rand(Float32, N + 2, N + 2, L, L, 2, 2) |> cu
    conditional = rand(Float32, N + 2, N + 2, L, L, 2, 2) |> cu
    @tullio g[nl, nl1, l, xl, xl1] := conditional[ni, nl, i, l, xi, xl] * Jseq[ni, nj, i, j, xi, xj] * conditional[nj, nl1, j, l+1, xj, xl1] * (i <= l) * (j > l) * (j > i + 1)
    
    return g
end
julia> N=5; L=3; gpr(N,L) 

never returns (and GPU usage stays at 100%).

Pkg status:

  [052768ef] CUDA v3.8.0
  [72cfdca4] CUDAKernels v0.3.3
  [63c18a36] KernelAbstractions v0.7.2
  [bdcacae8] LoopVectorization v0.12.101
  [bc48ee85] Tullio v0.3.3

CuDevice(0): TITAN RTX
CUDA 11.0.0

Thanks a lot!

mcabbott (Owner) commented Feb 19, 2022

Thanks for the report. I can reproduce this, but have no idea what causes it.

It works on the CPU, with threads=false (to make it use KernelAbstractions) and verbose=true (to confirm which code path is taken):

julia> N=5; L=3; gpr(N,L)
┌ Info: left index ranges
│   nl = Base.OneTo(7)
│   nl1 = Base.OneTo(7)
│   l = 1:2
│   xl = Base.OneTo(2)
└   xl1 = Base.OneTo(2)
┌ Info: reduction index ranges
│   ni = Base.OneTo(7)
│   i = Base.OneTo(3)
│   xi = Base.OneTo(2)
│   nj = Base.OneTo(7)
│   j = Base.OneTo(3)
└   xj = Base.OneTo(2)
[ Info: running KernelAbstractions CPU actor 
7×7×2×2×2 Array{Float32, 5}:
[:, :, 1, 1, 1] =
 19.8817  22.2586  23.2881  20.2121  19.9547  22.5193  20.0603
 ...
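
For reference, the CPU run above can be sketched roughly as follows. This is only a sketch, assuming the package versions listed in the report; the function name `gpr_cpu` is made up here for illustration, and the arrays are left as plain `Array`s (no `cu`) so that nothing touches the GPU. Whether the KernelAbstractions path is actually taken should be checked against the "running KernelAbstractions CPU actor" line in the verbose output.

```julia
using Tullio, KernelAbstractions

# Hypothetical CPU-only variant of gpr from the report (the name is an assumption).
function gpr_cpu(N, L)
    # Same shapes as in the report, but kept on the CPU.
    Jseq = rand(Float32, N + 2, N + 2, L, L, 2, 2)
    conditional = rand(Float32, N + 2, N + 2, L, L, 2, 2)
    # threads=false disables Tullio's own threading; verbose=true prints the
    # inferred index ranges and which backend is used.
    @tullio threads=false verbose=true g[nl, nl1, l, xl, xl1] := conditional[ni, nl, i, l, xi, xl] * Jseq[ni, nj, i, j, xi, xj] * conditional[nj, nl1, j, l+1, xj, xl1] * (i <= l) * (j > l) * (j > i + 1)
    return g
end

g = gpr_cpu(5, 3)
```

With N=5 and L=3 the result should have size (7, 7, 2, 2, 2), matching the output shown above.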

On the GPU, it still seems to hang even if I comment out the factors * (i <= l) * (j > l) * (j > i + 1).

I wonder whether this is simply too many loops for KA to handle, or whether it hits some optimisation step whose cost grows steeply (e.g. factorially) with the number of loops. 11 nested loops is quite deep, and it may be that nobody has tested that many. If so, the next step is probably to run it with verbose=2, which prints the kernel being used; from that we can try to reproduce the hang without Tullio.
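
To put "11 nested loops" in numbers, here is a small plain-Julia sketch that takes the index ranges from the verbose output above and counts the loops and the total number of innermost iterations. The variable names are illustrative only, and the count ignores the fact that the three comparison factors zero out most terms.

```julia
# Lengths of the index ranges reported by verbose=true for N=5, L=3.
left = (nl = 7, nl1 = 7, l = 2, xl = 2, xl1 = 2)          # output indices
red  = (ni = 7, i = 3, xi = 2, nj = 7, j = 3, xj = 2)     # reduction indices

nloops = length(left) + length(red)                 # number of nested loops
noutput = prod(values(left))                        # elements of g
total = noutput * prod(values(red))                 # innermost iterations

println("loops: ", nloops)        # 11
println("outputs: ", noutput)     # 392
println("iterations: ", total)    # 691488
```

Fewer than a million iterations is a tiny amount of arithmetic for a GPU, which suggests (but does not prove) that the hang is related to the loop depth or kernel generation rather than to the volume of work.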

mcabbott added the GPU label Apr 28, 2023