
FATAL ERROR: Symbol "__nv_tanhf" not found when using Flux on GPU (not much more detail) #36845

Closed
jquetzalcoatl opened this issue Jul 29, 2020 · 7 comments


@jquetzalcoatl

So I'm using Flux to train an encoder while the decoder has already been trained, i.e., the weights in the decoder are fixed.
I'm running on a GPU with 24 GB of memory. The training data is composed of 12,000 images of 28x28x1. I'm not sure if it also crashes on the CPU.

The function where my code crashes is the following (I use this same function for other NNs):

# Requires `using Flux` and `using Statistics` (for `mean`).
function my_custom_train!(loss, ps, data, opt)
  ps = Flux.Params(ps)
  loss_list = []
  for (x, y) in data
    x, y = x |> gpu, y |> gpu
    gs = gradient(() -> loss(x, y), ps)
    Flux.update!(opt, ps, gs)
    append!(loss_list, loss(x, y))  # recomputes the loss once more, for logging
  end
  return mean(loss_list)
end

where:

function encode(x, enc=enc, toLatentμ=toLatentμ, toLatentσ=toLatentσ, toLabel=toLabel, g=g)
  h = enc(x)
  μ, logσ, y = toLatentμ(h), toLatentσ(h), toLabel(h)
  z = μ .+ exp.(logσ) .* (randn(size(logσ, 1)) |> gpu)  # reparameterization trick
  h3 = g(y, z)
  return h3, μ, logσ, y
end
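
Note that randn(size(logσ, 1)) |> gpu samples on the CPU and copies the result to the device on every call. A minimal sketch of sampling directly on the device instead, assuming the CUDA.jl API (in the CUDAnative era the equivalent lived in CuArrays):

using CUDA  # assumption: the post-CUDAnative stack

ϵ = CUDA.randn(Float32, size(logσ, 1))  # sampled on the GPU, no host round-trip
z = μ .+ exp.(logσ) .* ϵ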

enc, toLatentμ, toLatentσ, toLabel = return_nn()

kl_q_p(μ, logσ) = 0.5f0 * sum(exp.(2f0 .* logσ) .+ μ.^2 .- 1f0 .- (2 .* logσ))

M = hparams.batch_size

function loss(x, y, enc=enc, toLatentμ=toLatentμ, toLatentσ=toLatentσ, toLabel=toLabel, g=g)
  h3, μ, logσ, ŷ = encode(x, enc, toLatentμ, toLatentσ, toLabel, g)
  return Flux.mse(h3, x) + Flux.mae(ŷ, y) + kl_q_p(μ, logσ) * 1 // M
end

opt = ADAM(0.0001, (0.9, 0.8))

ps = params(enc, toLatentμ, toLatentσ, toLabel)
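
Here kl_q_p is the closed-form KL divergence between N(μ, exp(logσ)²) and N(0, 1). Putting it all together, a hypothetical call (X and Y are placeholder names for the image array and the labels):

batches = [(X[:, :, :, idx], Y[:, idx]) for idx in Iterators.partition(1:size(X, 4), M)]
epoch_loss = my_custom_train!(loss, ps, batches, opt)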

The stacktrace I'm getting is:

┌ Warning: `haskey(::TargetIterator, name::String)` is deprecated, use `Target(; name = name) !== nothing` instead.
│   caller = llvm_compat(::VersionNumber) at compatibility.jl:176
└ @ CUDAnative ~/.julia/packages/CUDAnative/C91oY/src/compatibility.jl:176
Ising data imported succesfully!
[ Info: Epoch 1
┌ Warning: `Target(triple::String)` is deprecated, use `Target(; triple = triple)` instead.
│   caller = ip:0x0
└ @ Core :-1
FATAL ERROR: Symbol "__nv_tanhf"not found
signal (6): Aborted
in expression starting at /nfs/nfs7/home/jtoledom/GANs/EncodercGAN.jl:202
gsignal at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
abort at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
addModule at /buildworker/worker/package_linux64/build/src/jitlayers.cpp:476
jl_add_to_ee at /buildworker/worker/package_linux64/build/src/jitlayers.cpp:747 [inlined]
jl_finalize_function at /buildworker/worker/package_linux64/build/src/jitlayers.cpp:755
getAddressForFunction at /buildworker/worker/package_linux64/build/src/codegen.cpp:1414
jl_generate_fptr at /buildworker/worker/package_linux64/build/src/codegen.cpp:1510
jl_compile_method_internal at /buildworker/worker/package_linux64/build/src/gf.c:1913
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2154 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2323
my_custom_train! at /nfs/nfs7/home/jtoledom/GANs/EncodercGAN.jl:139
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2159 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2323
train at /nfs/nfs7/home/jtoledom/GANs/EncodercGAN.jl:178
unknown function (ip: 0x7f4a96cef54c)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2159 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2323
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1700 [inlined]
do_call at /buildworker/worker/package_linux64/build/src/interpreter.c:369
eval_value at /buildworker/worker/package_linux64/build/src/interpreter.c:458
eval_stmt_value at /buildworker/worker/package_linux64/build/src/interpreter.c:409 [inlined]
eval_body at /buildworker/worker/package_linux64/build/src/interpreter.c:817
eval_body at /buildworker/worker/package_linux64/build/src/interpreter.c:744
jl_interpret_toplevel_thunk at /buildworker/worker/package_linux64/build/src/interpreter.c:911
jl_toplevel_eval_flex at /buildworker/worker/package_linux64/build/src/toplevel.c:819
jl_parse_eval_all at /buildworker/worker/package_linux64/build/src/ast.c:872
jl_load at /buildworker/worker/package_linux64/build/src/toplevel.c:877
include at ./Base.jl:377
exec_options at ./client.jl:288
_start at ./client.jl:484
jfptr__start_2075.clone_1 at /nfs/nfs7/home/jtoledom/bin/julia-1.4.2/lib/julia/sys.so (unknown line)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2145 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2323
unknown function (ip: 0x401931)
unknown function (ip: 0x401533)
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x4015d4)
Allocations: 437701200 (Pool: 437640951; Big: 60249); GC: 136
Aborted (core dumped)

However, I checked, and every function runs without any issue on its own, including the gradient. I have no clue what is happening. Help!
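
(For reference, the kind of isolated check meant above; the shapes below are hypothetical:)

x = rand(Float32, 28, 28, 1, 128) |> gpu  # dummy batch
y = rand(Float32, 10, 128) |> gpu         # dummy labels; the label width is a guess
loss(x, y)                                # fine in isolation
gradient(() -> loss(x, y), ps)            # also fine in isolation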

I guess this info is not enough, but I honestly don't know what other relevant information I should provide. Please let me know.

Thank you.

@yuyichao
Contributor

It's probably an issue with whatever you use to generate the GPU code.

@jquetzalcoatl
Author

@yuyichao, Thank you.
Do you mind elaborating a bit more? I'm not following.

I use |> gpu to move things to the GPU. Do you mean it's something related to that?

If so, what other ways are there to generate GPU code?

Any advice on what to look for to find the bug?

@yuyichao
Contributor

I'm not familiar with the GPU stack, so I can't help you with that. All I'm saying is that the issue you observe is not a Julia bug. It's a bug in wherever your GPU function comes from, maybe CUDAnative or something like that.

@dfenn

dfenn commented Oct 13, 2020

Did you end up figuring this out? I'm getting something very similar, except the function it can't find is __nv_sqrt.

I've tried various versions of the system CUDA with various driver combinations, plus the artifact-provided CUDA, but nothing makes a difference. It looks like it can't find a definition for the square-root function, yet it seems able to find everything else, which is very confusing to me.

@vchuravy
Member

This likely means you are executing GPU code on the CPU.
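
The __nv_* functions live in NVIDIA's libdevice and are only resolvable inside GPU kernels, so the fatal error appears when host-compiled code ends up referencing one. A quick guard that turns silent CPU fallbacks into loud errors, assuming the newer CUDA.jl stack (older setups used CuArrays.allowscalar):

using CUDA
CUDA.allowscalar(false)  # make scalar indexing (a common CPU-fallback path) error out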

@jquetzalcoatl
Author

Make sure you have everything up to date, especially CUDA (I think that was the reason I was getting that message). Do
] st
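
That is, from the Julia REPL (pressing ] enters Pkg mode; up is added here as the natural follow-up):

(@v1.4) pkg> st   # list installed packages and their versions
(@v1.4) pkg> up   # upgrade them, CUDA and Flux included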

@dfenn

dfenn commented Oct 13, 2020

Thank you both for your replies. I made sure everything was up to date, but that didn't fix the issue for me.

I'm fairly certain the issue I was having is the same as the one documented here: JuliaGPU/CUDA.jl#228, which also supports @vchuravy's comment. Like the poster in that thread, I had a doubly-wrapped CuArray, and the inner wrapper was a transpose. Getting rid of the transpose fixed the issue for me. It seems that, for reasons I don't really understand, the transpose was taking place on the CPU.
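
A minimal sketch of that shape of problem and the workaround, assuming CUDA.jl; whether the bad line actually aborts depends on the package versions involved:

using CUDA

x   = CUDA.rand(Float32, 4, 4)
bad = reshape(transpose(x), :)  # ReshapedArray around Transpose around CuArray:
                                # the doubly-wrapped case from CUDA.jl#228
y   = sqrt.(bad)                # may run on the CPU and abort with
                                #   Symbol "__nv_sqrt" not found
ok  = sqrt.(reshape(copy(transpose(x)), :))  # materializing the transpose into a
                                             # plain CuArray keeps it on the GPU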
