
FATAL ERROR: Symbol "__nv_tanhf" not found when using Flux on the GPU (not much more detail) #36845

Closed
@jquetzalcoatl

Description

I'm using Flux to train an encoder; the decoder has already been trained, i.e., its weights are fixed.
I'm running on a 24 GB GPU, and the training data consists of 12,000 images of size 28x28x1. I haven't checked whether the same code crashes on the CPU.
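
In case it matters, the data is batched roughly along these lines before it reaches the training loop (just a sketch: train_x, train_y, and the loader that produces them are my own code, not shown here):

# Rough sketch of how the 12000 28x28x1 images are batched; train_x is a
# 28x28x1x12000 Float32 array and train_y holds the labels (both from my loader).
using Base.Iterators: partition
data = [(train_x[:, :, :, idx], train_y[:, idx])
        for idx in partition(1:12000, hparams.batch_size)]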

The function where my code crashes is the following (I use this same function for other networks):

using Flux, Statistics

function my_custom_train!(loss, ps, data, opt)
  ps = Flux.Params(ps)
  loss_list = []
  for (x, y) in data
    # move the batch to the GPU, take a gradient step, and record the loss
    x, y = x |> gpu, y |> gpu
    gs = gradient(() -> loss(x, y), ps)
    Flux.update!(opt, ps, gs)
    append!(loss_list, loss(x, y))
  end
  return mean(loss_list)
end
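
For completeness, the loop is invoked roughly like this from my train function (a sketch: hparams.epochs and the epoch loop are my own code, summarized here):

for epoch in 1:hparams.epochs
    @info "Epoch $epoch"                          # matches the "[ Info: Epoch 1" line in the log below
    avg = my_custom_train!(loss, ps, data, opt)   # the crash happens inside this call
    @show avg
end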

where:

function encode(x, enc=enc, toLatentμ=toLatentμ, toLatentσ=toLatentσ, toLabel=toLabel, g=g)
  h = enc(x)
  μ, logσ, y = toLatentμ(h), toLatentσ(h), toLabel(h)
  z = μ + exp.(logσ) .* (randn(size(logσ, 1)) |> gpu)   # reparameterization trick
  h3 = g(y, z)
  return h3, μ, logσ, y
end

enc, toLatentμ, toLatentσ, toLabel = return_nn()

# KL divergence between q(z|x) = N(μ, exp(logσ)²) and the standard normal prior
kl_q_p(μ, logσ) = 0.5f0 * sum(exp.(2f0 .* logσ) .+ μ.^2 .- 1f0 .- (2 .* logσ))

M = hparams.batch_size

# reconstruction (mse) + label (mae) + KL term, with the KL scaled by 1/batch_size
loss(x, y, enc=enc, toLatentμ=toLatentμ, toLatentσ=toLatentσ, toLabel=toLabel, g=g) =
    ((h3, μ, logσ, ŷ) = encode(x, enc, toLatentμ, toLatentσ, toLabel, g);
     Flux.mse(h3, x) + Flux.mae(ŷ, y) + kl_q_p(μ, logσ) * (1 // M))

opt = ADAM(0.0001, (0.9, 0.8))

ps = params(enc, toLatentμ, toLatentσ, toLabel)
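
Before running the full loop I also did a quick sanity check on a single batch, and it goes through (sketch; x1 and y1 stand for one image/label batch from my data):

# Single-batch sanity check; each of these runs fine on its own, which is
# what makes the crash inside the full loop so confusing.
x1, y1 = first(data) .|> gpu
@show loss(x1, y1)                      # forward pass and the three loss terms
gs = gradient(() -> loss(x1, y1), ps)   # gradient also compiles and runs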

The stacktrace I'm getting is:

┌ Warning: `haskey(::TargetIterator, name::String)` is deprecated, use `Target(; name = name) !== nothing` instead.
│   caller = llvm_compat(::VersionNumber) at compatibility.jl:176
└ @ CUDAnative ~/.julia/packages/CUDAnative/C91oY/src/compatibility.jl:176
Ising data imported succesfully!
[ Info: Epoch 1
┌ Warning: `Target(triple::String)` is deprecated, use `Target(; triple = triple)` instead.
│   caller = ip:0x0
└ @ Core :-1
FATAL ERROR: Symbol "__nv_tanhf"not found
signal (6): Aborted
in expression starting at /nfs/nfs7/home/jtoledom/GANs/EncodercGAN.jl:202
gsignal at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
abort at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
addModule at /buildworker/worker/package_linux64/build/src/jitlayers.cpp:476
jl_add_to_ee at /buildworker/worker/package_linux64/build/src/jitlayers.cpp:747 [inlined]
jl_finalize_function at /buildworker/worker/package_linux64/build/src/jitlayers.cpp:755
getAddressForFunction at /buildworker/worker/package_linux64/build/src/codegen.cpp:1414
jl_generate_fptr at /buildworker/worker/package_linux64/build/src/codegen.cpp:1510
jl_compile_method_internal at /buildworker/worker/package_linux64/build/src/gf.c:1913
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2154 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2323
my_custom_train! at /nfs/nfs7/home/jtoledom/GANs/EncodercGAN.jl:139
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2159 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2323
train at /nfs/nfs7/home/jtoledom/GANs/EncodercGAN.jl:178
unknown function (ip: 0x7f4a96cef54c)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2159 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2323
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1700 [inlined]
do_call at /buildworker/worker/package_linux64/build/src/interpreter.c:369
eval_value at /buildworker/worker/package_linux64/build/src/interpreter.c:458
eval_stmt_value at /buildworker/worker/package_linux64/build/src/interpreter.c:409 [inlined]
eval_body at /buildworker/worker/package_linux64/build/src/interpreter.c:817
eval_body at /buildworker/worker/package_linux64/build/src/interpreter.c:744
jl_interpret_toplevel_thunk at /buildworker/worker/package_linux64/build/src/interpreter.c:911
jl_toplevel_eval_flex at /buildworker/worker/package_linux64/build/src/toplevel.c:819
jl_parse_eval_all at /buildworker/worker/package_linux64/build/src/ast.c:872
jl_load at /buildworker/worker/package_linux64/build/src/toplevel.c:877
include at ./Base.jl:377
exec_options at ./client.jl:288
_start at ./client.jl:484
jfptr__start_2075.clone_1 at /nfs/nfs7/home/jtoledom/bin/julia-1.4.2/lib/julia/sys.so (unknown line)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2145 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2323
unknown function (ip: 0x401931)
unknown function (ip: 0x401533)
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x4015d4)
Allocations: 437701200 (Pool: 437640951; Big: 60249); GC: 136
Aborted (core dumped)

However, I checked, and every function runs without issue when called on its own, including the gradient. I have no clue what is happening. HELP!
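
If it helps with triage: as far as I can tell, __nv_tanhf is the single-precision tanh from CUDA's libdevice, so presumably a broadcasted tanh somewhere in the model is what fails to link. A minimal snippet that should exercise the same code path (just a guess on my side) would be:

# Minimal sketch: broadcasting tanh over a CuArray should hit the same
# __nv_tanhf libdevice intrinsic mentioned in the error.
using Flux
x = gpu(rand(Float32, 28, 28, 1, 4))   # dummy batch shaped like my images
y = tanh.(x)                           # forces a GPU broadcast kernel to be compiled
@show typeof(y)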

I suspect this information is not enough, but I honestly don't know what other relevant details I should provide. Please let me know.
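
In the meantime, here is what I can easily collect and attach (let me know if more is needed):

# Environment details I can attach to the issue.
using Pkg
versioninfo()        # Julia 1.4.2, OS, CPU, etc.
Pkg.status()         # exact Flux / CuArrays / CUDAnative versions in this environment
run(`nvidia-smi`)    # GPU model and driver version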

Thank you.
