I'm using Flux to train an encoder against a decoder that has already been trained, i.e. the decoder's weights are fixed.
I'm running on a 24 GB GPU. The training data consists of 12,000 images of size 28x28x1. I haven't checked whether the same crash happens on the CPU.
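For reference, the data is batched along the lines of the sketch below; the array names and batch size are placeholders rather than my actual pipeline, and it assumes a Flux version whose `DataLoader` accepts a tuple:

```julia
using Flux
using Flux.Data: DataLoader

# Placeholder arrays standing in for the real Ising data set:
# 12000 grayscale images in WHCN layout, plus one target per image.
images = rand(Float32, 28, 28, 1, 12000)
labels = rand(Float32, 1, 12000)

# Yields one (x, y) tuple per batch; reshuffled each epoch.
data = DataLoader((images, labels), batchsize=128, shuffle=true)
```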
The function where my code crashes is the following (I use this same function for other networks without problems):
```julia
using Statistics: mean

function my_custom_train!(loss, ps, data, opt)
    ps = Flux.Params(ps)
    loss_list = Float32[]                      # per-batch losses for this epoch
    for (x, y) in data
        x, y = x |> gpu, y |> gpu              # move the batch to the GPU
        gs = gradient(() -> loss(x, y), ps)    # gradients w.r.t. the trainable params
        Flux.update!(opt, ps, gs)              # apply the optimiser step
        push!(loss_list, loss(x, y))           # record the post-update loss
    end
    return mean(loss_list)
end
```
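This is driven by an outer loop roughly like the sketch below; the `epochs` field on `hparams` is an assumption for illustration, and `loss`, `ps`, `data`, and `opt` are defined further down:

```julia
# Hypothetical driver loop; `hparams.epochs` is assumed purely for illustration.
for epoch in 1:hparams.epochs
    @info "Epoch $epoch"
    epoch_loss = my_custom_train!(loss, ps, data, opt)
    @info "Mean epoch loss: $epoch_loss"
end
```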
where:
```julia
function encode(x, enc=enc, toLatentμ=toLatentμ, toLatentσ=toLatentσ, toLabel=toLabel, g=g)
    h = enc(x)                                          # shared encoder trunk
    μ, logσ, y = toLatentμ(h), toLatentσ(h), toLabel(h)
    # reparameterisation trick: sample ε on the CPU, move it to the GPU
    z = μ .+ exp.(logσ) .* (randn(Float32, size(logσ, 1)) |> gpu)
    h3 = g(y, z)                                        # fixed, pre-trained decoder
    return h3, μ, logσ, y
end
```
```julia
enc, toLatentμ, toLatentσ, toLabel = return_nn()

# closed-form KL(N(μ, σ²) ‖ N(0, 1)), summed over latent dims and batch
kl_q_p(μ, logσ) = 0.5f0 * sum(exp.(2f0 .* logσ) .+ μ.^2 .- 1f0 .- 2f0 .* logσ)

M = hparams.batch_size

# reconstruction + label + KL terms, with the KL averaged over the batch
function loss(x, y, enc=enc, toLatentμ=toLatentμ, toLatentσ=toLatentσ, toLabel=toLabel, g=g)
    h3, μ, logσ, ŷ = encode(x, enc, toLatentμ, toLatentσ, toLabel, g)
    return Flux.mse(h3, x) + Flux.mae(ŷ, y) + kl_q_p(μ, logσ) * 1 // M
end

opt = ADAM(0.0001, (0.9, 0.8))
ps = params(enc, toLatentμ, toLatentσ, toLabel)
```
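For completeness, `kl_q_p` is the standard closed-form KL divergence between the diagonal Gaussian posterior and a standard normal prior, rewritten in terms of \(\log\sigma\):

```latex
\mathrm{KL}\big(\mathcal{N}(\mu,\sigma^2)\,\big\|\,\mathcal{N}(0,I)\big)
  = \frac{1}{2}\sum_i\big(\sigma_i^2 + \mu_i^2 - 1 - \log\sigma_i^2\big)
  = \frac{1}{2}\sum_i\big(e^{2\log\sigma_i} + \mu_i^2 - 1 - 2\log\sigma_i\big)
```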
The stack trace I'm getting is:

```text
┌ Warning: `haskey(::TargetIterator, name::String)` is deprecated, use `Target(; name = name) !== nothing` instead.
│ caller = llvm_compat(::VersionNumber) at compatibility.jl:176
└ @ CUDAnative ~/.julia/packages/CUDAnative/C91oY/src/compatibility.jl:176
Ising data imported succesfully!
[ Info: Epoch 1
┌ Warning: `Target(triple::String)` is deprecated, use `Target(; triple = triple)` instead.
│ caller = ip:0x0
└ @ Core :-1
FATAL ERROR: Symbol "__nv_tanhf"not found
signal (6): Aborted
in expression starting at /nfs/nfs7/home/jtoledom/GANs/EncodercGAN.jl:202
gsignal at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
abort at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
addModule at /buildworker/worker/package_linux64/build/src/jitlayers.cpp:476
jl_add_to_ee at /buildworker/worker/package_linux64/build/src/jitlayers.cpp:747 [inlined]
jl_finalize_function at /buildworker/worker/package_linux64/build/src/jitlayers.cpp:755
getAddressForFunction at /buildworker/worker/package_linux64/build/src/codegen.cpp:1414
jl_generate_fptr at /buildworker/worker/package_linux64/build/src/codegen.cpp:1510
jl_compile_method_internal at /buildworker/worker/package_linux64/build/src/gf.c:1913
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2154 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2323
my_custom_train! at /nfs/nfs7/home/jtoledom/GANs/EncodercGAN.jl:139
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2159 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2323
train at /nfs/nfs7/home/jtoledom/GANs/EncodercGAN.jl:178
unknown function (ip: 0x7f4a96cef54c)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2159 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2323
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1700 [inlined]
do_call at /buildworker/worker/package_linux64/build/src/interpreter.c:369
eval_value at /buildworker/worker/package_linux64/build/src/interpreter.c:458
eval_stmt_value at /buildworker/worker/package_linux64/build/src/interpreter.c:409 [inlined]
eval_body at /buildworker/worker/package_linux64/build/src/interpreter.c:817
eval_body at /buildworker/worker/package_linux64/build/src/interpreter.c:744
jl_interpret_toplevel_thunk at /buildworker/worker/package_linux64/build/src/interpreter.c:911
jl_toplevel_eval_flex at /buildworker/worker/package_linux64/build/src/toplevel.c:819
jl_parse_eval_all at /buildworker/worker/package_linux64/build/src/ast.c:872
jl_load at /buildworker/worker/package_linux64/build/src/toplevel.c:877
include at ./Base.jl:377
exec_options at ./client.jl:288
_start at ./client.jl:484
jfptr__start_2075.clone_1 at /nfs/nfs7/home/jtoledom/bin/julia-1.4.2/lib/julia/sys.so (unknown line)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2145 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2323
unknown function (ip: 0x401931)
unknown function (ip: 0x401533)
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x4015d4)
Allocations: 437701200 (Pool: 437640951; Big: 60249); GC: 136
Aborted (core dumped)
```
However, I checked each piece individually and every function runs without issue, including the gradient call itself. I have no clue what is happening. Help!
I suspect this isn't enough information, but I honestly don't know what other relevant details to provide. Please let me know.
Thank you.