Segmentation faults on aarch64-linux starting from introduction of extension of KernelAbstractions #677

Open
giordano opened this issue Feb 1, 2025 · 8 comments

@giordano
Member

giordano commented Feb 1, 2025

Starting from #667 we have seen lots of segmentation faults on aarch64:

Julia 1.10 - integration - ubuntu-24.04-arm - aarch64 - packaged libReactant - assertions=false - push:

Failed to precompile ReactantStatisticsExt [963ed91e-491b-54ce-bb4b-249dcb1ed2bb] to "/home/runner/.julia/compiled/v1.10/ReactantStatisticsExt/jl_B0YRz3".
2025-02-01 04:56:31.905782: I external/xla/xla/service/llvm_ir/llvm_command_line_options.cc:51] XLA (re)initializing LLVM with options fingerprint: 11962807958986418783

[4252] signal (11.1): Segmentation fault
in expression starting at /home/runner/work/Reactant.jl/Reactant.jl/src/Precompile.jl:60
last_fde at /workspace/srcdir/gcc-13.2.0/libgcc/unwind-dw2-fde.h:174 [inlined]
classify_object_over_fdes at /workspace/srcdir/gcc-13.2.0/libgcc/unwind-dw2-fde.c:727
init_object at /workspace/srcdir/gcc-13.2.0/libgcc/unwind-dw2-fde.c:888 [inlined]
_Unwind_Find_registered_FDE at /workspace/srcdir/gcc-13.2.0/libgcc/unwind-dw2-fde.c:1210 [inlined]
_Unwind_Find_FDE at /workspace/srcdir/gcc-13.2.0/libgcc/unwind-dw2-fde-dip.c:541
uw_frame_state_for at /workspace/srcdir/gcc-13.2.0/libgcc/unwind-dw2.c:1005
_Unwind_Backtrace at /workspace/srcdir/gcc-13.2.0/libgcc/unwind.inc:303
__backtrace at /lib/aarch64-linux-gnu/libc.so.6 (unknown line)
tsl::CurrentStackTrace[abi:cxx11]() at /home/runner/.julia/artifacts/e08cc4d821f228b8f487acd163930546f0b6ff17/lib/libReactantExtra.so (unknown line)
xla::cpu::RecordCpuCompilerStacktrace() at /home/runner/.julia/artifacts/e08cc4d821f228b8f487acd163930546f0b6ff17/lib/libReactantExtra.so (unknown line)
xla::cpu::CpuCompiler::RunBackend(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, stream_executor::StreamExecutor*, xla::Compiler::CompileOptions const&) at /home/runner/.julia/artifacts/e08cc4d821f228b8f487acd163930546f0b6ff17/lib/libReactantExtra.so (unknown line)
xla::TfrtCpuClient::Compile(xla::XlaComputation const&, xla::CompileOptions) at /home/runner/.julia/artifacts/e08cc4d821f228b8f487acd163930546f0b6ff17/lib/libReactantExtra.so (unknown line)
xla::TfrtCpuClient::Compile(mlir::ModuleOp, xla::CompileOptions) at /home/runner/.julia/artifacts/e08cc4d821f228b8f487acd163930546f0b6ff17/lib/libReactantExtra.so (unknown line)
ClientCompile at /home/runner/.julia/artifacts/e08cc4d821f228b8f487acd163930546f0b6ff17/lib/libReactantExtra.so (unknown line)
Compile at /home/runner/work/Reactant.jl/Reactant.jl/src/XLA.jl:567 [inlined]
#compile_xla#30 at /home/runner/work/Reactant.jl/Reactant.jl/src/Compiler.jl:1037
compile_xla at /home/runner/work/Reactant.jl/Reactant.jl/src/Compiler.jl:986 [inlined]
#compile#35 at /home/runner/work/Reactant.jl/Reactant.jl/src/Compiler.jl:1055
compile at /home/runner/work/Reactant.jl/Reactant.jl/src/Compiler.jl:1054

Julia 1.11 - integration - ubuntu-24.04-arm - aarch64 - packaged libReactant - assertions=false - push:

[5253] signal 11 (1): Segmentation fault
in expression starting at /home/runner/work/Reactant.jl/Reactant.jl/test/integration/cuda.jl:23
unknown function (ip: 0xffb7703830a0)
xla::cpu::CustomCallThunk::CallUntypedAPI(xla::cpu::Thunk::ExecuteParams const&) at /home/runner/.julia/artifacts/e08cc4d821f228b8f487acd163930546f0b6ff17/lib/libReactantExtra.so (unknown line)
Allocations: 222781207 (Pool: 222775637; Big: 5570); GC: 82
ERROR: LoadError: Package Reactant errored during testing (received signal: 11)

@wsmoses
Member

wsmoses commented Feb 1, 2025

Can we run with the debug aarch64 jll and see if it says anything?

@wsmoses
Member

wsmoses commented Feb 1, 2025

[and/or we should find access to a machine and run in gdb to see what's happening]

@giordano
Member Author

giordano commented Feb 2, 2025

> [and/or we should find access to a machine and run in gdb to see what's happening]

I set up a workflow in this branch to use mxschmitt/action-tmate to log into the CI machine. Once you SSH into it you can just run

gdb --args julia --color=yes --project=test test/runtests.jl

and can reproduce the failure, but the problem is that the stack is corrupted:

(gdb) bt
#0  0x0000fffff7d590a0 in ?? ()
#1  0x0000ffffffff76b0 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Now I need to go and can't investigate this further, but note that this seems to happen specifically with REACTANT_TEST_GROUP=integration.

If you want to try it yourself you can restart https://github.com/EnzymeAD/Reactant.jl/actions/runs/13099156774
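
For anyone reproducing this from the tmate session, the gdb invocation and the test-group restriction mentioned above can be combined into one non-interactive run. This is only a sketch: it assumes gdb's standard -batch and -ex options, and that test/runtests.jl picks up REACTANT_TEST_GROUP from the environment, as the note above implies. The run_repro helper and its DRY_RUN switch are hypothetical names used for illustration, not part of the Reactant.jl repository.

```shell
# Sketch: reproduce the integration-group crash under gdb in batch mode.
# run_repro and DRY_RUN are illustrative names, not part of the repository.
run_repro() {
    # -batch exits gdb when the commands finish; each -ex runs one gdb command.
    set -- gdb -batch -ex run -ex bt \
        --args julia --color=yes --project=test test/runtests.jl
    if [ "${DRY_RUN:-0}" = 1 ]; then
        # Print the command instead of executing it, for inspection.
        echo "REACTANT_TEST_GROUP=integration $*"
    else
        REACTANT_TEST_GROUP=integration "$@"
    fi
}
```

Calling DRY_RUN=1 run_repro prints the command line without running it. Calling run_repro on the aarch64 machine runs the tests under gdb and prints a backtrace at the first stop, although with a corrupted stack the backtrace may be as short as the one shown above.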

@giordano
Member Author

giordano commented Feb 2, 2025

Couple of comments:

@giordano
Member Author

giordano commented Feb 2, 2025

I got slightly more information with a debug build of Reactant:

[10602] signal 11 (1): Segmentation fault
in expression starting at /home/runner/work/Reactant.jl/Reactant.jl/test/integration/cuda.jl:20
unknown function (ip: 0xff1675a9f0a0)
forwarding_custom_call at /proc/self/cwd/external/enzyme_ad/src/enzyme_ad/jax/cpu.cc:9
Allocations: 147210547 (Pool: 147205892; Big: 4655); GC: 42
Segmentation fault (core dumped)

That's https://github.com/EnzymeAD/Enzyme-JAX/blob/859aaf7f00659318f4ca4a524cd7f44cf175c6fc/src/enzyme_ad/jax/cpu.cc#L9.

In GDB:

(gdb) bt
#0  0x0000fffff7d9c0a0 in ?? ()
#1  0x0000ffffffff7ba0 in ?? ()
#2  0x0000ffffb8a21db4 in xla::cpu::CustomCallThunk::Execute (this=0xffffffff79b8, params=...) at external/xla/xla/backends/cpu/runtime/custom_call_thunk.cc:224
#3  0x0000ffffb8a21db4 in xla::cpu::CustomCallThunk::Execute (this=0x0, params=...) at external/xla/xla/backends/cpu/runtime/custom_call_thunk.cc:224
#4  0x0000ffffffff7c78 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

That line is https://github.com/openxla/xla/blob/c2a9a2dfe9494e52f5134b53989e9ca0de307dfe/xla/backends/cpu/runtime/custom_call_thunk.cc#L224, but note that in frame #3 the this pointer is null (this=0x0).

@wsmoses
Member

wsmoses commented Feb 2, 2025

@giordano just for fun, does EnzymeAD/Enzyme-JAX#306 fix it?

@giordano
Member Author

giordano commented Feb 2, 2025

Also, for reference, debug information from XLA by setting TF_CPP_MAX_VLOG_LEVEL=3: https://gist.github.com/giordano/5d56acc38c60ed1923db254d0a1ab59c
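
In case someone wants to regenerate that log, here is a minimal sketch of how such a capture could be done. TF_CPP_MAX_VLOG_LEVEL is the variable named above; the with_xla_vlog helper and the log file name are hypothetical, and both stdout and stderr are redirected so the sketch does not depend on which stream XLA logs to.

```shell
# Sketch: run a command with XLA verbose logging enabled and save its output.
# Usage: with_xla_vlog LEVEL LOGFILE COMMAND [ARGS...]
# with_xla_vlog is an illustrative helper name, not an existing tool.
with_xla_vlog() {
    level=$1
    logfile=$2
    shift 2
    # The variable is set only for this one command; both streams go to the file.
    TF_CPP_MAX_VLOG_LEVEL=$level "$@" > "$logfile" 2>&1
}
```

For example, with_xla_vlog 3 xla-vlog.txt julia --color=yes --project=test test/runtests.jl would write the verbose output of the test run to xla-vlog.txt.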

@giordano
Member Author

giordano commented Feb 2, 2025

> just for fun, does EnzymeAD/Enzyme-JAX#306 fix it?

Sadly no.
