Reliable compiler deadlock in odin build and odin check #4615

Open
gfaster opened this issue Dec 22, 2024 · 2 comments

@gfaster
Contributor

gfaster commented Dec 22, 2024

Context


	Odin:    dev-2024-12
	OS:      NixOS 25.05 (Warbler), Linux 6.11.11
	CPU:     Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz
	RAM:     32025 MiB
	Backend: LLVM 18.1.8

I also reproduced the error with a debug build at commit 597fba7.

The following code (somewhat) reliably deadlocks the compiler:

package main

Struct1 :: struct { }
Struct2 :: struct { }

StructArr :: struct($C: typeid) {
    field: [dynamic]struct { c: C }
}

arr_for :: proc($T: typeid) -> ^StructArr(T) { }

iterate :: proc($C: typeid) {
    arr := arr_for(C).field
    for c in arr { }
}

iterate_all :: proc() {
    iterate(Struct1)
    iterate(Struct2)
}

arr_init :: proc() {
    clear(&arr_for(Struct1).field)
    clear(&arr_for(Struct2).field)
}

main :: proc() { }

Failure Information (for bugs)

Running odin check . on the above resulted in 71 deadlocks over 1000 runs. Duplicating the above code (adding Struct3 through Struct7, along with the corresponding clear and iterate calls) seemed to make it fail more often, with 116 deadlocks out of 1000 trials.

I don't know if it fails at the same spot every time, but I see thread 1 blocked while trying to lock a mutex:

(gdb) thr 1
[Switching to thread 1 (Thread 0x7ffff7e6d180 (LWP 98711))]
#0  0x00007fffeef1513d in syscall () from /nix/store/wn7v2vhyyyi6clcyn0s9ixvl7d4d87ic-glibc-2.40-36/lib/libc.so.6
(gdb) bt
#0  0x00007fffeef1513d in syscall () from /nix/store/wn7v2vhyyyi6clcyn0s9ixvl7d4d87ic-glibc-2.40-36/lib/libc.so.6
#1  0x00005555555acd2d in futex_wait (addr=addr@entry=0x7fffe7b327f0, val=2) at src/threading.cpp:698
#2  0x00005555555acccd in mutex_lock_slow (m=m@entry=0x7fffe7b327f0, curr_state=<optimized out>) at src/threading.cpp:355
#3  0x00005555555d5043 in mutex_lock (m=0x7fffe7b327f0) at src/threading.cpp:364
#4  type_set_offsets (t=t@entry=0x7fffe7b32770) at src/types.cpp:3932
#5  0x00005555555d497e in type_align_of_internal (t=t@entry=0x7fffe7b32770, path=<optimized out>, path@entry=0x7fffffff05d0)
    at src/types.cpp:3826
#6  0x00005555555d5ea5 in type_align_of (t=0x7fffe7b32770) at src/types.cpp:3709
#7  0x00005555555946ab in check_parsed_files (c=0x7fffe9ac89b0) at src/checker.cpp:6535
#8  0x000055555557d356 in main (arg_count=<optimized out>, arg_ptr=<optimized out>) at src/main.cpp:3543

Every other thread is blocked here, in the thread pool:

(gdb) thr 2
[Switching to thread 2 (Thread 0x7fffee4676c0 (LWP 98712))]
#0  0x00007fffeef1513d in syscall () from /nix/store/wn7v2vhyyyi6clcyn0s9ixvl7d4d87ic-glibc-2.40-36/lib/libc.so.6
(gdb) bt
#0  0x00007fffeef1513d in syscall () from /nix/store/wn7v2vhyyyi6clcyn0s9ixvl7d4d87ic-glibc-2.40-36/lib/libc.so.6
#1  0x00005555555acd2d in futex_wait (addr=addr@entry=0x555555815414 <global_thread_pool+36>, val=22915) at src/threading.cpp:698
#2  0x00005555555d0465 in thread_pool_thread_proc (thread=<optimized out>) at src/thread_pool.cpp:235
#3  internal_thread_proc (arg=<optimized out>) at src/threading.cpp:564
#4  0x00007fffeee97d02 in start_thread () from /nix/store/wn7v2vhyyyi6clcyn0s9ixvl7d4d87ic-glibc-2.40-36/lib/libc.so.6
#5  0x00007fffeef173ac in __clone3 () from /nix/store/wn7v2vhyyyi6clcyn0s9ixvl7d4d87ic-glibc-2.40-36/lib/libc.so.6

Steps to Reproduce

  1. Run odin check . or odin build . on the above program until it deadlocks. (The program is expected to fail the check, but the deadlock also occurs with well-formed programs.)

@gfaster
Contributor Author

gfaster commented Dec 23, 2024

After instrumenting BlockingMutex, it looks like a lock was copied while in either a locked or waiting state: the lock that blocks is the very first occurrence of a mutex at that address.
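
To illustrate the hypothesis above, here is a minimal sketch of how copying a mutex by value while it is held would produce this kind of symptom. SketchMutex and Record are hypothetical stand-ins, not the compiler's actual BlockingMutex or type definitions, and the 0/1 state encoding is an assumption:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a futex-style mutex: 0 = unlocked, 1 = locked (assumed encoding).
struct SketchMutex {
    std::atomic<int32_t> state{0};

    SketchMutex() = default;
    // Copying snapshots the state word, including "locked".
    SketchMutex(SketchMutex const &other)
        : state(other.state.load(std::memory_order_acquire)) {}

    bool try_lock() {
        int32_t expected = 0;
        return state.compare_exchange_strong(expected, 1, std::memory_order_acquire);
    }
    void unlock() { state.store(0, std::memory_order_release); }
};

// Hypothetical record that embeds a mutex by value, so copying the record copies the mutex.
struct Record {
    SketchMutex mutex;
    int payload = 0;
};

// Returns true if a copy made while the original is locked can never be acquired.
bool copy_of_locked_is_stuck() {
    Record original;
    bool got = original.mutex.try_lock();
    assert(got);
    (void)got;

    Record copy = original;   // the copy inherits state == locked
    original.mutex.unlock();  // this only unlocks the original

    // Nobody owns the copy's lock, so nobody will ever unlock it.
    return !copy.mutex.try_lock();
}
```

In this sketch a blocking lock on copy.mutex would wait forever, and the copy's address would never have appeared in an earlier lock call, which matches the "very first occurrence at that address" observation.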

@cg-jl

cg-jl commented Dec 23, 2024

Reproduced (with exactly 2 structs) on all three configs (debug, release-native and release builds).

The minimal thread count that reproduces it is -thread-count=2, which gives 3 threads in total. One blocks in type_set_offsets; the other two are waiting for a task.

The mutexes are copied correctly.

I instrumented BlockingMutex with a logging copy constructor like this:

    BlockingMutex(i32 state) : state_(state) {
        printf("%s %p\n", names[state], this);
    }
    constexpr BlockingMutex() : state_{} {}
    BlockingMutex(BlockingMutex const &other)
        : state_(other.state().load(std::memory_order_acquire)) {
        printf("%s %p %p\n", names[state_], this, &other);
    }

I built with and without CXXFLAGS=-fno-elide-constructors, and both builds deadlock.
The printf may interfere with timing, but the deadlock still happens in the same place.
All of the prints say unlocked, which means that every place that copies or initializes a mutex does so in its unlocked state.
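
For reference, a self-contained version of this kind of instrumentation might look like the sketch below. LoggedMutex, kStateNames, and suspicious_copies are hypothetical names; the names table and its 0/1/2 encoding are assumptions, since the snippet above does not show them:

```cpp
#include <atomic>
#include <cstdint>
#include <cstdio>

// Assumed state table; the thread does not show how `names` is defined.
static char const *const kStateNames[] = {"unlocked", "locked", "waiting"};

// Counts copies made from a mutex that was not in the unlocked state.
inline std::atomic<int> &suspicious_copies() {
    static std::atomic<int> count{0};
    return count;
}

// Hypothetical stand-in for BlockingMutex, reduced to the state word
// and a copy constructor that logs the copied state.
class LoggedMutex {
public:
    LoggedMutex() : state_(0) {}
    explicit LoggedMutex(int32_t state) : state_(state) {}

    LoggedMutex(LoggedMutex const &other)
        : state_(other.state_.load(std::memory_order_acquire)) {
        int32_t s = state_.load(std::memory_order_relaxed);
        if (s != 0) {
            ++suspicious_copies();  // copied while locked or waiting
        }
        char const *name = (s >= 0 && s <= 2) ? kStateNames[s] : "invalid";
        std::printf("%s %p %p\n", name,
                    static_cast<void *>(this),
                    static_cast<void const *>(&other));
    }

    int32_t state() const { return state_.load(std::memory_order_acquire); }

private:
    std::atomic<int32_t> state_;
};
```

A counter like suspicious_copies makes the result checkable without reading the log: if it stays at zero across a deadlocking run, no mutex was copied in a locked or waiting state.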

Context of repro

odin report

	Odin:    dev-2024-12:ad99d20d2
	OS:      EndeavourOS, Linux 6.12.6-arch1-1
	CPU:     AMD Ryzen 7 5825U with Radeon Graphics         
	RAM:     14803 MiB
	Backend: LLVM 18.1.8

I'm using a chain of scripts to launch the binary repeatedly. I initially ran multiple processes in parallel, but with multithreading in each process that just added noise, so now I run the binary again and again until it deadlocks. On timeout I send the IOT signal (SIGABRT) to make the binary generate a core dump.

# run_sequential.bash <logs directory>
# Re-run the check until one run fails to finish within 5 seconds.
mkdir -p "$1"
i=0
while true; do
	(( i += 1 ))
	echo -n .
	# On timeout, send IOT so the hung compiler dumps core.
	if ! timeout -s IOT 5 bash run_one.bash "$1"; then
		echo "$1 :: $i"
		break
	fi
done

# run_one.bash <logs directory>
# NOTE: `rr record` does not work, it's too slow (apparently) to find this deadlock.

~/Odin/odin check . 2> "$1/err.txt" 1> "$1/log.txt"
exit 0

These points may be of use, or may just be noise:

  • Did not reproduce (on either build configuration) with >=18 structs.
  • Did not reproduce on release when pinning the execution to a single core (via taskset)
  • Tried using rr record when running the binary, does not seem to deadlock in ~20s of trying.
