Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

apps/golang-example hangs after printing Go version #850

Closed
myechuri opened this issue Feb 2, 2017 · 11 comments
Closed

apps/golang-example hangs after printing Go version #850

myechuri opened this issue Feb 2, 2017 · 11 comments

Comments

@myechuri
Copy link
Contributor

myechuri commented Feb 2, 2017

Attempt at running apps/golang-example from master branch on main repo failed with signal 11:

# scripts/build mode=debug image=golang-example

# scripts/run.py -d -c1 -e "/hello"
OSv v0.24-295-gf6323f6
eth0: 192.168.122.15
syscall(): unimplemented system call 200
syscall(): unimplemented system call 231
trying to execute null pointer
[backtrace]
0x000000000022d644 <abort(char const*, ...)+270>
0x0000000000489ba5 <page_fault+166>
0x0000000000488a96 <???+4754070>
0x0000000000489a1a <???+4758042>
0x0000100000c852fb <???+13128443>
0x0000100000ca04e6 <???+13239526>
0x00000000006a6feb <???+6975467>
0x000000000044cddb <std::function<void ()>::operator()() const+49>
0x00000000005c1c47 <sched::thread::main()+27>
0x00000000005bdb9b <thread_main_c+38>
0x0000000000489a12 <???+4758034>

GDB console:

(gdb) bt
#0  processor::cli_hlt () at arch/x64/processor.hh:248
#1  0x0000000000209f9c in arch::halt_no_interrupts () at arch/x64/arch.hh:48
#2  0x0000000000499b10 in osv::halt () at arch/x64/power.cc:24
#3  0x000000000022d66b in abort (fmt=0xaa9ed0 "trying to execute null pointer") at runtime.cc:130
#4  0x0000000000489ba6 in page_fault (ef=0xffff800003160068) at arch/x64/mmu.cc:29
#5  <signal handler called>
#6  0x0000000000000000 in ?? ()
#7  0x000000000047e4ce in call_signal_handler (frame=0x2000001ffc80) at arch/x64/signal.cc:77
#8  <signal handler called>
#9  0x0000100000c74d29 in runtime.sysargs (argc=16076528, argv=0x0) at /usr/local/go/src/runtime/os_linux.go:192
#10 0x0000100000c852fc in runtime.args (c=16076528, v=0x0) at /usr/local/go/src/runtime/runtime1.go:64
#11 0x0000100000ca04e7 in runtime.rt0_go () at /usr/local/go/src/runtime/asm_amd64.s:143
#12 0x0000004100f54ef0 in ?? ()
#13 0x0000000000000000 in ?? ()

Root cause seems to be this golang issue. Trying out this fix from go branch on @benoit-canet 's fork:

root@ubuntu:~/benoit-canet-osv/osv# scripts/run.py -c1 -e "/hello"
OSv v0.23-265-gffd88ce
eth0: 192.168.122.15
Failed looking up main. Powering off.
root@ubuntu:~/benoit-canet-osv/osv# 

@benoit-canet 's go branch uses osv-apps at commit cc25ca6, which is one commit behind this commit which added GoMain support. Will submit patch for updating apps module on this fork.

After pulling golang-example/hello.go from the GoMain commit, the app launched fine, but hangs after printing Go version.

# scripts/run.py -c1 -vd
OSv v0.23-265-gffd88ce
eth0: 192.168.122.15
sigaltstack() stubbed
Hello, 世界
Go version: go1.7.3

GDB output does not show anything out of ordinary:

(gdb) info threads
  Id   Target Id         Frame 
* 1    Thread 1 (CPU#0 [running]) sched::cpu_set::operator bool (this=0xffff800001afe7e0) at include/osv/sched.hh:105
(gdb) bt
#0  sched::cpu_set::operator bool (this=0xffff800001afe7e0) at include/osv/sched.hh:105
#1  0x00000000005be7ca in sched::cpu::handle_incoming_wakeups (this=0xffff800001afb040) at core/sched.cc:432
#2  0x00000000005be652 in sched::cpu::do_idle (this=0xffff800001afb040) at core/sched.cc:392
#3  0x00000000005be787 in sched::cpu::idle (this=0xffff800001afb040) at core/sched.cc:423
#4  0x00000000005bdc03 in sched::cpu::<lambda()>::operator()(void) const (__closure=0xffff800001fda070) at core/sched.cc:165
#5  0x00000000005c6a72 in std::_Function_handler<void(), sched::cpu::init_idle_thread()::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/5/functional:1871
#6  0x000000000044e8a8 in std::function<void ()>::operator()() const (this=0xffff800001fda070)
    at /usr/include/c++/5/functional:2267
#7  0x00000000005c17cc in sched::thread::main (this=0xffff800001fda040) at core/sched.cc:1171
#8  0x00000000005bd9ae in sched::thread_main_c (t=0xffff800001fda040) at arch/x64/arch-switch.hh:164
#9  0x000000000048b5b3 in thread_main () at arch/x64/entry.S:113
(gdb) 

Filing the issue to track the hang because i expected GoMain's return to terminate the application OSv instance (similar to how native-example behaves). @benoit-canet , can you please correct if i am wrong?

Also, is this branch the current top of tree for Golang work? Is there a specific reason for keeping this work out of master on main repo? Thanks!

@benoit-canet
Copy link
Contributor

benoit-canet commented Feb 2, 2017 via email

@myechuri
Copy link
Contributor Author

myechuri commented Feb 2, 2017

Thanks for the quick response, @benoit-canet !

Perhaps that unloading the go shared library after go main return would help clean up a lot of stuff the runtime init put in place and hence help osv exit cleanly.

Thanks for the pointer. Will try unloading go shared library after go main and update the issue.

I am working on others topics which predate go on OSv's priority for
ScyllaDB. if you feel like improving this go branch and upstreaming it and
filling more missing syscalls you will get some reviews.

Thanks for explaining the reason for existence of the fork. i will be able to work on improving the go branch starting mid February - thanks in advance for reviews!

@benoit-canet
Copy link
Contributor

benoit-canet commented Feb 6, 2017 via email

@nyh
Copy link
Contributor

nyh commented Feb 6, 2017 via email

@myechuri
Copy link
Contributor Author

myechuri commented Mar 8, 2017

Thanks, @benoit-canet @nyh ! It does look like the hang is most likely related to redzone issue 808. We have an extra OS thread created for Golang's runtime.goexit goroutine (by oneNewExtraM function at runtime/proc.go ):

runtime.Stack report:
 goroutine 17 [running, locked to thread]:
main.GoMain()
	/root/osv/apps/golang-example/hello.go:25 +0x4c4
main._cgoexpwrap_8b25087e9d89_GoMain()
	command-line-arguments/_obj/_cgo_gotypes.go:45 +0x16

goroutine 18 [syscall, locked to thread]:
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:2086 +0x1

After /hello goroutine and corresponding OS thread are terminated, runtime.goexit goroutine got scheduled to run in a different thread:

(gdb) osv info threads
...
 199 (0xffff800003707040) /hello          cpu0 status::terminated ?? at arch/x64/entry.S:113 vruntime  7.33126e-19
 200 (0xffff80000370d040) >>>init         cpu0 status::waiting waitqueue::wait(lockfree::mutex&) at core/waitqueue.cc:56 vruntime  3.22551e-19
 201 (0xffff800003a06040) >>>>init        cpu0 status::running sched_yield() at libc/libc.cc:107 vruntime -7.75072e-20
 202 (0xffff800003a1e040) >>>>init        cpu0 status::waiting waitqueue::wait(lockfree::mutex&) at core/waitqueue.cc:56 vruntime  4.73809e-21

runtime.goexit goroutine and its corresponding OSv thread called osyield syscall which led to stack corruption:

(gdb) bt
#0  sched_yield () at libc/libc.cc:108
#1  0x000000000059cc6b in syscall (number=24) at linux.cc:358
#2  0x000000000059d93a in syscall_wrapper (number=24) at linux.cc:381
#3  0x0000000000481453 in syscall_entry () at arch/x64/entry.S:253
#4  0x00001000014b51c7 in runtime.osyield () at /usr/local/go/src/runtime/sys_linux_amd64.s:478
#5  0x000010000146b035 in runtime.lock (l=0x100001797bd0 <runtime.sched+16>) at /usr/local/go/src/runtime/lock_futex.go:92
#6  0x000010000149232e in runtime.sysmon () at /usr/local/go/src/runtime/proc.go:3619
#7  0x000010000148b632 in runtime.mstart1 () at /usr/local/go/src/runtime/proc.go:1126
#8  0x000010000148b506 in runtime.mstart () at /usr/local/go/src/runtime/proc.go:1096
#9  0x00001000014ef213 in crosscall_amd64 () at /usr/local/go/src/runtime/cgo/gcc_amd64.S:35
#10 0x0000000000000003 in ?? ()
#11 0x0000000000000030 in ?? ()
#12 0x00000000000000f3 in ?? ()
#13 0xffffa0000377d540 in ?? ()
#14 0x000000c4200009c0 in ?? ()
#15 0x000010000148b4a0 in ?? () at /usr/local/go/src/runtime/proc.go:1014
#16 0x00001000014eef1b in threadentry (v=<optimized out>) at /usr/local/go/src/runtime/cgo/gcc_linux_amd64.c:101
#17 0x0000000000693143 in pthread_private::pthread::<lambda()>::operator()(void) const (__closure=0xffffa0000326fc00)
    at libc/pthread.cc:115
#18 0x0000000000695c3c in std::_Function_handler<void(), pthread_private::pthread::pthread(void* (*)(void*), void*, sigset_t, const pthread_private::thread_attr*)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/6/functional:1740
#19 0x000000000044486e in std::function<void ()>::operator()() const (this=0xffff800003aa7070) at /usr/include/c++/6/functional:2136
#20 0x00000000005b72a4 in sched::thread::main (this=0xffff800003aa7040) at core/sched.cc:1171
#21 0x00000000005b3412 in sched::thread_main_c (t=0xffff800003aa7040) at arch/x64/arch-switch.hh:164
#22 0x00000000004813d3 in thread_main () at arch/x64/entry.S:113

(gdb) n
syscall (number=24) at linux.cc:329
329	    SCOPE_LOCK(fpu);
...
(gdb) n
syscall_entry () at arch/x64/entry.S:287
287	    lea 128(%rsp), %rsp
(gdb) n
syscall_entry () at arch/x64/entry.S:291
291	    jmpq *%rcx
(gdb) bt
#0  syscall_entry () at arch/x64/entry.S:291
#1  0x0000100000000001 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) n
0x00001000014b4b77 in ?? ()
(gdb) n
Cannot find bounds of current function

rcx holding rip is messed up after returning from sycall. Will try @nyh 's suggestion in second half of second comment at 808 and update the issue with findings.

@HawxChen
Copy link
Contributor

HawxChen commented Jul 6, 2017

Hi @benoit-canet and @nyh , even though Go routine's stacks are very small originally, according to Golang's official document, Go runtime should manage stacks well generally. I read the solution, #808, provided by nyh and observed related stack allocation of corresponding threads in OSv. Moreover, Go Runtime debugging options were also used in OSv but it still has the same result: OSv just hangs .

./scripts/run.py -c1 -vd -e"--env=GODEBUG=sbrk=0,allocfreetrace=1,gcshrinkstackoff=1,gcstackbarrieroff=0,gcstackbarrierall=1,gcrescanstacks=1,gctrace=2,scheddetail=4000,schedtrace=4000 /hello"

It seems that there is conflicting stack management between Go runtime and OSv. Can anyone kindly give me more hints or guidance on how to figure out its compatibilities and root causes? That way I'd be able to understand the issue better before tackling it with confidence.

@HawxChen
Copy link
Contributor

HawxChen commented Jul 25, 2017

Hi @nyh, @benoit-canet, and @myechuri, here are the newest findings, a work-around patch, and an environment to reproduce this issue with the work-around solution based on the June 2017 codebase.

Can anyone kindly guide me how to proceed:

It seems that this golang-example app hanged because _terminated was not assigned to be TRUE through application_runtime's destructor. Application_runtime should have been destructed normally. It is clear that other threads still reference it. According to the following log, thread 199 executed GoMain() and exited successfully; however, golang-example app's other threads(200, 201, and 202) were still executing . In short, these threads still referenced the same application_runtime instance so that the golang-example app hanged.

198 (0xffff800003017040) dhcp cpu0 status::waiting dhcp::dhcp_worker::dhcp_worker_fn() at core/dhcp.cc:830 vruntime 9.15069e-21
200 (0xffff80000356f040) >>>init cpu0 status::waiting waitqueue::wait(lockfree::mutex&) at core/waitqueue.cc:56 vruntime 8.03134e-20
201 (0xffff800003675040) >>>>init cpu0 status::waiting futex(int*, int, int, timespec const*, int*, int) at linux.cc:81 vruntime 1.68077e-20
timers: 60.628110678
202 (0xffff80000387b040) >>>>init cpu0 status::waiting waitqueue::wait(lockfree::mutex&) at core/waitqueue.cc:56 vruntime 1.35432e-20
`

I could make the Golang-example execute and exit successfully through this work-around patch with side effects because of inappropriate _terminated assignment, under a limited condition (only 1 core)

Finally, here is my development branch to reproduce this issue based on the June 2017 codebase.
Steps:
0. The initialization steps have no difference against official OSv.

  1. ./go_build.sh
  2. ./go_run.sh

Thank you in advance, I really need some inspiration.

@nyh
Copy link
Contributor

nyh commented Jul 25, 2017

@HawxChen I'm really not familiar with all the details here ( @benoit-canet and @myechuri probably are), but I would like to point out that OSv is deliberately waiting for all the threads started by Go to terminate, not just its main thread, so if Go leaves behind any running threads - either intentionally or because of a bug, OSv will not shut down and wait (forever) for all these threads to exit.

We had a similar issue in Java: Java deliberately leaves behind some threads (such as GC threads and compilation threads) when it exits, so if we want java.so to ever exit when the Java program exit, we need to get rid of these threads. You can see in modules/java-base/java.cc how main() ends with calls to osv::application::unsafe_stop_and_abandon_other_threads() which will forcefully (and not entirely safely...) kill all these threads. Alternatively you can call exit() to force a shutdown regardless of what else is running (although that is ugly, in case other stuff is running on the same VM as well).

A separate question is why these extra threads are left behind. Are they threads that were deliberately left behind? Or threads left behind by some bug? My guess is that they we deliberately (or just by negligence) left behind, but I think @myechuri suggested above that there is some stack-related bug that caused these threads to hang.

@HawxChen
Copy link
Contributor

Hi @nyh , thank you for your response.

I want to share other findings through doubling stacks' sizes and Golang's debugging options.

According to your guess and the following #808 quote, in one of my experimental branches , I doubled all stacks used in exceptions, interrupts, threads' function call, and even booting; moreover, their sizes were obviously larger than Go routines' default size.

Some applications have tiny stacks. Notably, the Go language (see #522) uses a separate tiny stack for each "go-routine", and only grows this stack when needed. But at the point a SYSCALL instruction is used, the remaining available stack can be small, and it will not further grow during the SYSCALL call. So the OSv code can overflow the application's stack and potentially write to unrelated objects in memory.

In the meantime, I also forced Golang to use traditional sbrk() instead of their fancy stack allocation through debugging options:

./scripts/run.py -c1 -vd -e"--env=GODEBUG=sbrk=1,allocfreetrace=1,gcshrinkstackoff=1,gcstackbarrieroff=0,gcstackbarrierall=1,gcrescanstacks=1,gctrace=2,scheddetail=4000,schedtrace=4000 /hello"

However, It did not work.

I welcome any idea, discussion, and advice. Thank you in advance for your guidance.

@nyh
Copy link
Contributor

nyh commented Jul 27, 2017

@HawxChen are we still discussing the specific original problem mentioned in the top of this issue, which is that the "hello world" program works but never exits, or something else?

The stack issue discussed in #808 is a real problem, and @benoit-canet saw it causing crashes in long runs of http server with high load (for example), and he saw that increasing the stack size worked around that problem. HOWEVER, this doesn't mean that every problem we have with Go is caused by this stack problem! @myechuri suspected that some of the Go threads never exit because of a stack overflow which I didn't fully understand (why don't we see any crash or something coming from this? why a "hang"?), but maybe there's another reason why they never exit?

golang/go#11100 suggests that certain Go features like signals create new OS threads and there is no way to ask for them to be shut down. As I said above we had exactly the same problem with Java, and had to add a osv::application::unsafe_stop_and_abandon_other_threads() call to the Java main to ensure all the linger threads are forcefully killed when the main user thread is done.

@HawxChen
Copy link
Contributor

HawxChen commented Jul 28, 2017

Hi @nyh, Thank you for your detailed reply.

Yes, I am still discussing the same program.

For now, I have no advanced thoughts on this issue so I will keep this issue in my mind. Hopefully, in the near future, I will be more familiar with OSv in order to solve this issue successfully. I decided to implement #808 because it is not only interesting but also helpful for practicing OSv.

I appreciate your discussion.

@nyh nyh closed this as completed in 4380083 Apr 24, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants