Segfault in threading: complex numbers #13380

Closed
ranjanan opened this issue Sep 30, 2015 · 18 comments
Labels: bug, multithreading

Comments

@ranjanan
Contributor

Segfaults seem to occur for any kind of operation on complex numbers inside a threaded loop.
I ran the following code:

using Base.Threads
z = 1 + 2im
@threads all for i = 1:10
z + 2
end

This is the stack trace:

#0  0x00007ffdf016402b in julia_+_21471 (
    z=<error reading variable: DWARF-2 expression error: DW_OP_reg operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>, 
    x=<error reading variable: DWARF-2 expression error: DW_OP_reg operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at complex.jl:123
#1  0x00007ffff62c519b in jl_apply (nargs=2, args=0x7ffdf135cca0, f=<optimized out>) at julia.h:1337
#2  jl_apply_unspecialized (meth=<optimized out>, meth=<optimized out>, nargs=2, args=0x7ffdf135cca0) at gf.c:29
#3  jl_apply_generic (F=0x7ffdf2117410, args=0x7ffdf135cca0, nargs=2) at gf.c:1672
#4  0x00007ffdf0166130 in julia_#1###_threadsfor#6594_21470 () at threadingconstructs.jl:2
#5  0x00007ffdf016618d in jlcall_#1###_threadsfor#6594_21470 ()
#6  0x00007ffff633b15a in jl_apply (nargs=<optimized out>, args=0x7ffdf1be8018, f=0x7ffdf479db50) at julia.h:1337
#7  ti_run_fun (f=0x7ffdf479db50, args=0x7ffdf1be8010) at threading.c:149
#8  0x00007ffff633b416 in ti_threadfun (arg=0x6a8b00) at threading.c:202
#9  0x00007ffff5f1a182 in start_thread (arg=0x7ffdf135d700) at pthread_create.c:312
#10 0x00007ffff5c4747d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

The stack traces for the other operators (-, *, /, ^) were all very similar; only the julia_+_21471 frame changes depending on the operator.

@pao added the multithreading and bug labels Sep 30, 2015
@ranjanan
Contributor Author

ranjanan commented Oct 1, 2015

I guess this is similar to issue #13255. Probably the same? At least the stack trace looks the same.

@ranjanan
Contributor Author

Here is a gist with instructions to reproduce these segfaults: https://gist.github.com/ranjanan/dc8f5912a1080415ff6b

@JeffBezanson
Member

Update: I can reliably reproduce the crash in graph500. From looking at stack traces from several crashes, it seems the GC barrier is leaking somehow. I have seen thread 0 collecting without all other threads waiting in the barrier. A possible reason is that some threads are finishing their work function before hitting GC, when thread 0 hits jl_gc_collect. Still looking into it with @vtjnash.

@yuyichao
Contributor

yuyichao commented Dec 3, 2015

@JeffBezanson Does #14190 make any difference?

@yuyichao
Contributor

yuyichao commented Dec 3, 2015

With #14190, all the threads other than the one (edited:) running the GC are waiting in jl_wait_for_gc correctly. I can still get a segfault sometimes, and it looks like it's hitting the FIXME in gc_mark_task_stack, since it can't handle stack addresses from other tasks correctly.

@yuyichao
Contributor

yuyichao commented Dec 3, 2015

The following patch fixes the segfault for me on top of #14190. It now deadlocks, since one thread is waiting on a pthread lock held by a paused thread while the GC is waiting for it (edit: it was actually the master thread waiting for the other ones to complete in managed mode without a safepoint, fixed in #14190). This is the problem I'm trying to address in #14190, and it should be fixed when the codegen part is done.

diff --git a/src/gc.c b/src/gc.c                                                  
index 2c486b3..db2c166 100644                                                     
--- a/src/gc.c                                                                    
+++ b/src/gc.c                                                                    
@@ -1774,20 +1774,22 @@ static void gc_mark_task_stack(jl_task_t *ta, int d)      
 {                                                                                
     int stkbuf = (ta->stkbuf != (void*)(intptr_t)-1 && ta->stkbuf != NULL);      
     // FIXME - we need to mark stacks on other threads                           
-    int curtask = (ta == jl_all_task_states[0].ptls->current_task);              
+    int tid = ta->tid;                                                           
+    jl_tls_states_t *ptls = jl_all_task_states[tid].ptls;                        
+    int curtask = (ta == ptls->current_task);                                    
     if (stkbuf) {                                                                
 #ifndef COPY_STACKS                                                              
-        if (ta != jl_root_task) // stkbuf isn't owned by julia for the root task 
+        if (ta != ptls->root_task) // stkbuf isn't owned by julia for the root task                                                                                
 #endif                                                                           
         gc_setmark_buf(ta->stkbuf, gc_bits(jl_astaggedvalue(ta)));               
     }                                                                            
     if (curtask) {                                                               
-        gc_mark_stack((jl_value_t*)ta, *jl_all_pgcstacks[0], 0, d);              
+        gc_mark_stack((jl_value_t*)ta, *jl_all_pgcstacks[tid], 0, d);            
     }                                                                            
     else if (stkbuf) {                                                           
         ptrint_t offset;
 #ifdef COPY_STACKS
-        offset = (char *)ta->stkbuf - ((char *)jl_stackbase - ta->ssize);
+        offset = (char *)ta->stkbuf - ((char *)ptls->stackbase - ta->ssize);
 #else
         offset = 0;
 #endif
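
The core of the patch above is replacing the hard-coded thread 0 lookup with one indexed by the task's own thread ID. A minimal standalone sketch of that idea follows; the types and function names (`task_t`, `tls_state_t`, `is_current_task_thread0`, `is_current_task_by_tid`, `check_sketch`) are simplified, hypothetical stand-ins for illustration and are not Julia's actual runtime structures.

```c
#include <assert.h>

/* Hypothetical, simplified stand-ins for the runtime's task and
   per-thread state; the real structures carry far more fields. */
typedef struct task {
    int tid;                /* ID of the thread that owns this task */
} task_t;

typedef struct tls_state {
    task_t *current_task;   /* task currently running on this thread */
} tls_state_t;

/* Pre-patch lookup: always consults thread 0, whichever thread owns the task. */
int is_current_task_thread0(const tls_state_t *states, const task_t *ta)
{
    return ta == states[0].current_task;
}

/* Post-patch lookup: consults the state of the thread that owns the task. */
int is_current_task_by_tid(const tls_state_t *states, const task_t *ta)
{
    return ta == states[ta->tid].current_task;
}

/* Two threads, each running its own task: only the tid-indexed lookup
   recognises thread 1's task as a current task. */
void check_sketch(void)
{
    task_t t0 = { 0 }, t1 = { 1 };
    tls_state_t states[2] = { { &t0 }, { &t1 } };

    assert(is_current_task_thread0(states, &t0));
    assert(!is_current_task_thread0(states, &t1)); /* misclassified */
    assert(is_current_task_by_tid(states, &t0));
    assert(is_current_task_by_tid(states, &t1));
}
```

In the patch, a task misclassified this way would skip the `if (curtask)` branch and, if it has a stack buffer, fall through to the `else if (stkbuf)` branch instead of having its live GC stack marked, which is presumably where the bad stack addresses came from.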

@yuyichao
Contributor

yuyichao commented Dec 6, 2015

With the latest commit in #14190 I can run the graph500 example many times without a segfault or deadlock now. =)

@yuyichao
Contributor

yuyichao commented Dec 6, 2015

@ranjanan All the tests in your gists pass on #14190 now. The last one is failing because of a bug in the test: ranjanan/MT-Workloads#3.

@ranjanan
Contributor Author

ranjanan commented Dec 6, 2015

@yuyichao Thanks for pointing that out. I have fixed that issue.

@ranjanan
Contributor Author

ranjanan commented Dec 7, 2015

@yuyichao I have been getting a segfault on the ALS.jl gist, even with your fixes. Is that passing for you?
Also, I increased the number of threads to 16 and the graph500 example segfaulted for me:

signal (11): Segmentation fault
while loading no file, in expression starting on line 155
unknown function (ip: 0x7f5de1aaa446)
unknown function (ip: 0x7f5de1aac722)
unknown function (ip: 0x7f5de1aadc3e)
jl_gc_collect at /home/ranjan/julia-threading/usr/bin/../lib/libjulia.so (unknown line)
jl_gc_managed_malloc at /home/ranjan/julia-threading/usr/bin/../lib/libjulia.so (unknown line)
jl_alloc_array_1d at /home/ranjan/julia-threading/usr/bin/../lib/libjulia.so (unknown line)
call at ./essentials.jl:204
bfs at /home/ranjan/MT-Workloads/Graph500/thread/bfs.jl:12
#482###_threadsfor#7338 at /home/ranjan/MT-Workloads/Graph500/thread/graph500.jl:50
unknown function (ip: 0x7f5de1aa7afa)
unknown function (ip: 0x7f5de1aa7d7e)
unknown function (ip: 0x7f5de1b1de97)
unknown function (ip: 0x7f5de1775182)
clone at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: (nil))
Segmentation fault (core dumped)

@yuyichao
Contributor

yuyichao commented Dec 7, 2015

I can also reproduce the segfault sometimes now. It seems that someone wrote a NULL pointer to the binding remset. Will check later.

@yuyichao
Contributor

yuyichao commented Dec 7, 2015

The issue I saw above should be fixed by #14307, though I'm not sure if it's the only issue.

There are also a few other finalizer-related fixes for #14190 that I haven't committed or finished testing yet (moving their execution outside the GC and allowing GC in them).

@yuyichao
Contributor

yuyichao commented Dec 7, 2015

@ranjanan With #14060 merged, your tests don't work anymore.

@JeffBezanson
Member

Should be straightforward to update them.

@ranjanan
Contributor Author

ranjanan commented Dec 8, 2015

@yuyichao @JeffBezanson My gists are now updated. I tried to do a git pull on this branch and found that an automatic merge wasn't possible because of a conflict in test/threads.jl. The conflicting changes seem to be extra test cases. Can I just stash them and pull your branch to test it further?

@yuyichao
Contributor

yuyichao commented Dec 8, 2015

I've rebased my branch on current master, and it also includes all the related fixes I've committed (#14301). You can simply reset your local branch unless you have some other fixes.

@ranjanan
Contributor Author

ranjanan commented Dec 8, 2015

@yuyichao Thanks, all the gists seem to run now without segfaults.
