Segfault in threading: complex numbers #13380
Comments
I guess this is similar to issue #13255. Probably the same? At least the stack trace looks the same.
Here is a gist with instructions to reproduce these segfaults: https://gist.github.com/ranjanan/dc8f5912a1080415ff6b
Update: I can reliably reproduce the crash in graph500. From looking at stack traces from several crashes, it seems the GC barrier is leaking somehow. I have seen thread 0 collecting without all other threads waiting in the barrier. A possible reason is that some threads are finishing their work function before hitting GC, when thread 0 hits jl_gc_collect. Still looking into it with @vtjnash.
@JeffBezanson Does #14190 make any difference?
With #14190, all the threads other than the one (edited:) running the GC are waiting in
The following patch fixes the segfault for me on top of #14190. It now deadlocks since

diff --git a/src/gc.c b/src/gc.c
index 2c486b3..db2c166 100644
--- a/src/gc.c
+++ b/src/gc.c
@@ -1774,20 +1774,22 @@ static void gc_mark_task_stack(jl_task_t *ta, int d)
 {
     int stkbuf = (ta->stkbuf != (void*)(intptr_t)-1 && ta->stkbuf != NULL);
     // FIXME - we need to mark stacks on other threads
-    int curtask = (ta == jl_all_task_states[0].ptls->current_task);
+    int tid = ta->tid;
+    jl_tls_states_t *ptls = jl_all_task_states[tid].ptls;
+    int curtask = (ta == ptls->current_task);
     if (stkbuf) {
 #ifndef COPY_STACKS
-        if (ta != jl_root_task) // stkbuf isn't owned by julia for the root task
+        if (ta != ptls->root_task) // stkbuf isn't owned by julia for the root task
 #endif
         gc_setmark_buf(ta->stkbuf, gc_bits(jl_astaggedvalue(ta)));
     }
     if (curtask) {
-        gc_mark_stack((jl_value_t*)ta, *jl_all_pgcstacks[0], 0, d);
+        gc_mark_stack((jl_value_t*)ta, *jl_all_pgcstacks[tid], 0, d);
     }
     else if (stkbuf) {
         ptrint_t offset;
 #ifdef COPY_STACKS
-        offset = (char *)ta->stkbuf - ((char *)jl_stackbase - ta->ssize);
+        offset = (char *)ta->stkbuf - ((char *)ptls->stackbase - ta->ssize);
 #else
         offset = 0;
 #endif
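For readers skimming the patch, the core change can be illustrated with a minimal sketch in C. This is not the Julia runtime code: the types `task_t` and `tls_states_t`, the array `all_tls`, the constant `NTHREADS`, and both helper functions are hypothetical stand-ins chosen for illustration. The sketch only shows why comparing a task against thread 0's state gives the wrong answer for a task owned by another thread, while looking up the owning thread through the task's `tid` does not.

```c
#include <assert.h>
#include <stddef.h>

#define NTHREADS 2 /* hypothetical thread count for this sketch */

/* Hypothetical, simplified stand-in for a task object. */
typedef struct task {
    int tid; /* index of the thread that owns this task */
} task_t;

/* Hypothetical, simplified stand-in for per-thread state. */
typedef struct tls_states {
    task_t *current_task; /* task currently running on this thread */
    task_t *root_task;    /* root task of this thread */
} tls_states_t;

/* One slot of per-thread state per thread, indexed by thread id. */
static tls_states_t all_tls[NTHREADS];

/* Pre-patch style: always consults thread 0, whichever thread owns ta. */
static int is_current_task_thread0(const task_t *ta)
{
    return ta == all_tls[0].current_task;
}

/* Post-patch style: consults the state of the thread that owns ta. */
static int is_current_task_owner(const task_t *ta)
{
    assert(ta->tid >= 0 && ta->tid < NTHREADS);
    return ta == all_tls[ta->tid].current_task;
}
```

With one task running on each of the two threads, `is_current_task_thread0` reports the task on thread 1 as not current, so a collector following that answer would take the wrong marking path for it. `is_current_task_owner` reports it as current. The same per-thread lookup pattern applies to the root task and stack base in the patch.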
With the latest commit in #14190 I can run the graph500 example many times without a segfault or deadlock now. =)
@ranjanan All the tests in your gists pass on #14190 now. The last one is failing because of a bug in the test (ranjanan/MT-Workloads#3).
@yuyichao Thanks for pointing that out. I have fixed that issue.
@yuyichao I have been getting a segfault on the
I can also reproduce the segfault sometimes now. It seems that someone wrote a NULL pointer to the binding remset. Will check later.
Should be straightforward to update them.
@yuyichao @JeffBezanson My gists are now updated. I tried to do a
I've rebased my branch on current master, and it also includes all the related fixes I've committed (#14301). You can simply reset your local branch unless you have some other fixes.
@yuyichao Thanks, all the gists seem to run now without segfaults.
Segfaults seem to occur for any type of operation on complex numbers.
I ran the following code:
This is the stack trace:
The stack traces for the other operators (-, *, /, ^) were all very similar (julia_+_21471 changes depending on the operator).