Signal 11 error #294

nataneb · 2016-10-23T19:59:05Z

Hello,
I get this error when I try to run my Grappa program:

Graph memory breakdown:
locale_heap_size: 0.133796 GB
global_heap_size: 0.0735908 GB
graph_total_size: 0.207387 GB
Exiting due to signal 11 with siginfo 0x400149f6f270 and payload 0x400149f6f140
srun: error: n25: task 0: Exited with exit code 1

I'm also using your GraphLab implementation, and I'm running the program on the sampa server.
What can I do to solve the problem?
Please, let me know if you need more information.

Thanks

bmyerz · 2016-10-24T22:00:50Z

A backtrace will provide more information.

See https://github.com/uwsampa/grappa/blob/master/doc/debugging.md

nelsonje · 2016-10-24T22:45:18Z

Natalia, since you're running on our cluster you should just send me an email with a pointer to the code that's failing, and I'll take a look when I get a chance.

metolent · 2016-10-27T01:01:23Z

I'm sitting with Natalia looking at the issue. We disassembled her binary to look at the assembly at the location of the segfault and it wasn't clear precisely where it was occurring. We observed a number of calls to Boost library functions before and after the faulting address, but weren't able to track them down as the addresses were not included within the binary. The sizes of the vertices used increased, but not by more than 2KB.

I believe we would have to rebuild Grappa in debug mode to get a backtrace, right? We don't own the cluster, so I believe that would be problematic.

Is there another debug mechanism you could propose that would enable us to glean some insight from the segfault?

nelsonje · 2016-10-27T01:42:46Z

You're running on our cluster, and Natalia has sent me the details on the segfaulting binary, so it's easy for me to take a look at the backtrace myself---I just haven't had a chance yet.

In fact, the backtrace from the non-debug binary is often useful in debugging these sorts of problems (we compile the optimized binary with debugging symbols too), but it's hard to make sense of it without understanding the guts of Grappa. When the backtrace just shows addresses instead of code, it usually means the real problem occurred before the segfault happened: something corrupted a scheduler data structure and screwed up the stack. The backtrace won't be helpful in that case.

I'll get back to you two as soon as I've found a moment to take a look at the code.

nataneb · 2016-10-27T05:24:45Z

We tried some experiments. Since it's a label propagation algorithm, each vertex has an array attached to it. We changed the array's size and found that once it is big enough, we get a segmentation fault. For example, the program works with arrays of size 62 and fails with arrays of size 75 or more. We also tried another program that is similar to this one, and it fails as well. This might give you some idea about the failure.
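
To relate the array sizes in this experiment to bytes per vertex, a minimal C++ sketch follows. It is not Grappa code and does not use any Grappa API: `LabelData`, `payload_bytes`, and `payload_growth` are hypothetical names, and the element type `std::int64_t` is an assumption, since the thread does not say what the array holds.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical per-vertex payload for label propagation: N labels per vertex.
// The element type (std::int64_t) is an assumption; the thread does not state it.
template <std::size_t N>
struct LabelData {
  std::int64_t labels[N];
};

// Bytes occupied by one payload holding N labels.
template <std::size_t N>
constexpr std::size_t payload_bytes() {
  return sizeof(LabelData<N>);
}

// Growth in bytes when moving from Small to Large labels per vertex.
template <std::size_t Small, std::size_t Large>
constexpr std::size_t payload_growth() {
  static_assert(Large >= Small, "Large must not be smaller than Small");
  return payload_bytes<Large>() - payload_bytes<Small>();
}
```

Under these assumptions, `payload_bytes<62>()` is 496 and `payload_bytes<75>()` is 600, so the failing configuration adds 104 bytes per vertex. Printing the real `sizeof` of the vertex data type for the last working and first failing sizes would give the actual figures to report alongside the backtrace.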
