Signal 11 error on multiple machines #276

Open
TimWin opened this issue Mar 28, 2016 · 2 comments

TimWin commented Mar 28, 2016

Hello,
I get this error when trying to run any Grappa program on multiple machines:

mpirun -hostfile my_hosts applications/demos/hello_world.exe
. . .
I0328 12:19:22.515194 101851 Grappa.cpp:647]
Shared memory breakdown:
node total: 125.524 GB
locale shared heap total: 62.7622 GB
locale shared heap per core: 62.7622 GB
communicator per core: 0.125 GB
tasks per core: 0.0156631 GB
global heap per core: 15.6905 GB
aggregator per core: 0.0650177 GB
shared_pool current per core: 4.76837e-07 GB
shared_pool max per core: 15.6905 GB
free per locale: 46.8659 GB
free per core: 46.8659 GB

Exiting due to signal 11 with siginfo 0x4003f5326870 and payload 0x4003f5326740
I0328 12:19:22.534696 101851 hello_world.cpp:45] Hello world from locale 0 core 0

Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.

mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[29000,1],1]
Exit code: 1

I can successfully run the programs, e.g. hello_world, on a single machine, but they always crash with the signal 11 error when I try to run them on multiple machines.

What can I do to solve that problem?
Please let me know if you need any further information.

Thanks in advance

@bmyerz bmyerz closed this as completed Mar 29, 2016
@bmyerz bmyerz reopened this Mar 29, 2016
bmyerz (Member) commented Mar 29, 2016

I think more information is needed, starting with where the signal is raised.
Try building in Debug mode and running with freeze on error (see https://github.com/uwsampa/grappa/blob/master/doc/debugging.md#debugging).

If the process freezes on the signal, ssh into the node that had the signal and attach with gdb -p <pid> (or run attach <pid> from the gdb prompt). You can find the PID of the running Grappa process with something like ps aux | grep grappa. From there you can get a backtrace with bt.
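
As a rough sketch of the attach step, the helper below prints a gdb invocation that attaches to a given PID, prints a backtrace for every thread, and detaches. The function name attach_cmd and the PID 12345 are placeholders for illustration; gdb -p, -batch, -ex, and thread apply all bt are standard gdb.

```shell
# Sketch only: attach_cmd is a hypothetical helper, not part of Grappa.
# It prints the gdb invocation that attaches to a PID, dumps a backtrace
# for every thread, and then detaches.
attach_cmd() {
  printf 'gdb -p %s -batch -ex "thread apply all bt"\n' "$1"
}

# On the node that reported the signal, list candidate PIDs first, e.g.:
#   ps aux | grep hello_world
# then substitute the real PID for the placeholder below.
attach_cmd 12345
```

Run the printed command on the node where the process is frozen. Attaching may require the same user as the process owner, or relaxed ptrace restrictions on that machine.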

If the process doesn't freeze on the signal, you can have mpirun launch the processes through gdb (see the #2 answer to question 6 on https://www.open-mpi.org/faq/?category=debugging).
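
One way this might look without an X display is to run each rank under gdb in batch mode, so that every process prints a backtrace when it stops on a signal. This is a sketch under assumptions: mpirun_gdb_cmd is a hypothetical helper that only prints the command line, and the hostfile and program names are taken from the report above. The --args flag matters here: without it, gdb treats a trailing argument as a core file or process ID instead of passing it to the program.

```shell
# Sketch only: mpirun_gdb_cmd is a hypothetical helper, not part of Grappa or Open MPI.
# $1 is the hostfile; the remaining arguments are the program and its arguments.
mpirun_gdb_cmd() {
  hostfile=$1
  shift
  # -batch exits gdb after the commands finish; "run" starts the program and
  # "bt" prints the stack once the program stops (for example on SIGSEGV).
  echo "mpirun -hostfile $hostfile gdb -batch -ex run -ex bt --args $*"
}

mpirun_gdb_cmd my_hosts applications/demos/hello_world.exe
```

If a rank exits normally, the bt step simply reports that there is no stack, so the extra output is harmless.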

@jeffhammond

Here is a stack trace:

[jrhammon@esgmonster prk-repo]$ mpirun -n 1 gdb GRAPPA/Transpose/transpose 10 3600 32
Excess command line arguments ignored. (3600 ...)
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-90.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/jrhammon/Work/INTEL/PCL/ESG/PRK/github-official/GRAPPA/Transpose/transpose...done.
Attaching to program: /home/jrhammon/Work/INTEL/PCL/ESG/PRK/github-official/GRAPPA/Transpose/transpose, process 10
ptrace: Operation not permitted.
/home/jrhammon/Work/INTEL/PCL/ESG/PRK/github-official/10: No such file or directory.
(gdb) run 10 1000 32
Starting program: /home/jrhammon/Work/INTEL/PCL/ESG/PRK/github-official/GRAPPA/Transpose/transpose 10 1000 32
[Thread debugging using libthread_db enabled]
warning: File "/opt/gcc/5.3.0/lib64/libstdc++.so.6.0.21-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "/usr/share/gdb/auto-load:/usr/lib/debug:/usr/bin/mono-gdb.py".
To enable execution of this file add
    add-auto-load-safe-path /opt/gcc/5.3.0/lib64/libstdc++.so.6.0.21-gdb.py
line to your configuration file "/home/jrhammon/.gdbinit".
To completely disable this security protection add
    set auto-load safe-path /
line to your configuration file "/home/jrhammon/.gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
    info "(gdb)Auto-loading safe path"
I0704 15:54:00.408108 110487 Allocator.hpp:185] Allocator is responsible for addresses from 0 to 0x1f6787000
I0704 15:54:00.408323 110487 GlobalMemory.cpp:67] Initialized GlobalMemory with 8430055424 bytes of shared heap.
I0704 15:54:00.412102 110487 Grappa.cpp:647] 
-------------------------
Shared memory breakdown:
  node total:                   62.8088 GB
  locale shared heap total:     31.4044 GB
  locale shared heap per core:  31.4044 GB
  communicator per core:        0.125 GB
  tasks per core:               0.0156631 GB
  global heap per core:         7.8511 GB
  aggregator per core:          0.00247955 GB
  shared_pool current per core: 4.76837e-07 GB
  shared_pool max per core:     7.8511 GB
  free per locale:              23.4102 GB
  free per core:                23.4102 GB
-------------------------
Parallel Research Kernels version 2.16
Grappa matrix transpose: B = A^T
Parallel Research Kernels version 2.16
Grappa matrix transpose: B = A^T
Number of cores         = 1
Matrix order            = 1000
Number of iterations    = 10
Tile size               = 32
Solution validates
Rate (MB/s): 6500.35 Avg time (s): 0.00246141
Summed errors: 0

Program received signal SIGSEGV, Segmentation fault.
std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::c_str() const () at /tmp/gcc-5.3.0/x86_64-unknown-linux-gnu/libstdc++-v3/include/bits/basic_string.h:1889
    in /tmp/gcc-5.3.0/x86_64-unknown-linux-gnu/libstdc++-v3/include/bits/basic_string.h

The code GDB is trying to point to is:

      // String operations:
      /**
       *  @brief  Return const pointer to null-terminated contents.
       *
       *  This is a handle to internal data.  Do not modify or dire things may
       *  happen.
      */
      const _CharT*
      c_str() const _GLIBCXX_NOEXCEPT
      { return _M_data(); }
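
The output above shows only the faulting frame, so the caller that passed the bad string is still unknown. A possible next step, sketched here under assumptions, is to drive gdb from a small command file that reruns the program and prints the full backtrace plus the locals of the calling frame. The helper name write_gdb_script and the temporary file are illustrative; the gdb commands themselves (run, bt, up, info locals) are standard.

```shell
# Sketch only: write_gdb_script is a hypothetical helper; the file path is arbitrary.
write_gdb_script() {
  cat > "$1" <<'EOF'
run
bt
up
info locals
EOF
}

script=$(mktemp)
write_gdb_script "$script"

# --args makes gdb pass "10 1000 32" to the program instead of treating
# "10" as a process ID to attach to, as happened in the session above.
echo "mpirun -n 1 gdb -batch -x $script --args GRAPPA/Transpose/transpose 10 1000 32"
```

Since the fault happens after "Summed errors: 0" is printed, the backtrace from such a run should indicate whether the crash occurs during shutdown or in application code.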
