Running Hpx Program Using srun/mpiexec Error #6538

Open
phil-skillwon opened this issue Aug 27, 2024 · 4 comments

@phil-skillwon

Earlier, @JiakunYan and I debugged a crash in an HPX-based application that uses the LCI parcelport. That crash has been resolved; see #6526 for details.

PMI plugins available with Slurm:

srun --mpi=list
MPI plugin types are...
        none
        cray_shasta
        pmi2
        pmix
specific pmix plugin versions available: pmix_v3

Here is my test code:

#include <cstdint>
#include <cstdio>
#include <iostream>
#include <string>
#include <vector>
#include "Xlog.h"
#include "Iface.h"
// #include "hpx.hpp"
#include "hpx/hpx_start.hpp"
#include "hpx/version.hpp"
#include "hpx/init.hpp"
#include "hpx/runtime.hpp"
#include "hpx/include/actions.hpp"
#include "hpx/include/lcos.hpp"
#include "hpx/include/async.hpp"
#include "TestHpx.h"

using namespace std;

int hpx_main(int argc, char* argv[])
{
    TRACE("hpx_main function");

    hpx::error_code ec = hpx::make_success_code();
    std::vector<hpx::id_type> localities = hpx::find_all_localities(ec);
    if (hpx::error::success != ec.value()) 
    {
        ERROR("find_all_localities executed failed, %s", ec.get_message().c_str());
        return -1;
    }
    
    if (localities.size() < 2) 
    {
        ERROR("this program requires at least two localities");
        return -2;
    }

    INFO("num of localities: %zu", localities.size());

    for (const auto& loc : localities) 
    {
        hpx::naming::gid_type gid = loc.get_gid();
        std::string address = hpx::get_locality_name(loc).get();
        std::uint32_t localityId = hpx::naming::get_locality_id_from_gid(gid);

        DEBUG("locality id: %d", localityId);
        DEBUG("locality name: %s, id: %08X", address.c_str(), localityId);
    }

    getchar();
    
    return hpx::finalize();
}

int main(int argc, char **argv) 
{
    auto ret = xlogInitFile("conf/LogConf.yaml");
    if (false == ret) 
    {
        cerr << "logger init failed." << endl;
        return -1;
    }

    INFO("hpx demostration running...");

    auto hpxMajor = hpx::major_version();
    auto hpxMinor = hpx::minor_version();
    auto hpxPatch = hpx::subminor_version();
    INFO("hpx version: %d-%d-%d", hpxMajor, hpxMinor, hpxPatch);

    auto hpxCfg = vector<string>();
    hpxCfg.push_back("hpx.handle_signals=0");
    hpxCfg.push_back("hpx.max_idle_loop_count=1000");
    hpxCfg.push_back("hpx.max_idle_backoff_time=1000");
    
    hpx::init_params initArgs;
    initArgs.cfg = std::move(hpxCfg);    

    ret = hpx::start(argc, argv, initArgs);
    if (false == ret) 
    {
        ERROR("hpx runtime init failed");
        return -3;
    }

    getchar();

    WARN("hpx demostration exiting...");

    return hpx::stop();
}

Depending on the launch command, my HPX program fails in different ways: it crashes with a segmentation fault, aborts with an MPI initialization error, or never enters the hpx_main function.

Segmentation Fault with:

python3 hpxrun.py ./HpxDemo_d.elf --parcelport lci -l 2 -r srun

Debugging Information:

python3 hpxrun.py ./HpxDemo_d.elf --parcelport lci -l 2 -r srun
2024-08-27 09:55:25.860 - INFO - hpx demostration running...
2024-08-27 09:55:25.860 - INFO - hpx version: 1-10-0
2024-08-27 09:55:25.864 - INFO - hpx demostration running...
2024-08-27 09:55:25.864 - INFO - hpx version: 1-10-0
srun: error: DellNode0: tasks 0-1: Segmentation fault (core dumped)
srun: Terminating StepId=102.0
Process 0 failed with an unexpected error code of 139 (expected 0) <_io.TextIOWrapper name='<stderr>' mode='w' encoding='utf-8'>

Program not entering hpx_main with:

srun --nodelist=DellNode0,AsusNode1 --ntasks=2 --ntasks-per-node=1 -p MyTestRxe HpxDemo_d.elf

Debugging Information:

srun --nodelist=DellNode0,AsusNode1 --ntasks=2 --ntasks-per-node=1 -p MyTestRxe HpxDemo_d.elf
2024-08-27 10:19:40.428 - INFO - hpx demostration running...
2024-08-27 10:19:40.428 - INFO - hpx version: 1-10-0
2024-08-27 10:19:40.428 - INFO - hpx demostration running...
2024-08-27 10:19:40.428 - INFO - hpx version: 1-10-0

Segmentation Fault with:

srun --mpi=pmix --nodelist=DellNode0,AsusNode1 --ntasks=2 --ntasks-per-node=1 -p MyTestRxe HpxDemo_d.elf

Debugging Information:

srun --mpi=pmix --nodelist=DellNode0,AsusNode1 --ntasks=2 --ntasks-per-node=1 -p MyTestRxe HpxDemo_d.elf
2024-08-27 09:57:50.816 - INFO - hpx demostration running...
2024-08-27 09:57:50.816 - INFO - hpx version: 1-10-0
2024-08-27 09:57:50.829 - INFO - hpx demostration running...
2024-08-27 09:57:50.829 - INFO - hpx version: 1-10-0
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:           AsusNode1
  Local device:         rocep2s0
  Local port:           1
  CPCs attempted:       udcm
--------------------------------------------------------------------------
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:           DellNode0
  Local device:         rocep5s0
  Local port:           1
  CPCs attempted:       udcm
--------------------------------------------------------------------------
[AsusNode1:441297] *** Process received signal ***
[AsusNode1:441297] Signal: Segmentation fault (11)
[AsusNode1:441297] Signal code: Address not mapped (1)
[AsusNode1:441297] Failing at address: 0x10
[AsusNode1:441297] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7f8b26785420]
[AsusNode1:441297] [ 1] /lib/x86_64-linux-gnu/libfabric.so.1(+0x74b87)[0x7f8b23dabb87]
[AsusNode1:441297] [ 2] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_mtl_ofi.so(+0x7fd5)[0x7f8b23ecefd5]
[AsusNode1:441297] [ 3] /lib/x86_64-linux-gnu/libmpi.so.40(ompi_mtl_base_select+0xa4)[0x7f8b261aac54]
[AsusNode1:441297] [ 4] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pml_cm.so(+0x5e4e)[0x7f8b23f67e4e]
[AsusNode1:441297] [ 5] /lib/x86_64-linux-gnu/libmpi.so.40(mca_pml_base_select+0x1e4)[0x7f8b261b9674]
[AsusNode1:441297] [ 6] /lib/x86_64-linux-gnu/libmpi.so.40(ompi_mpi_init+0x6ca)[0x7f8b261c678a]
[AsusNode1:441297] [ 7] /lib/x86_64-linux-gnu/libmpi.so.40(PMPI_Init_thread+0x99)[0x7f8b2616a219]
[AsusNode1:441297] [ 8] /usr/local/lib/Skillwon/Hpx/lib/libhpx_core.so(_ZN3hpx4util15mpi_environment4initEPiPPPciiRi+0x5f)[0x7f8b26c81f5f]
[AsusNode1:441297] [ 9] /usr/local/lib/Skillwon/Hpx/lib/libhpx_core.so(_ZN3hpx4util15mpi_environment4initEPiPPPcRNS0_21runtime_configurationE+0x336)[0x7f8b26c83c26]
[AsusNode1:441297] [10] /usr/local/lib/Skillwon/Hpx/lib/libhpx.so.1(_ZN3hpx4util21command_line_handling4callERKNS_15program_options19options_descriptionEiPPcRSt6vectorISt10shared_ptrINS_10components23component_registry_baseEESaISC_EE+0xb52)[0x7f8b272e0192]
[AsusNode1:441297] [11] /usr/local/lib/Skillwon/Hpx/lib/libhpx.so.1(_ZN3hpx6detail12run_or_startERKNS_8functionIFiRNS_15program_options13variables_mapEELb0EEEiPPcRKNS_11init_paramsEb+0x3a9)[0x7f8b273252c9]
[AsusNode1:441297] [12] /usr/local/lib/Skillwon/Hpx/lib/libhpx.so.1(_ZN3hpx6detail10start_implERKNS_8functionIFiRNS_15program_options13variables_mapEELb0EEEiPPcRKNS_11init_paramsEPKcSA_+0x6c)[0x7f8b273263ac]
[AsusNode1:441297] [13] HpxDemo_d.elf(+0xa6c3e)[0x5555a7cedc3e]
[AsusNode1:441297] [14] HpxDemo_d.elf(+0xa665a)[0x5555a7ced65a]
[AsusNode1:441297] [15] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f8b26257083]
[AsusNode1:441297] [16] HpxDemo_d.elf(+0x1828e)[0x5555a7c5f28e]
[AsusNode1:441297] *** End of error message ***
[DellNode0:449751] *** Process received signal ***
[DellNode0:449751] Signal: Segmentation fault (11)
[DellNode0:449751] Signal code: Address not mapped (1)
[DellNode0:449751] Failing at address: 0x10
[DellNode0:449751] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7f17069cd420]
[DellNode0:449751] [ 1] /usr/lib/x86_64-linux-gnu/libfabric.so.1(+0x74b87)[0x7f1704067b87]
[DellNode0:449751] [ 2] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_mtl_ofi.so(+0x7fd5)[0x7f1704193fd5]
[DellNode0:449751] [ 3] /usr/lib/x86_64-linux-gnu/libmpi.so.40(ompi_mtl_base_select+0xa4)[0x7f17063f2c54]
[DellNode0:449751] [ 4] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pml_cm.so(+0x5e4e)[0x7f17042d5e4e]
[DellNode0:449751] [ 5] /usr/lib/x86_64-linux-gnu/libmpi.so.40(mca_pml_base_select+0x1e4)[0x7f1706401674]
[DellNode0:449751] [ 6] /usr/lib/x86_64-linux-gnu/libmpi.so.40(ompi_mpi_init+0x6ca)[0x7f170640e78a]
[DellNode0:449751] [ 7] /usr/lib/x86_64-linux-gnu/libmpi.so.40(PMPI_Init_thread+0x99)[0x7f17063b2219]
[DellNode0:449751] [ 8] /usr/local/lib/Skillwon/Hpx/lib/libhpx_core.so(_ZN3hpx4util15mpi_environment4initEPiPPPciiRi+0x5f)[0x7f1706ecaf5f]
[DellNode0:449751] [ 9] /usr/local/lib/Skillwon/Hpx/lib/libhpx_core.so(_ZN3hpx4util15mpi_environment4initEPiPPPcRNS0_21runtime_configurationE+0x336)[0x7f1706eccc26]
[DellNode0:449751] [10] /usr/local/lib/Skillwon/Hpx/lib/libhpx.so.1(_ZN3hpx4util21command_line_handling4callERKNS_15program_options19options_descriptionEiPPcRSt6vectorISt10shared_ptrINS_10components23component_registry_baseEESaISC_EE+0xb52)[0x7f1707529192]
[DellNode0:449751] [11] /usr/local/lib/Skillwon/Hpx/lib/libhpx.so.1(_ZN3hpx6detail12run_or_startERKNS_8functionIFiRNS_15program_options13variables_mapEELb0EEEiPPcRKNS_11init_paramsEb+0x3a9)[0x7f170756e2c9]
[DellNode0:449751] [12] /usr/local/lib/Skillwon/Hpx/lib/libhpx.so.1(_ZN3hpx6detail10start_implERKNS_8functionIFiRNS_15program_options13variables_mapEELb0EEEiPPcRKNS_11init_paramsEPKcSA_+0x6c)[0x7f170756f3ac]
[DellNode0:449751] [13] HpxDemo_d.elf(+0xa6c3e)[0x5592d5b19c3e]
[DellNode0:449751] [14] HpxDemo_d.elf(+0xa665a)[0x5592d5b1965a]
[DellNode0:449751] [15] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f170649f083]
[DellNode0:449751] [16] HpxDemo_d.elf(+0x1828e)[0x5592d5a8b28e]
[DellNode0:449751] *** End of error message ***
srun: error: AsusNode1: task 1: Segmentation fault (core dumped)
srun: error: DellNode0: task 0: Segmentation fault (core dumped)

Failure without a segmentation fault (error in MPI_Init_thread) with:

srun --mpi=pmi2 --nodelist=DellNode0,AsusNode1 --ntasks=2 --ntasks-per-node=1 -p MyTestRxe HpxDemo_d.elf

Debugging Information:

srun --mpi=pmi2 --nodelist=DellNode0,AsusNode1 --ntasks=2 --ntasks-per-node=1 -p MyTestRxe HpxDemo_d.elf
2024-08-27 10:01:23.088 - INFO - hpx demostration running...
2024-08-27 10:01:23.088 - INFO - hpx version: 1-10-0
2024-08-27 10:01:23.091 - INFO - hpx demostration running...
2024-08-27 10:01:23.091 - INFO - hpx version: 1-10-0
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[DellNode0:450384] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
srun: error: DellNode0: task 0: Exited with exit code 1
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[AsusNode1:441818] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
srun: error: AsusNode1: task 1: Exited with exit code 1

GDB Debugging with pmix:

srun --exclusive --pty --mpi=pmix --nodelist=DellNode0,AsusNode1 --ntasks=2 --ntasks-per-node=1 -p MyTestRxe gdb -ex=run HpxDemo_d.elf

Debugging Information:

srun --exclusive --pty --mpi=pmix --nodelist=DellNode0,AsusNode1 --ntasks=2 --ntasks-per-node=1 -p MyTestRxe gdb -ex=run HpxDemo_d.elf
GNU gdb (Ubuntu 9.2-0ubuntu1~20.04.2) 9.2
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from HpxDemo_d.elf...
Starting program: /home/skillwon/Work/Test/Hpx/1-distributed/bin/debug/HpxDemo_d.elf 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5433700 (LWP 456154)]
2024-08-27 10:38:55.237 - INFO - hpx demostration running...
2024-08-27 10:38:55.237 - INFO - hpx version: 1-10-0
[New Thread 0x7ffff51f9700 (LWP 456155)]
[New Thread 0x7ffff4915700 (LWP 456156)]
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:           DellNode0
  Local device:         rocep5s0
  Local port:           1
  CPCs attempted:       udcm
--------------------------------------------------------------------------

Thread 1 "HpxDemo_d.elf" received signal SIGSEGV, Segmentation fault.
0x00007ffff3663b87 in ?? () from /usr/lib/x86_64-linux-gnu/libfabric.so.1
(gdb) bt
#0  0x00007ffff3663b87 in ?? () from /usr/lib/x86_64-linux-gnu/libfabric.so.1
#1  0x00007ffff378ffd5 in ?? () from /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_mtl_ofi.so
#2  0x00007ffff59eec54 in ompi_mtl_base_select () from /usr/lib/x86_64-linux-gnu/libmpi.so.40
#3  0x00007ffff38d1e4e in ?? () from /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pml_cm.so
#4  0x00007ffff59fd674 in mca_pml_base_select () from /usr/lib/x86_64-linux-gnu/libmpi.so.40
#5  0x00007ffff5a0a78a in ompi_mpi_init () from /usr/lib/x86_64-linux-gnu/libmpi.so.40
#6  0x00007ffff59ae219 in PMPI_Init_thread () from /usr/lib/x86_64-linux-gnu/libmpi.so.40
#7  0x00007ffff67455f7 in hpx::util::mpi_environment::init (minimal=3, required=3, provided=@0x7ffff6acaa7c: 3) at /home/skillwon/Work/3rdParty/hpx-1.10.0/libs/core/mpi_base/src/mpi_environment.cpp:144
#8  0x00007ffff6745932 in hpx::util::mpi_environment::init (argc=0x7fffffffca8c, argv=0x7fffffffca80, rtcfg=...) at /home/skillwon/Work/3rdParty/hpx-1.10.0/libs/core/mpi_base/src/mpi_environment.cpp:214
#9  0x00007ffff75a440a in hpx::util::command_line_handling::call (this=0x7fffffffd0b0, desc_cmdline=..., argc=1, argv=0x7fffffffd738, component_registries=std::vector of length 0, capacity 0)
    at /home/skillwon/Work/3rdParty/hpx-1.10.0/libs/full/command_line_handling/src/command_line_handling.cpp:996
#10 0x00007ffff762385c in hpx::detail::run_or_start(hpx::function<int (hpx::program_options::variables_map&), false> const&, int, char**, hpx::init_params const&, bool) (f=..., argc=1, argv=0x7fffffffd738, 
    params=..., blocking=false) at /home/skillwon/Work/3rdParty/hpx-1.10.0/libs/full/init_runtime/src/hpx_init.cpp:902
#11 0x00007ffff761ece4 in hpx::detail::start_impl(hpx::function<int (hpx::program_options::variables_map&), false> const&, int, char**, hpx::init_params const&, char const*, char**) (f=..., argc=1, 
    argv=0x7fffffffd738, params=..., hpx_prefix=0x555555619c61 "", env=0x7fffffffd748) at /home/skillwon/Work/3rdParty/hpx-1.10.0/libs/full/init_runtime/src/hpx_init.cpp:193
#12 0x000055555561095e in hpx::start (argc=1, argv=0x7fffffffd738, params=...) at /home/skillwon/Work/Test/Hpx/1-distributed/inc/hpx/hpx_start_impl.hpp:94
#13 0x000055555561037a in main (argc=1, argv=0x7fffffffd738) at /home/skillwon/Work/Test/Hpx/1-distributed/src/main.cpp:112
(gdb) frame 1
#1  0x00007ffff378ffd5 in ?? () from /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_mtl_ofi.so
(gdb) info locals
No symbol table info available.
(gdb) frame7 
Undefined command: "frame7".  Try "help".
(gdb) frame 7 
#7  0x00007ffff67455f7 in hpx::util::mpi_environment::init (minimal=3, required=3, provided=@0x7ffff6acaa7c: 3) at /home/skillwon/Work/3rdParty/hpx-1.10.0/libs/core/mpi_base/src/mpi_environment.cpp:144
144                 retval = MPI_Init_thread(nullptr, nullptr, required, &provided);
(gdb) info locals 
is_initialized = 0
retval = 0
(gdb) frame 8
#8  0x00007ffff6745932 in hpx::util::mpi_environment::init (argc=0x7fffffffca8c, argv=0x7fffffffca80, rtcfg=...) at /home/skillwon/Work/3rdParty/hpx-1.10.0/libs/core/mpi_base/src/mpi_environment.cpp:214
214                 init(argc, argv, required, required, provided_threading_flag_);
(gdb) info locals 
this_rank = -1
required = 3
retval = -2147483648
max_tag_p = 0x1
flag = 2
(gdb) frame 9
#9  0x00007ffff75a440a in hpx::util::command_line_handling::call (this=0x7fffffffd0b0, desc_cmdline=..., argc=1, argv=0x7fffffffd738, component_registries=std::vector of length 0, capacity 0)
    at /home/skillwon/Work/3rdParty/hpx-1.10.0/libs/full/command_line_handling/src/command_line_handling.cpp:996
996                 util::mpi_environment::init(&argc, &argv, rtcfg_);
(gdb) info locals 
args = std::vector of length 0, capacity 4
cfgmap = {config_ = std::map with 3 elements = {["hpx.handle_signals"] = "0", ["hpx.max_idle_backoff_time"] = "1000", ["hpx.max_idle_loop_count"] = "1000"}}
error_mode = (hpx::util::commandline_error_mode::allow_unregistered | hpx::util::commandline_error_mode::ignore_aliases)
prepend_command_line = ""
plugin_registries = std::vector of length 44, capacity 64 = {std::shared_ptr<hpx::plugins::plugin_registry_base> (use count 1314869317, weak count 1415531860) = {get() = 0x7fffffffdca1}, 
  std::shared_ptr<hpx::plugins::plugin_registry_base> (use count 1599294291, weak count 1599227215) = {get() = 0x7fffffffdcc8}, 
  std::shared_ptr<hpx::plugins::plugin_registry_base> (use count 1229086549, weak count 1899840589) = {get() = 0x7fffffffdcfb}, 
  std::shared_ptr<hpx::plugins::plugin_registry_base> (use count 1296125524, weak count 1933655364) = {get() = 0x7fffffffdd31}, 
  std::shared_ptr<hpx::plugins::plugin_registry_base> (use count 1414090050, weak count 1380533342) = {get() = 0x7fffffffdd60}, 
  std::shared_ptr<hpx::plugins::plugin_registry_base> (use count 1313162319, weak count 1027951936) = {get() = 0x7fffffffddb6}, 
  std::shared_ptr<hpx::plugins::plugin_registry_base> (use count 1414750018, weak count 1599361600) = {get() = 0x7fffffffddee}, 
  std::shared_ptr<hpx::plugins::plugin_registry_base> (use count 1314869317, weak count 1279607886) = {get() = 0x7fffffffde28}, 
  std::shared_ptr<hpx::plugins::plugin_registry_base> (use count 1313169218, weak count 1230266179) = {get() = 0x7fffffffde81}, 
  std::shared_ptr<hpx::plugins::plugin_registry_base> (use count 1028346689, weak count 842018863) = {get() = 0x7fffffffdebe}, 
  std::shared_ptr<hpx::plugins::plugin_registry_base> (use count 1146047839, weak count 1027951700) = {get() = 0x7fffffffdee4}, 
  std::shared_ptr<hpx::plugins::plugin_registry_base> (use count 1835823199, weak count 1970041694) = {get() = 0x7fffffffdf21}, 
  std::shared_ptr<hpx::plugins::plugin_registry_base> (use count 1818848115, weak count 1852798827) = {get() = 0x7fffffffdf86}, 
  std::shared_ptr<hpx::plugins::plugin_registry_base> (use count 1095786306, weak count 1414091857) = {get() = 0x7fffffffdfa5}, 
  std::shared_ptr<hpx::plugins::plugin_registry_base> (use count 1431199554, weak count 1330536268) = {get() = 0x7fffffffdfd8}, 
  std::shared_ptr<hpx::plugins::plugin_registry_base> (use count 859653973, weak count 792551167) = {get() = 0x7fffffffe000}, 
  std::shared_ptr<hpx::plugins::plugin_registry_base> (use count 1598899540, weak count 1146113363) = {get() = 0x7fffffffe015}, 
  std::shared_ptr<hpx::plugins::plugin_registry_base> (use count 1229086549, weak count 1281311821) = {get() = 0x7fffffffe03d}, 
  std::shared_ptr<hpx::plugins::plugin_registry_base> (use count 1330667353, weak count 859657297) = {get() = 0x7fffffffe065}, 
  std::shared_ptr<hpx::plugins::plugin_registry_base> (use count 1598116420, weak count 1414483777) = {get() = 0x7fffffffe08a}, 
  std::shared_ptr<hpx::plugins::plugin_registry_base> (use count 1883065943, weak count 1275096416) = {get() = 0x7fffffffe0eb}, 
  std::shared_ptr<hpx::plugins::plugin_registry_base> (use count 1146047839, weak count 1027951700) = {get() = 0x7fffffffe12c}, 
  std::shared_ptr<hpx::plugins::plugin_registry_base> (use count 1932485997, weak count 1819044202) = {get() = 0x7fffffffe14f}, 
  std::shared_ptr<hpx::plugins::plugin_registry_base> (use count 1229086549, weak count 1415529549) = {get() = 0x7fffffffe178}, 
  std::shared_ptr<hpx::plugins::plugin_registry_base> (use count 1936866643, weak count 1681535036) = {get() = 0x7fffffffe1a3}, 
  std::shared_ptr<hpx::plugins::plugin_registry_base> (use count 1145652047, weak count 1124085820) = {get() = 0x7fffffffe7d3}, 
  std::shared_ptr<hpx::plugins::plugin_registry_base> (use count 1163281740, weak count 1701068107) = {get() = 0x7fffffffe7f1}, 
  std::shared_ptr<hpx::plugins::plugin_registry_base> (use count 7368052, weak count 1381321810) = {get() = 0x7fffffffe811}, 
  std::shared_ptr<hpx::plugins::plugin_registry_base> (use count 1298092613, weak count 1130321984) = {get() = 0x7fffffffe830}, 
  std::shared_ptr<hpx::plugins::plugin_registry_base> (use count 1230259013, weak count 826101326) = {get() = 0x7fffffffe859}, 
  std::shared_ptr<hpx::plugins::plugin_registry_base> (use count 3161419, weak count 1598636875) = {get() = 0x7fffffffe8a4}, 
  std::shared_ptr<hpx::plugins::plugin_registry_base> (use count 1599227222, weak count 843666004) = {get() = 0x7fffffffe8d9}, 
  std::shared_ptr<hpx::plugins::plugin_registry_base> (use count 1414747215, weak count 1028669777) = {get() = 0x7fffffffe92a}, 
  std::shared_ptr<hpx::plugins::plugin_registry_base> (use count 1130319445, weak count 1598901582) = {get() = 0x7fffffffe97b}, 
  std::shared_ptr<hpx::plugins::plugin_registry_base> (use count 1966030149, weak count 1647276658) = {get() = 0x7fffffffe9bf}, 
  std::shared_ptr<hpx::plugins::plugin_registry_base> (use count 1599227222, weak count 843666004) = {get() = 0x7fffffffea04}, 
  std::shared_ptr<hpx::plugins::plugin_registry_base> (use count 1230462809, weak count 1329815373) = {get() = 0x7fffffffea57}, 
  std::shared_ptr<hpx::plugins::plugin_registry_base> (use count 1598378575, weak count 1598575428) = {get() = 0x7fffffffea81}, 
  std::shared_ptr<hpx::plugins::plugin_registry_base> (use count 1229735236, weak count 1094538322) = {get() = 0x7fffffffeaf0}, 
  std::shared_ptr<hpx::plugins::plugin_registry_base> (use count 1230983237, weak count 3161411) = {get() = 0x7fffffffeb24}, 
  std::shared_ptr<hpx::plugins::plugin_registry_base> (use count 1396920146, weak count 1392521788) = {get() = 0x7fffffffeb4f}, 
  std::shared_ptr<hpx::plugins::plugin_registry_base> (use count 1397048399, weak count 1275081276) = {get() = 0x7fffffffeb73}, 
  std::shared_ptr<hpx::plugins::plugin_registry_base> (use count 1598967625, weak count 960316488) = {get() = 0x7fffffffeb8a}, 
  std::shared_ptr<hpx::plugins::plugin_registry_base> (use count 1414090050, weak count 1397704798) = {get() = 0x7fffffffebb6}}
help = {static m_default_line_length = 80, 
  m_caption = "\200xy\365\377\177\000\000\001\000\000\000UU\000\000P\262y\365\377\177\000\000P\262y\365\377\177\000\000\001\000\000\000UU\000\000\340\311\356UUU\000\000x$\347UUU\000\000\220\017\362UUU\000\000?", 
--Type <RET> for more, q to quit, c to continue without paging--

Could these issues be related to the HPX runtime?
Could someone please help me?

@JiakunYan
Contributor

@phil-skillwon My understanding was that you see the same behavior with only one task. However, all the examples you posted here use 2 tasks. Could you confirm whether your program runs successfully with one task?

@phil-skillwon
Author

@JiakunYan
The test results for a single node and two nodes are almost the same.

@JiakunYan
Contributor

@hkaiser, could you take a look? This does not appear to be a communication bug, since it can also be reproduced on a single node.

@phil-skillwon
Author

The debug build crashes, but the release build runs normally.
