prov/shm: Add unmap_region function #10364

Merged
merged 1 commit into from
Oct 24, 2024

Conversation

zachdworkin
Contributor

This function is mainly for the niche case where on progress_connreq a peer is added to the map with its region needing to be mapped, and then after mapping it, it's discovered that the newly mapped peer's process died. In this case we need to unmap them and free any resources that were opened for communicating with them.

With this function in place, smr_map_del can be reworked to use it as common code. This requires changes to smr_av.c, where smr_map_del is called. smr_map_del is now an iterable function, which lets smr_map_cleanup use ofi_rbmap_foreach to clean up only the peers that exist.
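
The cleanup idea can be sketched as follows. This is a self-contained illustration, not the libfabric code: `peer_node`, `peer_foreach`, `peer_unmap_cb`, and `peer_map_cleanup` are hypothetical stand-ins for the red-black map, `ofi_rbmap_foreach`, the per-peer unmap callback, and `smr_map_cleanup`. The point is that the walk visits only peers that are actually present in the tree, rather than scanning every possible peer id.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a node in the peer map (a plain binary tree). */
struct peer_node {
	int peer_id;
	int mapped;                 /* 1 while the peer's region is mapped */
	struct peer_node *left;
	struct peer_node *right;
};

typedef void (*peer_cb)(struct peer_node *node, void *context);

/* Visit every node that exists in the tree, in key order. */
static void peer_foreach(struct peer_node *root, peer_cb func, void *context)
{
	assert(func);
	if (!root)
		return;
	peer_foreach(root->left, func, context);
	func(root, context);
	peer_foreach(root->right, func, context);
}

/* Per-peer callback: release the mapping and count the work done. */
static void peer_unmap_cb(struct peer_node *node, void *context)
{
	int *unmapped = context;

	if (node->mapped) {
		node->mapped = 0;   /* a real provider would unmap the region here */
		(*unmapped)++;
	}
}

/* Clean up only the peers present in the tree; returns how many were unmapped. */
static int peer_map_cleanup(struct peer_node *root)
{
	int unmapped = 0;

	peer_foreach(root, peer_unmap_cb, &unmapped);
	return unmapped;
}
```

Calling `peer_map_cleanup` a second time on the same tree does no work, because the callback skips peers that are no longer mapped.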

@shijin-aws
Contributor

bot:aws:retest

@zachdworkin
Contributor Author

@shijin-aws what's the AWS failure?

@shijin-aws
Contributor

Lots of failures for Open MPI RMA test, one example is

--------------------------------- Captured Err ---------------------------------
2024-09-25 20:25:14,589 - INFO - utils - Running on 2 nodes, 36 processes per node
2024-09-25 20:25:14,589 - INFO - utils - Executing command: ssh -oStrictHostKeyChecking=no 172.31.70.196 "source /home/ec2-user/PortaFiducia/env/bin/activate; source /etc/profile; cat /sys/class/infiniband/*/ports/*/hw_counters/tx_bytes"
2024-09-25 20:25:14,817 - INFO - utils - Executing command: ssh -oStrictHostKeyChecking=no 172.31.70.196 "source /home/ec2-user/PortaFiducia/env/bin/activate; source /etc/profile; cat /sys/class/infiniband/*/ports/*/hw_counters/rx_bytes"
2024-09-25 20:25:15,074 - INFO - utils - Executing command: export PATH=/home/ec2-user/PortaFiducia/build/libraries/openmpi/v4.1.x/install/bin:$PATH;export LD_LIBRARY_PATH=/home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10364/install/libfabric/lib;/home/ec2-user/PortaFiducia/build/libraries/openmpi/v4.1.x/install/bin/mpirun --wdir . -n 2 --hostfile /home/ec2-user/PortaFiducia/hostfile --map-by ppr:1:node --timeout 1800 -x LD_LIBRARY_PATH=/home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10364/install/libfabric/lib -x PATH  /home/ec2-user/PortaFiducia/build/workloads/omb/openmpi-v4.1.7rc1-v4.1.x/install/libexec/osu-micro-benchmarks/mpi/one-sided/osu_get_acc_latency   2>&1 | tee /home/ec2-user/PortaFiducia/build/workloads/omb/openmpi-v4.1.7rc1-v4.1.x/run/one-sided/osu_get_acc_latency/node2-ppn1.txt
2024-09-25 20:27:23,143 - INFO - utils - mpirun output:
# OSU MPI_Get_accumulate latency Test v7.0-lrbison3
# Window creation: MPI_Win_create
# Synchronization: MPI_Win_lock/unlock
# Size          Latency (us)
1                     115.87
2                     115.86
4                     115.92
8                     115.92
16                    116.45
32                    116.40
64                    116.48
128                   116.48
256                   116.81
512                   117.18
1024                  118.10
2048                  119.34
4096                  121.69
8192                  164.23
16384                 250.02
32768                 423.76
65536                 840.90
131072               1673.14
262144               3347.70
524288               6700.41
1048576             13422.45
2097152             26848.66
4194304             53889.38
[ip-172-31-70-196:09307] *** Process received signal ***
[ip-172-31-70-196:09307] Signal: Segmentation fault (11)
[ip-172-31-70-196:09307] Signal code: Address not mapped (1)
[ip-172-31-70-196:09307] Failing at address: 0x7fd84d852048
[ip-172-31-70-196:09307] [ 0] /lib64/libpthread.so.0(+0x118e0)[0x7fd860e488e0]
[ip-172-31-70-196:09307] [ 1] /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10364/install/libfabric/lib/libfabric.so.2(+0xbd241)[0x7fd84fb64241]
[ip-172-31-70-196:09307] [ 2] /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10364/install/libfabric/lib/libfabric.so.2(+0xbd4c6)[0x7fd84fb644c6]
[ip-172-31-70-196:09307] [ 3] /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10364/install/libfabric/lib/libfabric.so.2(+0xbb6a8)[0x7fd84fb626a8]
[ip-172-31-70-196:09307] [ 4] /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10364/install/libfabric/lib/libfabric.so.2(+0x69fcf)[0x7fd84fb10fcf]
[ip-172-31-70-196:09307] [ 5] /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10364/install/libfabric/lib/libfabric.so.2(+0x6a6ce)[0x7fd84fb116ce]
[ip-172-31-70-196:09307] [ 6] /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10364/install/libfabric/lib/libfabric.so.2(+0x6a7a5)[0x7fd84fb117a5]
[ip-172-31-70-196:09307] [ 7] /home/ec2-user/PortaFiducia/build/libraries/openmpi/v4.1.x/install/lib/openmpi/mca_btl_ofi.so(mca_btl_ofi_finalize+0xbd)[0x7fd85415419d]
[ip-172-31-70-196:09307] [ 8] /home/ec2-user/PortaFiducia/build/libraries/openmpi/v4.1.x/install/lib/libopen-pal.so.40(+0x6fee4)[0x7fd86059aee4]
[ip-172-31-70-196:09307] [ 9] /home/ec2-user/PortaFiducia/build/libraries/openmpi/v4.1.x/install/lib/libopen-pal.so.40(mca_base_framework_close+0x5d)[0x7fd860582ecd]
[ip-172-31-70-196:09307] [10] /home/ec2-user/PortaFiducia/build/libraries/openmpi/v4.1.x/install/lib/libopen-pal.so.40(mca_base_framework_close+0x5d)[0x7fd860582ecd]
[ip-172-31-70-196:09307] [11] /home/ec2-user/PortaFiducia/build/libraries/openmpi/v4.1.x/install/lib/libmpi.so.40(ompi_mpi_finalize+0x749)[0x7fd8610a6819]
[ip-172-31-70-196:09307] [12] /home/ec2-user/PortaFiducia/build/workloads/omb/openmpi-v4.1.7rc1-v4.1.x/install/libexec/osu-micro-benchmarks/mpi/one-sided/osu_get_acc_latency[0x4029dc]
[ip-172-31-70-196:09307] [13] /lib64/libc.so.6(__libc_start_main+0xea)[0x7fd860aab13a]
[ip-172-31-70-196:09307] [14] /home/ec2-user/PortaFiducia/build/workloads/omb/openmpi-v4.1.7rc1-v4.1.x/install/libexec/osu-micro-benchmarks/mpi/one-sided/osu_get_acc_latency[0x402ada]
[ip-172-31-70-196:09307] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node ip-172-31-70-196 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

@shijin-aws
Contributor

I can try reproducing locally to get more logs. But it seems to me the segfaults happened in some locking steps

@zachdworkin
Contributor Author

I can test it locally. We don't test OMPI in our CI, so those tests weren't run on our end.

@shijin-aws
Contributor

Can any provider other than efa access shm inside an Open MPI run now? Open MPI cannot run with shm by itself, as shm doesn't support FI_REMOTE_COMM.

@zachdworkin
Contributor Author

I have a "hack" for OMPI to force it to run shm without needing another provider. I just remove that dependency and then run tests that don't need it. I'm pretty sure these failures are all related, so fixing one of those tests will likely fix most or all of them.

@shijin-aws
Contributor

Thanks. The interesting thing is that the error only happens in our inter-node test. In that case we only touch shm during control-interface operations (av_insert, ep open, mr reg), with no data transfer.

@zachdworkin
Contributor Author

Do you have a stack trace with the av_insert path? I removed the lock inside of map_to_region and moved it outside, to the callers. I only moved the locks in prov/shm. I think I need to take a look at where it is called in prov/efa. That's likely the bug.

@shijin-aws
Contributor

The bug is triggered at the MPI_Finalize stage, I think it's in the clean up stage

@zachdworkin
Contributor Author

I wonder if there is a path we don't test in Intel CI that removes nodes from the rb tree after I have already removed them with the updated call. I'll take a look there.

@zachdworkin
Contributor Author

@shijin-aws I have been unsuccessful in recreating the same bug locally. Do you know which call is causing the segfault? I assume it's in smr_ep_close.

I also found a missing lock: smr_ep.c:227 needs to be locked beforehand and unlocked afterwards.

@j-xiong
Contributor

j-xiong commented Sep 27, 2024

@zachdworkin Is this new push just for debugging?

@zachdworkin
Contributor Author

It's to fix the missing lock at smr_ep.c:227, and for debugging.

@shijin-aws
Contributor

@zachdworkin I will get back to you today on this

@shijin-aws
Contributor

@zachdworkin The error disappears with your latest push

(env) ubuntu@ip-172-31-40-58:~/PortaFiducia/build/libraries/libfabric/main/source/libfabric$ mpirun --prefix /opt/amazon/openmpi  -n 2 -x FI_LOG_LEVEL=warn -x LD_LIBRARY_PATH ~/PortaFiducia/build/workloads/omb/openmpi-v4.1.6-installer/install/libexec/osu-micro-benchmarks/mpi/one-sided/osu_acc_latency 
No protocol specified
No protocol specified
No protocol specified
start av insert
start av insert
start av insert
start av insert
# MPI_Datatype: MPI_INT
# ACC Operation: MPI_SUM, Datatype: MPI_INT.
# OSU MPI_Accumulate latency Test v7.0-lrbison3
# Window creation: MPI_Win_allocate
# Synchronization: MPI_Win_flush
# Size          Latency (us)
# MPI_Datatype: MPI_INT
# ACC Operation: MPI_SUM, Datatype: MPI_INT.
start av insert
start av insert
start av insert
start av insert
map to region
done map to region
map to region
done map to region
4                       0.21
8                       0.21
16                      0.20
32                      0.21
64                      0.21
128                     0.22
256                     0.22
512                     0.22
1024                    0.24
2048                    0.27
4096                    0.34
8192                    0.45
16384                   0.71
32768                   1.23
65536                   2.20
131072                  4.47
262144                  9.66
524288                 19.08
1048576                37.95
2097152                73.01
4194304               145.30
av remove
av remove
av remove
av remove
start ep close
start ep close
done
done
start ep close
done
start ep close
done
av remove
av remove
av remove
av remove

@zachdworkin
Contributor Author

Excellent! I suspect that the missing lock was the issue. I will re-push without the prints. Thanks!

@shijin-aws
Contributor

Your latest push failed at build time

INFO     root:utils.py:69 Executing command: make -j
ERROR    root:utils.py:94 Command make -j failed with error:
src/xpmem.c: In function ‘ofi_xpmem_init’:
src/xpmem.c:68:12: warning: variable ‘low’ set but not used [-Wunused-but-set-variable]
  uintptr_t low, high;
            ^~~
prov/shm/src/smr_util.c: In function ‘smr_map_to_endpoint’:
prov/shm/src/smr_util.c:448:17: error: ‘map’ undeclared (first use in this function); did you mean ‘mmap’?
  ofi_spin_held(&map->lock);
                 ^~~
                 mmap
prov/shm/src/smr_util.c:448:17: note: each undeclared identifier is reported only once for each function it appears in
make[1]: *** [prov/shm/src/src_libfabric_la-smr_util.lo] Error 1
make[1]: *** Waiting for unfinished jobs....
make: *** [all] Error 2
and output:
make  all-am
make[1]: Entering directory `/home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10364-debug-mempoison-asan/source/libfabric'
  CC       util/strerror.o

@zachdworkin
Contributor Author

@shijin-aws can you share the aws failure?

@zachdworkin
Contributor Author

bot:aws:retest

@shijin-aws
Contributor

@zachdworkin This time the segfaults happened again

INFO     root:utils.py:1015 Running on 2 nodes, 36 processes per node
INFO     root:utils.py:69 Executing command: ssh -oStrictHostKeyChecking=no 172.31.87.154 "source /home/ec2-user/PortaFiducia/env/bin/activate; source /etc/profile; cat /sys/class/infiniband/*/ports/*/hw_counters/tx_bytes"
INFO     root:utils.py:69 Executing command: ssh -oStrictHostKeyChecking=no 172.31.87.154 "source /home/ec2-user/PortaFiducia/env/bin/activate; source /etc/profile; cat /sys/class/infiniband/*/ports/*/hw_counters/rx_bytes"
INFO     root:utils.py:69 Executing command: export PATH=/home/ec2-user/PortaFiducia/build/libraries/openmpi/v4.1.x/install/bin:$PATH;export LD_LIBRARY_PATH=/home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10364/install/libfabric/lib;/home/ec2-user/PortaFiducia/build/libraries/openmpi/v4.1.x/install/bin/mpirun --wdir . -n 2 --hostfile /home/ec2-user/PortaFiducia/hostfile --map-by ppr:1:node --timeout 1800 -x LD_LIBRARY_PATH=/home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10364/install/libfabric/lib -x PATH  /home/ec2-user/PortaFiducia/build/workloads/omb/openmpi-v4.1.7rc1-v4.1.x/install/libexec/osu-micro-benchmarks/mpi/one-sided/osu_get_acc_latency   2>&1 | tee /home/ec2-user/PortaFiducia/build/workloads/omb/openmpi-v4.1.7rc1-v4.1.x/run/one-sided/osu_get_acc_latency/node2-ppn1.txt
INFO     root:utils.py:618 mpirun output:
# OSU MPI_Get_accumulate latency Test v7.0-lrbison3
# Window creation: MPI_Win_create
# Synchronization: MPI_Win_lock/unlock
# Size          Latency (us)
1                     136.01
2                     135.85
4                     135.63
8                     135.61
16                    135.90
32                    135.94
64                    136.12
128                   136.17
256                   136.40
512                   136.81
1024                  138.07
2048                  140.52
4096                  146.00
8192                  196.13
16384                 298.61
32768                 503.24
65536                1002.94
131072               2000.59
262144               4003.13
524288               8019.07
1048576             16049.95
2097152             32089.13
4194304             64316.38
[ip-172-31-87-154:12661] *** Process received signal ***
[ip-172-31-87-154:12661] Signal: Segmentation fault (11)
[ip-172-31-87-154:12661] Signal code: Address not mapped (1)
[ip-172-31-87-154:12661] Failing at address: 0x7f827c1f6048
[ip-172-31-87-154:12661] [ 0] /lib64/libpthread.so.0(+0x118e0)[0x7f828b8528e0]
[ip-172-31-87-154:12661] [ 1] /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10364/install/libfabric/lib/libfabric.so.2(+0xbd341)[0x7f827e508341]
[ip-172-31-87-154:12661] [ 2] /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10364/install/libfabric/lib/libfabric.so.2(+0xbd5ec)[0x7f827e5085ec]
[ip-172-31-87-154:12661] [ 3] /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10364/install/libfabric/lib/libfabric.so.2(+0xbb7bf)[0x7f827e5067bf]
[ip-172-31-87-154:12661] [ 4] /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10364/install/libfabric/lib/libfabric.so.2(+0x69fff)[0x7f827e4b4fff]
[ip-172-31-87-154:12661] [ 5] /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10364/install/libfabric/lib/libfabric.so.2(+0x6a6fe)[0x7f827e4b56fe]
[ip-172-31-87-154:12661] [ 6] /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10364/install/libfabric/lib/libfabric.so.2(+0x6a7d5)[0x7f827e4b57d5]
[ip-172-31-87-154:12661] [ 7] /home/ec2-user/PortaFiducia/build/libraries/openmpi/v4.1.x/install/lib/openmpi/mca_btl_ofi.so(mca_btl_ofi_finalize+0xbd)[0x7f827e9a819d]
[ip-172-31-87-154:12661] [ 8] /home/ec2-user/PortaFiducia/build/libraries/openmpi/v4.1.x/install/lib/libopen-pal.so.40(+0x6fee4)[0x7f828afa4ee4]
[ip-172-31-87-154:12661] [ 9] /home/ec2-user/PortaFiducia/build/libraries/openmpi/v4.1.x/install/lib/libopen-pal.so.40(mca_base_framework_close+0x5d)[0x7f828af8cecd]
[ip-172-31-87-154:12661] [10] /home/ec2-user/PortaFiducia/build/libraries/openmpi/v4.1.x/install/lib/libopen-pal.so.40(mca_base_framework_close+0x5d)[0x7f828af8cecd]
[ip-172-31-87-154:12661] [11] /home/ec2-user/PortaFiducia/build/libraries/openmpi/v4.1.x/install/lib/libmpi.so.40(ompi_mpi_finalize+0x749)[0x7f828bab0819]
[ip-172-31-87-154:12661] [12] /home/ec2-user/PortaFiducia/build/workloads/omb/openmpi-v4.1.7rc1-v4.1.x/install/libexec/osu-micro-benchmarks/mpi/one-sided/osu_get_acc_latency[0x4029dc]
[ip-172-31-87-154:12661] [13] /lib64/libc.so.6(__libc_start_main+0xea)[0x7f828b4b513a]
[ip-172-31-87-154:12661] [14] /home/ec2-user/PortaFiducia/build/workloads/omb/openmpi-v4.1.7rc1-v4.1.x/install/libexec/osu-micro-benchmarks/mpi/one-sided/osu_get_acc_latency[0x402ada]
[ip-172-31-87-154:12661] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node ip-172-31-87-154 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
--------------------------------- Captured Out ---------------------------------

@shijin-aws
Contributor

@zachdworkin this time it's an efa unit test failure

efa_unit_test: prov/shm/src/smr_util.c:490: smr_unmap_region: Assertion `ofi_spin_held(&map->lock)' failed.
Aborted (core dumped)

Backtrace is

(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007f5e4b2ac859 in __GI_abort () at abort.c:79
#2  0x00007f5e4b2ac729 in __assert_fail_base (fmt=0x7f5e4b442588 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x55625afe82d3 "ofi_spin_held(&map->lock)", 
    file=0x55625afe81e6 "prov/shm/src/smr_util.c", line=490, function=<optimized out>) at assert.c:92
#3  0x00007f5e4b2bdfd6 in __GI___assert_fail (assertion=0x55625afe82d3 "ofi_spin_held(&map->lock)", file=0x55625afe81e6 "prov/shm/src/smr_util.c", line=490, 
    function=0x55625afe8510 <__PRETTY_FUNCTION__.16821> "smr_unmap_region") at assert.c:101
#4  0x000055625af00c63 in smr_unmap_region (prov=0x55625b056680 <smr_prov>, map=0x55625be0bcb8, peer_id=0) at prov/shm/src/smr_util.c:490
#5  0x000055625af012e3 in smr_map_unmap (rbmap=0x55625be0bcd8, node=0x55625be93420, context=0x0) at prov/shm/src/smr_util.c:601
#6  0x000055625af1ea4f in ofi_rbmap_foreach (map=0x55625be0bcd8, root=0x55625be93420, func=0x55625af0126c <smr_map_unmap>, context=0x0) at src/tree.c:116
#7  0x000055625afaf0be in smr_map_cleanup (map=0x55625be0bcb8) at prov/shm/src/smr_av.c:72
#8  0x000055625afaf183 in smr_av_close (fid=0x55625be0bbb0) at prov/shm/src/smr_av.c:94
#9  0x000055625af590fc in fi_close (fid=0x55625be0bbb0) at ./include/rdma/fabric.h:641
#10 0x000055625af60cef in efa_av_close (fid=0x55625bbcef28) at prov/efa/src/efa_av.c:821
#11 0x000055625ae4bb27 in fi_close (fid=0x55625bbcef28) at ./include/rdma/fabric.h:641
#12 0x000055625ae4c5dc in efa_unit_test_resource_destruct (resource=0x55625bbbfef0) at prov/efa/test/efa_unit_test_common.c:221
#13 0x000055625ae4abcb in efa_unit_test_mocks_teardown (state=0x55625bbbec30) at prov/efa/test/efa_unit_tests.c:44
#14 0x00007f5e4b4c862f in cmocka_run_one_test_or_fixture () from /home/ubuntu/PortaFiducia/libraries/cmocka/cmocka_install/lib/libcmocka.so.0
#15 0x00007f5e4b4c8982 in cmocka_run_one_tests () from /home/ubuntu/PortaFiducia/libraries/cmocka/cmocka_install/lib/libcmocka.so.0
#16 0x00007f5e4b4c8e33 in _cmocka_run_group_tests () from /home/ubuntu/PortaFiducia/libraries/cmocka/cmocka_install/lib/libcmocka.so.0
#17 0x000055625ae4ae36 in main () at prov/efa/test/efa_unit_tests.c:206
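
The assertion in this backtrace encodes the locking rule from the commit message: the unmap helper no longer takes the map lock itself and expects the caller to hold it. A minimal sketch of that convention follows, assuming a hypothetical `tracked_lock` wrapper around a pthread mutex; libfabric's own `ofi_spin_held` check is not reproduced here. `peer_map`, `region_unmap_locked`, and `map_cleanup` are illustrative names, not provider functions.

```c
#include <assert.h>
#include <pthread.h>

#define MAX_PEERS 4

/* Hypothetical lock wrapper that remembers whether it is currently held.
 * The flag is a single-threaded simplification for the sketch. */
struct tracked_lock {
	pthread_mutex_t mutex;
	int held;
};

static void tracked_lock_acquire(struct tracked_lock *lock)
{
	pthread_mutex_lock(&lock->mutex);
	lock->held = 1;
}

static void tracked_lock_release(struct tracked_lock *lock)
{
	lock->held = 0;
	pthread_mutex_unlock(&lock->mutex);
}

struct peer_map {
	struct tracked_lock lock;
	int mapped[MAX_PEERS];      /* 1 if the peer at that index is mapped */
};

static void peer_map_init(struct peer_map *map)
{
	int i;

	pthread_mutex_init(&map->lock.mutex, NULL);
	map->lock.held = 0;
	for (i = 0; i < MAX_PEERS; i++)
		map->mapped[i] = 0;
}

/* Helper that does NOT lock: the caller must already hold map->lock. */
static void region_unmap_locked(struct peer_map *map, int peer_id)
{
	assert(map->lock.held);
	map->mapped[peer_id] = 0;
}

/* Cleanup path: take the lock once, then call the helper for each peer. */
static int map_cleanup(struct peer_map *map)
{
	int i, count = 0;

	tracked_lock_acquire(&map->lock);
	for (i = 0; i < MAX_PEERS; i++) {
		if (map->mapped[i]) {
			region_unmap_locked(map, i);
			count++;
		}
	}
	tracked_lock_release(&map->lock);
	return count;
}
```

Calling `region_unmap_locked` without first acquiring the lock trips the assertion, which is the failure mode shown in the backtrace. Having the helper lock internally would avoid that, but it can deadlock on a non-recursive lock when a caller such as the av removal path already holds it.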

@zachdworkin
Contributor Author

@shijin-aws can you share the latest failure? I fixed the previous one

@shijin-aws
Contributor

It's the same segfault in MPI run again ....

# Synchronization: MPI_Win_lock/unlock
# Size          Latency (us)
1                     136.28
2                     136.17
4                     136.26
8                     136.25
16                    136.75
32                    136.87
64                    136.95
128                   136.91
256                   137.14
512                   137.74
1024                  139.07
2048                  141.75
4096                  147.91
8192                  199.27
16384                 301.73
32768                 508.15
65536                1011.72
131072               2019.14
262144               4037.29
524288               8081.29
1048576             16182.25
2097152             32382.66
4194304             64891.79
[ip-172-31-37-191:19441] *** Process received signal ***
[ip-172-31-37-191:19441] Signal: Segmentation fault (11)
[ip-172-31-37-191:19441] Signal code: Address not mapped (1)
[ip-172-31-37-191:19441] Failing at address: 0x7f4f8523f004
[ip-172-31-37-191:19441] [ 0] /lib64/libpthread.so.0(+0x118e0)[0x7f4f988fe8e0]
[ip-172-31-37-191:19441] [ 1] /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10364/install/libfabric/lib/libfabric.so.2(+0xbd649)[0x7f4f87551649]
[ip-172-31-37-191:19441] [ 2] /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10364/install/libfabric/lib/libfabric.so.2(+0xbd91c)[0x7f4f8755191c]
[ip-172-31-37-191:19441] [ 3] /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10364/install/libfabric/lib/libfabric.so.2(+0xbbabf)[0x7f4f8754fabf]
[ip-172-31-37-191:19441] [ 4] /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10364/install/libfabric/lib/libfabric.so.2(+0x6a02f)[0x7f4f874fe02f]
[ip-172-31-37-191:19441] [ 5] /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10364/install/libfabric/lib/libfabric.so.2(+0x6a72e)[0x7f4f874fe72e]
[ip-172-31-37-191:19441] [ 6] /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10364/install/libfabric/lib/libfabric.so.2(+0x6a805)[0x7f4f874fe805]
[ip-172-31-37-191:19441] [ 7] /home/ec2-user/PortaFiducia/build/libraries/openmpi/v4.1.x/install/lib/openmpi/mca_btl_ofi.so(mca_btl_ofi_finalize+0xbd)[0x7f4f879f119d]
[ip-172-31-37-191:19441] [ 8] /home/ec2-user/PortaFiducia/build/libraries/openmpi/v4.1.x/install/lib/libopen-pal.so.40(+0x6fee4)[0x7f4f98050ee4]
[ip-172-31-37-191:19441] [ 9] /home/ec2-user/PortaFiducia/build/libraries/openmpi/v4.1.x/install/lib/libopen-pal.so.40(mca_base_framework_close+0x5d)[0x7f4f98038ecd]
[ip-172-31-37-191:19441] [10] /home/ec2-user/PortaFiducia/build/libraries/openmpi/v4.1.x/install/lib/libopen-pal.so.40(mca_base_framework_close+0x5d)[0x7f4f98038ecd]
[ip-172-31-37-191:19441] [11] /home/ec2-user/PortaFiducia/build/libraries/openmpi/v4.1.x/install/lib/libmpi.so.40(ompi_mpi_finalize+0x749)[0x7f4f98b5c819]
[ip-172-31-37-191:19441] [12] /home/ec2-user/PortaFiducia/build/workloads/omb/openmpi-v4.1.7rc1-v4.1.x/install/libexec/osu-micro-benchmarks/mpi/one-sided/osu_get_acc_latency[0x4029dc]
[ip-172-31-37-191:19441] [13] /lib64/libc.so.6(__libc_start_main+0xea)[0x7f4f9856113a]
[ip-172-31-37-191:19441] [14] /home/ec2-user/PortaFiducia/build/workloads/omb/openmpi-v4.1.7rc1-v4.1.x/install/libexec/osu-micro-benchmarks/mpi/one-sided/osu_get_acc_latency[0x402ada]
[ip-172-31-37-191:19441] *** End of error message ***
--------------------------------------------------------------------------

@zachdworkin
Contributor Author

Can you run this again with debug logging? I can't reproduce this issue with my ompi workaround patch

@shijin-aws
Contributor

shijin-aws commented Oct 16, 2024

Backtrace with debug build

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f4547b07420 in smr_unmap_region (prov=0x7f4547bfb260 <smr_prov>, map=0x55c1c703b378, peer_id=0) at prov/shm/src/smr_util.c:506
506             if (peer_region->pid == getpid())
[Current thread is 1 (Thread 0x7f454d773fc0 (LWP 1545539))]
(gdb) bt
#0  0x00007f4547b07420 in smr_unmap_region (prov=0x7f4547bfb260 <smr_prov>, map=0x55c1c703b378, peer_id=0) at prov/shm/src/smr_util.c:506
#1  0x00007f4547b079e1 in smr_map_unmap (rbmap=0x55c1c703b398, node=0x55c1c715b490, context=0x0) at prov/shm/src/smr_util.c:601
#2  0x00007f4547b07acf in smr_map_del (map=0x55c1c703b378, shm_id=0) at prov/shm/src/smr_util.c:619
#3  0x00007f4547b04800 in smr_av_remove (av_fid=0x55c1c703b270, fi_addr=0x7f4537e33258, count=1, flags=0) at prov/shm/src/smr_av.c:220
#4  0x00007f4547a69a6b in fi_av_remove (av=0x55c1c703b270, fi_addr=0x7f4537e33258, count=1, flags=0) at ./include/rdma/fi_domain.h:531
#5  0x00007f4547a6d38b in efa_conn_rdm_deinit (av=0x55c1c70371e0, conn=0x7f4537e33220) at prov/efa/src/efa_av.c:358
#6  0x00007f4547a70a69 in efa_conn_release (av=0x55c1c70371e0, conn=0x7f4537e33220) at prov/efa/src/efa_av.c:555
#7  0x00007f4547a714b4 in efa_av_close_reverse_av (av=0x55c1c70371e0) at prov/efa/src/efa_av.c:794
#8  0x00007f4547a71591 in efa_av_close (fid=0x55c1c7037228) at prov/efa/src/efa_av.c:811
#9  0x00007f454c0c1000 in mca_btl_ofi_finalize () from /opt/amazon/openmpi/lib/openmpi/mca_btl_ofi.so
#10 0x00007f454d8fcddd in ?? () from /opt/amazon/openmpi/lib/libopen-pal.so.40
#11 0x00007f454d8e488c in mca_base_framework_close () from /opt/amazon/openmpi/lib/libopen-pal.so.40
#12 0x00007f454d8e488c in mca_base_framework_close () from /opt/amazon/openmpi/lib/libopen-pal.so.40
#13 0x00007f454dc46265 in ompi_mpi_finalize () from /opt/amazon/openmpi/lib/libmpi.so.40
#14 0x000055c1c603cd39 in main (argc=<optimized out>, argv=<optimized out>) at osu_get_acc_latency.c:115

@shijin-aws
Contributor

shijin-aws commented Oct 16, 2024

The segfault happens when accessing pid in the peer_region

(gdb) p peer_region
$1 = (struct smr_region *) 0x7f4544175000
(gdb) p peer_region->pid
Cannot access memory at address 0x7f4544175004

Is peer_region freed or corrupted earlier?
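
A fault on a field access like this is what a stale pointer to an already-unmapped region looks like, although nothing in this thread establishes that as the cause here. As a generic illustration only, the sketch below clears the stored pointer when a region is unmapped, so that a second cleanup pass becomes a no-op instead of dereferencing unmapped memory. `region_hdr`, `peer_slot`, `slot_map`, and `slot_unmap` are hypothetical names, and the anonymous mapping stands in for a shared-memory region.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical region header; only the pid field matters for the sketch. */
struct region_hdr {
	int version;
	int pid;
};

/* Hypothetical per-peer slot holding the mapped region pointer. */
struct peer_slot {
	struct region_hdr *region;
	size_t size;
};

/* Map an anonymous region for the slot; returns 0 on success, -1 on failure. */
static int slot_map(struct peer_slot *slot, size_t size, int pid)
{
	void *addr;

	assert(slot && size >= sizeof(struct region_hdr));
	addr = mmap(NULL, size, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (addr == MAP_FAILED)
		return -1;
	slot->region = addr;
	slot->size = size;
	slot->region->pid = pid;
	return 0;
}

/* Unmap the slot's region. Returns 1 if a region was unmapped, 0 if the
 * slot was already empty, -1 if munmap failed. */
static int slot_unmap(struct peer_slot *slot)
{
	if (!slot->region)
		return 0;           /* already unmapped: nothing to dereference */
	if (munmap(slot->region, slot->size))
		return -1;
	slot->region = NULL;        /* make later cleanup passes harmless */
	slot->size = 0;
	return 1;
}
```

With this pattern, a path that removes a peer and a later path that closes the whole map can both call `slot_unmap` safely. Whether the provider needs such a guard depends on where the double cleanup actually originates.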

This function is mainly for the niche case where on progress_connreq
a peer is added to the map with its region needing to be mapped, and
then after mapping it, it's discovered that the newly mapped peer's
process died. In this case we need to unmap them and free any resources
that were opened for communicating with them.

Remove the lock from the map_to_region and unmap_region functions and require
callers to acquire the lock before calling them. This is necessary because
on the av removal path the map would be locked twice if these functions also
locked it. The map_to_region function is updated to mirror this
policy.

Signed-off-by: Zach Dworkin <[email protected]>
@j-xiong j-xiong merged commit 331b425 into ofiwg:main Oct 24, 2024
13 checks passed