Pradeep was seeing signal 9 (SIGKILL) termination of the servers on Summit when running with 64 GB per timestep per node. It looked like it happened on the 3rd timestep, so at either 192 or 256 GB of staged data. This is earlier than expected, since Summit has 512 GB per node. This might be a system limitation, but we should confirm that we aren't doing something silly, like allocating 2x the memory or failing to signal libfabric or margo to release buffers.
Useful job script excerpt:
# disable MR cache in libfabric; still problematic as of libfabric 1.4.1
export FI_MR_CACHE_MAX_COUNT=0
# use shared recv context in RXM; should improve scalability
export FI_OFI_RXM_USE_SRX=1
rm -rf conf.ds
## Create dataspaces configuration file
echo "## Config file for DataSpaces
ndim = 3
dims = 32768, 32768, 32768
max_versions = 4
max_readers = 128
lock_type = 2
num_apps = 2
" > dataspaces.conf
TD=32768
NS=2
NW=64
NR=4
let "DR=TD/NR"
let "DW=TD/NW"
# Note that we explicitly specify the libfabric domain of "mlx5_0" on
# Summit. Otherwise libfabric and/or libibverbs may select a default port
# that does not work out of the box.
jsrun -n $NS -a 1 -r $NS ./dspaces_server verbs://mlx5_0 >& server_"$NS"_"$NW"_"$NR".log &
serverproc=$!
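# Run the writer in the background so that $! below is the writer's PID
# and the reader can run concurrently with it.
jsrun -n $NW -a 1 -r 32 ./test_writer 3 1 1 $NW 512 512 $DW 4 -s 8 -m server >& writer_"$NS"_"$NW"_"$NR".log &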
writerproc=$!
jsrun -n $NR -a 1 -r $NR ./test_reader 3 1 1 $NR 512 512 $DR 4 -s 8 -t >& reader_"$NS"_"$NW"_"$NR".log
wait $writerproc
wait $serverproc
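
As a sanity check on the numbers above, here is a rough sketch of the expected staged volume derived from the script parameters. It assumes the test_writer arguments are the per-rank block size (512 x 512 x $DW elements) and that -s 8 means 8 bytes per element. It also assumes both servers land on a single node (-n 2 -r 2) and that max_versions = 4 bounds how many timesteps are retained. None of this is confirmed here, and the variable names are just for illustration.

```shell
#!/usr/bin/env bash
# Back-of-the-envelope estimate of staged data per server node.
# All values mirror the job script; their interpretation is an assumption.
NW=64                 # writer ranks
TD=32768
DW=$(( TD / NW ))     # 512, as computed in the job script
ELEM_SIZE=8           # assumed: -s 8 is bytes per element
MAX_VERSIONS=4        # from dataspaces.conf
SERVER_NODES=1        # assumed: both servers share one node (-n 2 -r 2)

GIB=$(( 1024 * 1024 * 1024 ))
bytes_per_writer=$(( 512 * 512 * DW * ELEM_SIZE ))
gib_per_ts=$(( bytes_per_writer * NW / GIB ))
gib_per_ts_node=$(( gib_per_ts / SERVER_NODES ))
gib_retained=$(( gib_per_ts_node * MAX_VERSIONS ))

echo "per timestep per node: ${gib_per_ts_node} GiB"
echo "retained with max_versions=${MAX_VERSIONS}: ${gib_retained} GiB"
```

Under those assumptions this gives 64 GiB per timestep per node and 256 GiB retained, which should fit in 512 GB. A kill at the 3rd timestep would therefore point at extra copies or unreleased buffers rather than the raw data volume.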