(failed) attempt to disable atomics
+mpi_args=(
+ --mca osc_rdma_acc_use_amo false
+)
USC Discovery cluster usage snapshot showing slowdown in prefix build job:
https://snapshot.raintank.io/dashboard/snapshot/cEAWF2Rs3iqrI5HZd7oS96b12fINZqR9
enable debug output in PMIX:
PMIX_DEBUG=5
(and set --enable-debug, i.e. debug use flag)
Discovery: mpirun args for TCP over the InfiniBand IP interface (ib0)
psbatch ~/ca2/casper/gpref/gp-broadwell discovery broadwell "" all 2 16 00:15:00 mpirun -n 2 --map-by node --mca btl_base_verbose 100 --mca pml_base_verbose 100 --mca pml_ucx_verbose 100 --mca btl self,tcp --mca pml_ucx_priority 0 --mca btl_tcp_if_include ib0 python $PWD/fch.py 32 test.csv
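The transport-selection part of the command above is easier to audit when the flags are collected in one place. A minimal sketch, using only the MCA parameters that already appear in that command; the function name tcp_over_ib_args is made up for illustration, and ib0 is the interface name on Discovery (it may differ on other clusters):

```shell
#!/usr/bin/env bash
# Sketch: collect the MCA flags that steer OpenMPI onto TCP over the
# IP-over-InfiniBand interface. The interface name is a parameter because
# 'ib0' is specific to the cluster in these notes.
tcp_over_ib_args() {
    local iface="${1:-ib0}"
    local args=(
        --map-by node                        # place ranks round-robin across nodes
        --mca btl self,tcp                   # allow only the loopback and TCP BTLs
        --mca pml_ucx_priority 0             # lower the UCX PML priority so it is not preferred
        --mca btl_tcp_if_include "${iface}"  # restrict TCP to this interface
    )
    printf '%s\n' "${args[*]}"
}

# Print the command instead of running it, so it can be inspected first.
echo mpirun -n 2 $(tcp_over_ib_args ib0) python fch.py 32 test.csv
```

The verbose flags (btl_base_verbose, pml_base_verbose, pml_ucx_verbose) are left out of the sketch on purpose: they only affect logging, not which transport is selected.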
OpenMPI OSU benchmarks: TCP (within prefix) vs native (uGNI):
(NOTE: similar results with openmpi built with --debug and without)
Profiled on debug queue for 2 nodes
latency: 3-5x slower
bandwidth: 5-10x less (tcp: ~1000 MB/s max)
uGNI (within prefix) vs native (uGNI):
latency: prefix is up to 40% lower (good) for smaller msg sizes
bandwidth: prefix is up to 20% less (bad) for smaller msg sizes
About SLURM + OpenMPI:
----------------------
There are two very different ways of launching an MPI binary (i.e. binary linked against libopenmpi.so):
A. srun ./mybinary
B. mpirun ./mybinary
Case A is when SLURM is in charge of loading and managing the processes, and
establishes a channel between the manager and the libopenmpi.so inside the
binary, which is used by a process (one of the loaded binaries) to query the
manager about the platform; the protocol of this channel is PMI{1,2,x}, and
HPCC's SLURM is built only with PMI{1,2} enabled (although its version is new
enough to enable PMIx, it's just not enabled). To make this channel, the
libopenmpi.so needs to be built/linked against libpmi{,2}.so provided by the
respective manager (SLURM, in this case). I did all that, and srun ./mybinary
does work on hello world, and does start fenics. The later segfault in fenics
is somehow related to this MPI infrastructure (because it doesn't happen in
case B), but it's not because of the channel not working at all; it might be
because SLURM and my build of libpmi2.so (from the same SLURM version) were
built with very different GCC versions. In any case, we don't care about this
segfault, nor about case A beyond curiosity, because there is case B.
Case B is when the 'orted' runtime system provided by OpenMPI (invoked by the
mpirun tool) is in charge of launching and managing the processes from MPI
binaries. So, the big question was: is SLURM still involved and if so, is
libopenmpi.so talking to SLURM via PMI? The answers are YES to the first, but
NO to the second. SLURM is (and has to be) involved (beyond just passing a
nodelist in the environment) because it's the only daemon running on remote
nodes (SSH is running but is not accessible/usable), so without SLURM there is
no way to request that anything at all happens on a remote node. So,
libopenmpi.so reads the hostlist from the environment (passed by SLURM) and
then launches its own orted daemon on each remote host by invoking, via
fork+exec, essentially: 'srun --nodelist /usr/bin/orted callback-ip=MY_IP'.
So, openmpi bootstraps a cluster managed by itself (still respecting SLURM's
allocation, because srun would refuse a request outside of the allocation);
this bootstrapping is what --with-slurm in the openmpi build config controls
(utterly undocumented), and that is how it relates to
--with-pmi=libpmi2(slurm).so. Once bootstrapped, the MPI binary (application)
talks to mpirun(orted), not to SLURM -- that's the essential difference from
Case A. The protocol happens to be PMIx, and libopenmpi.so has to be (and is
by default) linked with libpmix.so, but this is orthogonal to linking
libopenmpi.so with libpmi2.so as needed in Case A; the PMI channel in case B
doesn't involve SLURM: it's a channel between the app binary and
mpirun(orted), so the PMI version that SLURM supports is not relevant.
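To check from inside a job which of the two cases a process is running under, the environment can be inspected. A rough sketch, resting on two assumptions not stated in these notes: that mpirun/orted exports OMPI_COMM_WORLD_RANK to the ranks it launches, and that srun exports SLURM_PROCID to the tasks it launches. The function name detect_launcher is made up for illustration:

```shell
#!/usr/bin/env bash
# Sketch: guess which launcher started this process from its environment.
# Assumption: OMPI_COMM_WORLD_RANK is set by mpirun/orted (Case B) and
# SLURM_PROCID is set by srun (Case A).
# OMPI_COMM_WORLD_RANK is checked first because in Case B the ranks may
# also inherit SLURM_* variables from the orted that srun started.
detect_launcher() {
    if [ -n "${OMPI_COMM_WORLD_RANK:-}" ]; then
        echo "mpirun"   # Case B: the process manager is mpirun/orted
    elif [ -n "${SLURM_PROCID:-}" ]; then
        echo "srun"     # Case A: the process manager is SLURM
    else
        echo "none"     # started directly, under neither launcher
    fi
}

echo "launcher: $(detect_launcher)"
```

Running this as the job step (e.g. srun ./detect.sh versus mpirun ./detect.sh, with detect.sh being a hypothetical file holding the sketch) should be enough to tell the two launch paths apart, if the assumptions hold on the system in question.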
SLURM config file on USC HPCC:
/var/spool/slurm/slurm.conf
mpirun -mca plm_base_verbose 9 -n 2 -N 1 ./mpitest
mpirun -mca oob_base_verbose 9 -n 2 -N 1 ./mpitest
for node allocation detection (ALPS and SLURM):
--mca ras_base_verbose 100
--mca plm_base_verbose 100
# # To run htop
# export SCREEN_DIR=screen/$(hostname)
# mkdir -p -m700 $SCREEN_DIR
# screen
slurm
* code that parses --constraint
src/slurmctld/job_mgr.c: build_feature_list
TODO: OpenMPI: paths in error message about PMI are broken
checking if user requested PMI support... yes
checking for pmi.h in /usr/include/slurm... found
checking pmi.h usability... yes
checking pmi.h presence... yes
checking for pmi.h... yes
checking for libpmi in /usr... not found
checking for pmi2.h in /usr/include/slurm... found
checking pmi2.h usability... yes
checking pmi2.h presence... yes
checking for pmi2.h... yes
checking for libpmi2 in /usr... not found
checking for pmix.h in /usr/include/slurm... not found
checking for pmix.h in /usr/include/slurm/include... not found
checking can PMI support be built... no
configure: WARNING: PMI support requested (via --with-pmi) but neither pmi.h,
configure: WARNING: pmi2.h or pmix.h were found under locations:
configure: WARNING: /usr/include/slurm
configure: WARNING: /usr/include/slurm/slurm
configure: WARNING: Specified path: /usr/include/slurm
configure: WARNING: OR neither libpmi, libpmi2, or libpmix were found under:
configure: WARNING: /usr/lib
configure: WARNING: /usr/lib64
configure: WARNING: Specified path: /usr
configure: error: Aborting
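In the log above the headers are found but the libraries are not, and configure only reports looking under /usr/lib and /usr/lib64. Before rerunning configure it can help to check where the PMI libraries actually live, so that the right directory can be passed via --with-pmi-libdir. A small sketch; the function name find_pmi_libs is made up for illustration, and the library file names are assumed from the names in the configure messages:

```shell
#!/usr/bin/env bash
# Sketch: report which of the given directories contain a PMI client library.
# Usage: find_pmi_libs DIR...
# Prints one "<dir>/<file>" line per match; returns non-zero if none is found.
find_pmi_libs() {
    local found=1 dir name
    for dir in "$@"; do
        for name in libpmi.so libpmi2.so libpmix.so; do
            if [ -e "${dir}/${name}" ]; then
                printf '%s\n' "${dir}/${name}"
                found=0
            fi
        done
    done
    return "${found}"
}

# The two directories configure reported searching in the log above.
find_pmi_libs /usr/lib /usr/lib64 \
    || echo "no PMI library in /usr/lib or /usr/lib64"
```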
UCX
[nid03831:97476:0:97476] ugni_udt_ep.c:204 Assertion `msg_length <= 128' failed: msg_length=148
/lus/theta-fs0/projects/CASPER/acolin/gpref/gp-knl/var/tmp/portage/sys-cluster/ucx-1.9.0/work/ucx-1.9.0/src/uct/ugni/udt/ugni_udt_ep.c: [ uct_ugni_udt_ep_am_common_send() ]
...
201 sheader->type = UCT_UGNI_UDT_PAYLOAD;
202
203 ucs_assertv(msg_length <= GNI_DATAGRAM_MAXSIZE, "msg_length=%u", msg_length);
==> 204
205 uct_ugni_cdm_lock(&iface->super.cdm);
206 ugni_rc = GNI_EpPostDataWId(ep->super.ep,
207 sheader, msg_length,
if use cray_pmi; then
    local host_pc_path="/usr/$(get_libdir)/pkgconfig"
    local pmi_path=$($(tc-getPKG_CONFIG) \
        --with-path="${host_pc_path}" --cflags-only-I \
        cray-pmi | sed 's/.*-I\(\S*pmi\S*\).*/\1/g')
    local pmi_libdir=$($(tc-getPKG_CONFIG) \
        --with-path="${host_pc_path}" --libs-only-L \
        cray-pmi | sed 's/-L//g')
elif use pmi; then
    local pmi_path="${EPREFIX}/usr/include"
    local pmi_libdir="${EPREFIX}/usr/$(get_libdir)"
fi
if use pmi; then
    local pmi_path pmi_libdir
    # If have pmi.pc, then use it, otherwise, default location.
    # The .pc files may point to outside of EPREFIX in the case
    # when using proprietary binary libs installed in the host.
    if $(tc-getPKG_CONFIG) --exists pmi; then
        # ./configure wants one include path and one
        # library path but pkg-config gives a set of
        # include/link flags. If we do get many paths,
        # then, pick the one with 'pmi' (this is the case
        # for sys-cluster/cray-libs), otherwise, just grab
        # the first one.
        local pmi_incs pmi_lds
        pmi_incs="$($(tc-getPKG_CONFIG) --cflags-only-I pmi)"
        pmi_lds="$($(tc-getPKG_CONFIG) --libs-only-L pmi)"
        if [[ "${pmi_incs}" =~ pmi ]]; then
            pmi_path="$(echo "${pmi_incs}" \
                | sed 's/.*-I\(\S*pmi\S*\).*/\1/g')"
            # Library dir comes from the -L flags (pmi_lds), not the -I flags.
            pmi_libdir="$(echo "${pmi_lds}" \
                | sed 's/.*-L\(\S*pmi\S*\).*/\1/g')"
        else
            # Keep only the first path and drop everything after it.
            pmi_path="$(echo "${pmi_incs}" \
                | sed 's/^\s*-I\(\S\+\).*/\1/')"
            pmi_libdir="$(echo "${pmi_lds}" \
                | sed 's/^\s*-L\(\S\+\).*/\1/')"
        fi
    else
        pmi_path="${EPREFIX}/usr/include"
        pmi_libdir="${EPREFIX}/usr/lib64"
    fi
fi
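The "pick the path containing 'pmi', otherwise the first one" rule described in the comments above can also be written without sed. This is a plain-bash sketch of that rule, not the ebuild's code: the function name pick_flag_dir and the sample paths are made up, and unlike the greedy sed pattern it returns the first (not the last) path containing 'pmi':

```shell
#!/usr/bin/env bash
# Sketch: reduce a pkg-config flag string to a single directory.
# Usage: pick_flag_dir PREFIX FLAGS   (PREFIX is -I or -L)
# Prefers the first directory whose path contains 'pmi'; otherwise falls back
# to the first directory carrying the given prefix. Prints nothing if none.
pick_flag_dir() {
    local prefix="$1" flags="$2" word dir first=""
    for word in ${flags}; do    # intentional word splitting on whitespace
        case "${word}" in
            "${prefix}"*)
                dir="${word#"${prefix}"}"
                case "${dir}" in
                    *pmi*) printf '%s\n' "${dir}"; return 0 ;;
                esac
                if [ -z "${first}" ]; then
                    first="${dir}"
                fi
                ;;
        esac
    done
    if [ -n "${first}" ]; then
        printf '%s\n' "${first}"
    fi
    return 0
}

# Hypothetical flag strings, for illustration only.
pick_flag_dir -I "-I/opt/cray/include -I/opt/cray/pmi/include"
pick_flag_dir -L "-L/usr/lib64 -L/opt/lib"
```

The flag string is split on whitespace, so paths containing spaces or glob characters are not handled.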
# TODO: libfabric linked against these, but when we link
# against libfabric, we fail. Use 'rpath' when building libfabric?
if use openmpi_fabrics_ofi; then
    local gni_libs=(cray-ugni cray-xpmem cray-alpsutil cray-alpslli
        cray-udreg cray-wlm_detect)
    $(tc-getPKG_CONFIG) --exists "${gni_libs[@]}" || die
    # TODO: the -L is extra-workaround... somehow the build
    # is not looking for -lpmi in path passed to
    # --with-pmi-libdir
    local gni_ldflags="$($(tc-getPKG_CONFIG) \
        --libs-only-L "${gni_libs[@]}" \
        | sed 's/-L\(\S\+\)/-L\1 -Wl,-rpath-link -Wl,\1/g')"
    local gni_cflags="$($(tc-getPKG_CONFIG) \
        --cflags-only-I "${gni_libs[@]}")"
fi
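The sed expression in the snippet above turns every -L<dir> into -L<dir> plus a matching -rpath-link entry, so the linker can also resolve the libraries that libfabric itself depends on. The same transformation, sketched in plain bash so the effect is easy to see on a sample string; the function name add_rpath_link and the sample paths are made up for illustration:

```shell
#!/usr/bin/env bash
# Sketch: for every -L<dir> in a flag string, also emit
# -Wl,-rpath-link -Wl,<dir>; all other words pass through unchanged.
add_rpath_link() {
    local word out=""
    for word in $1; do    # intentional word splitting on whitespace
        case "${word}" in
            -L?*) out="${out} ${word} -Wl,-rpath-link -Wl,${word#-L}" ;;
            *)    out="${out} ${word}" ;;
        esac
    done
    printf '%s\n' "${out# }"    # drop the leading space
}

# Hypothetical library directories, for illustration only.
add_rpath_link "-L/opt/cray/ugni/lib64 -L/opt/cray/xpmem/lib64"
```

Note that -rpath-link only affects link-time resolution; it does not embed a runtime search path in the output, which is what the 'rpath' idea in the TODO would do.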
Closing of FDs in OpenMPI's ODLS breaks OFI-with-uGNI transport
[nid03833:216246] mca:base:select: Auto-selecting mtl components
[nid03833:216246] mca:base:select:( mtl) Querying component [ofi]
[nid03833:216246] mca:base:select:( mtl) Query of component [ofi] set priority to 25
[nid03833:216246] mca:base:select:( mtl) Selected component [ofi]
[nid03833:216246] select: initializing mtl component ofi
[nid03838:156776] /tmp/acolin-gp-knl/portage/sys-cluster/openmpi-4.0.3-r1/work/openmpi-4.0.3/ompi/mca/mtl/ofi/mtl_ofi_component.c:315: mtl:ofi:provider_include = "(null)"
[nid03838:156776] /tmp/acolin-gp-knl/portage/sys-cluster/openmpi-4.0.3-r1/work/openmpi-4.0.3/ompi/mca/mtl/ofi/mtl_ofi_component.c:318: mtl:ofi:provider_exclude = "shm,sockets,tcp,udp,rstream"
[nid03838:156776] /tmp/acolin-gp-knl/portage/sys-cluster/openmpi-4.0.3-r1/work/openmpi-4.0.3/ompi/mca/mtl/ofi/mtl_ofi_component.c:347: mtl:ofi:prov: gni
[nid03833:216246] /tmp/acolin-gp-knl/portage/sys-cluster/openmpi-4.0.3-r1/work/openmpi-4.0.3/ompi/mca/mtl/ofi/mtl_ofi_component.c:315: mtl:ofi:provider_include = "(null)"
[nid03833:216246] /tmp/acolin-gp-knl/portage/sys-cluster/openmpi-4.0.3-r1/work/openmpi-4.0.3/ompi/mca/mtl/ofi/mtl_ofi_component.c:318: mtl:ofi:provider_exclude = "shm,sockets,tcp,udp,rstream"
[nid03833:216246] /tmp/acolin-gp-knl/portage/sys-cluster/openmpi-4.0.3-r1/work/openmpi-4.0.3/ompi/mca/mtl/ofi/mtl_ofi_component.c:347: mtl:ofi:prov: gni
--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_endpoint). This is highly
unusual; your job may behave unpredictably (and/or abort) after this.
Local host: nid03838
Location: /tmp/acolin-gp-knl/portage/sys-cluster/openmpi-4.0.3-r1/work/openmpi-4.0.3/ompi/mca/mtl/ofi/mtl_ofi_component.c:629
Error: Invalid argument (22)
--------------------------------------------------------------------------
[nid03838:156776] select: init returned failure for component ofi
[nid03838:156776] select: no component selected
[nid03838:156776] select: init returned failure for component cm