pre-g0: mismatched send-recv warnings at end of execution #179

Open
cwsmith opened this issue Apr 11, 2024 · 0 comments

Hello,

On the Purdue Anvil system (CPUs only), I'm hitting OpenMPI UCX (InfiniBand) warnings at the end of execution of the vm-tsw-2x2v.lua example (https://gkyl.readthedocs.io/en/latest/quickstart/inputFiles/vm-tsw-2x2v.html).

The change to run vm-tsw-2x2v.lua on multiple ranks, the output from execution including the warnings, and the job submission scripts are pasted below.

A quick search led me to this GitHub issue:

openucx/ucx#6331 (comment)

which indicates that some messages that were sent were not received before MPI_Finalize was called.

Note, I also hit these warnings on SDSC Expanse. In both cases the system install of OpenMPI was used.

change to run with multiple ranks

[email protected]:[quickstart] $ diff vm-tsw-2x2v.lua vm-tsw-2x2v_orig.lua
101c101
<    decompCuts = {5,2},                      -- Cuts in each configuration direction
---
>    decompCuts = {1,1},                      -- Cuts in each configuration direction
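
For reference, the modified decompCuts = {5,2} splits the configuration-space domain into 5 x 2 = 10 subdomains, which matches the 10 tasks requested with -n 10 in the submission script. A minimal sketch of that sanity check follows. It assumes the product of decompCuts is expected to equal the number of MPI ranks, and the helper name ranks_needed is hypothetical (it is not part of gkyl):

```python
from math import prod


def ranks_needed(decomp_cuts):
    """Return the number of subdomains implied by a decompCuts table.

    Hypothetical helper for illustration only; not part of gkyl.
    """
    return prod(decomp_cuts)


modified = (5, 2)  # decompCuts used in this report
original = (1, 1)  # decompCuts in the unmodified quickstart input file
ntasks = 10        # value passed to sbatch via -n in submit.sh

# Assumption: one MPI rank per subdomain.
assert ranks_needed(modified) == ntasks
print(ranks_needed(modified), ranks_needed(original))
```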

output

Wed Apr 10 2024 20:21:49.000000000
Gkyl built with 9663ea594e80+
Gkyl built on Apr 10 2024 16:08:26
Initializing Vlasov-Maxwell simulation ...
Initialization completed in 1.01092 sec

Starting main loop of Vlasov-Maxwell simulation ...

 Step 0 at time 0. Time step 0.0360652. Completed 0%
0123456789 Step    139 at time    5.0130661.  Time step  3.606522e-02.  Completed 10%
0123456789 Step    278 at time    10.026132.  Time step  3.606522e-02.  Completed 20%
0123456789 Step    416 at time    15.003133.  Time step  3.606522e-02.  Completed 30%
0123456789 Step    555 at time    20.016199.  Time step  3.606522e-02.  Completed 40%
0123456789 Step    694 at time    25.029265.  Time step  3.606522e-02.  Completed 50%
0123456789 Step    832 at time    30.006266.  Time step  3.606522e-02.  Completed 60%
0123456789 Step    971 at time    35.019332.  Time step  3.606522e-02.  Completed 70%
0123456789 Step   1110 at time    40.032398.  Time step  3.606522e-02.  Completed 80%
0123456789 Step   1248 at time    45.009399.  Time step  3.606522e-02.  Completed 90%
0123456789 Step   1387 at time    50.000000.  Time step  3.606522e-02.  Completed 100%
0

Total number of time-steps 1388
   Number of forward-Euler calls 5548
   Number of RK stage-2 failures 0
   Number of RK stage-3 failures 0
Solver took                                  102.58454 s   ( 0.073908 s/step)   (75.273%)
Solver BCs took                                8.60794 s   ( 0.006202 s/step)   ( 6.316%)
Field solver took                              1.14156 s   ( 0.000822 s/step)   ( 0.838%)
Field solver BCs                               0.27180 s   ( 0.000196 s/step)   ( 0.199%)
Function field solver took                     0.00000 s   ( 0.000000 s/step)   ( 0.000%)
Moment calculations took                       8.34645 s   ( 0.006013 s/step)   ( 6.124%)
Integrated moment calculations took            5.64078 s   ( 0.004064 s/step)   ( 4.139%)
Field energy calculations took                 0.04717 s   ( 0.000034 s/step)   ( 0.035%)
Collision solver(s) took                       0.00000 s   ( 0.000000 s/step)   ( 0.000%)
Collision (other) took                         0.00000 s   ( 0.000000 s/step)   ( 0.000%)
Source updaters took                           0.00000 s   ( 0.000000 s/step)   ( 0.000%)
Stepper combine/copy took                      3.44240 s   ( 0.002480 s/step)   ( 2.526%)
Forward Euler combine took                     0.00000 s   ( 0.000000 s/step)   ( 0.000%)
Time spent in barrier function                 0.21111 s   ( 0.000152 s/step)   ( 0.155%)
Data write took                                5.74470 s   ( 0.004139 s/step)   ( 4.215%)
Write restart took                             0.02960 s   ( 0.000021 s/step)   ( 0.022%)
[Unaccounted for]                              6.19993 s   ( 0.004467 s/step)   ( 4.549%)

Main loop completed in                       136.28258 s   ( 0.098186 s/step)   (   100%)

Wed Apr 10 2024 20:24:07.000000000
[1712795047.136074] [a877:220516:0]       tag_match.c:62   UCX  WARN  unexpected tag-receive descriptor 0x1aaadc0 was not matched
[1712795047.136109] [a877:220516:0]       tag_match.c:62   UCX  WARN  unexpected tag-receive descriptor 0x1aab0c0 was not matched
<.... snip ....  UCX  WARN  unexpected tag-receive appears 21 times>
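
To see which processes emit the warnings, the job output can be tallied per process. The sketch below is a minimal example, assuming the bracketed field after the timestamp is host:pid:thread and that every warning follows the format of the two lines quoted above (other UCX versions may format them differently). The function name count_unmatched is hypothetical:

```python
import re
from collections import Counter

# Pattern modeled on the two warning lines quoted in this report.
UCX_WARN = re.compile(
    r"^\[(?P<time>\d+\.\d+)\]\s+"
    r"\[(?P<host>[^:\]]+):(?P<pid>\d+):(?P<tid>\d+)\]\s+"
    r"tag_match\.c:\d+\s+UCX\s+WARN\s+"
    r"unexpected tag-receive descriptor (?P<desc>0x[0-9a-f]+) was not matched"
)


def count_unmatched(lines):
    """Count 'unexpected tag-receive' warnings per (host, pid)."""
    counts = Counter()
    for line in lines:
        m = UCX_WARN.match(line.strip())
        if m:
            counts[(m["host"], int(m["pid"]))] += 1
    return counts


# The two warning lines and one non-warning line copied from the output above.
sample = [
    "Wed Apr 10 2024 20:24:07.000000000",
    "[1712795047.136074] [a877:220516:0]       tag_match.c:62   UCX  WARN  "
    "unexpected tag-receive descriptor 0x1aaadc0 was not matched",
    "[1712795047.136109] [a877:220516:0]       tag_match.c:62   UCX  WARN  "
    "unexpected tag-receive descriptor 0x1aab0c0 was not matched",
]

for (host, pid), n in count_unmatched(sample).items():
    print(f"{host} pid {pid}: {n} unmatched tag-receive warnings")
```

Running it over the full job output (for example, by passing an open file object as lines) would show whether the warnings come from a single rank or from several.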

job scripts

slurm submission script

[email protected]:[quickstart] $ cat submit.sh 
#!/bin/bash -ex 
opts="[email protected] --mail-type=ALL"
sbatch $opts -p wholenode -A ##### -n 10 -N 1 -t 10 ./vmTsw.sh

run script

[email protected]:[quickstart] $ cat vmTsw.sh
#!/bin/bash
gkyl=/anvil/projects/########/cws/gkyllPreG0Dev/gkeyllSoftCpu/bin/gkyl
srun -n ${SLURM_NPROCS} $gkyl vm-tsw-2x2v.lua

two small changes to gkyl pre-g0

One updates the ADIOS download URL (needed before #178 was merged), and the other makes waf pick up the correct Python version.

x-cwsmith@login07: /anvil/projects/x-phy220105/cws/gkyllPreG0Dev/gkyl (pre-g0)$ git diff
diff --git a/install-deps/build-adios.sh b/install-deps/build-adios.sh
index 280082b8..49f5129d 100755
--- a/install-deps/build-adios.sh
+++ b/install-deps/build-adios.sh
@@ -8,7 +8,7 @@ PREFIX=$GKYLSOFT/adios-1.13.1
 # delete old checkout and builds
 rm -rf adios-1.13.1.tar* adios-1.13.1

-curl -L http://users.nccs.gov/~pnorbert/adios-1.13.1.tar.gz > adios-1.13.1.tar.gz
+curl -L https://users.nccs.gov/~pnorbert/adios-1.13.1.tar.gz > adios-1.13.1.tar.gz
 gunzip adios-1.13.1.tar.gz
 tar -xvf adios-1.13.1.tar
 cd adios-1.13.1
diff --git a/waf b/waf
index 7ceee167..df825656 100755
--- a/waf
+++ b/waf
@@ -1,4 +1,4 @@
-#!/usr/bin/env python
+#!/usr/bin/env python3.12
 # encoding: latin-1
 # Thomas Nagy, 2005-2018
 #