Latest flux-core cannot be bootstrapped on CORAL #2684

Closed
dongahn opened this issue Jan 28, 2020 · 21 comments

Comments

@dongahn
Member

dongahn commented Jan 28, 2020

I am helping @JaeseungYeom start testing and exploring his flux-dyad on LBANN on Lassen. It seems the latest and greatest flux-core version fails to bootstrap when we use the PMIx workaround Stephen put together for other workflow teams like MuMMI.

lassen708{dahn}23: bsub -Is -nnodes 4 -XF /usr/bin/tcsh
<CUT>
lassen348{dahn}21: module load hwloc/1.11.10-cuda
lassen348{dahn}28: env PMI_LIBRARY=/usr/global/tools/pmi4pmix/blueos_3_ppc64le_ib/20191120/lib/libpmi.so jsrun -a 1 -c ALL_CPUS -g ALL_GPUS --bind=none -n 4 /usr/global/tools/flux/blueos_3_ppc64le_ib/flux-0.14.x-20200127/bin/flux start ip.sh
flux-broker: broker_pmi_get_params: invalid length argument
flux-broker: bootstrap failed

@JaeseungYeom and I confirmed that the workaround still works with the older flux version that the MuMMI team is using (flux-0.11.x-20190425). But @JaeseungYeom needs the latest and greatest version to make progress. In fact, @jameshcorbett will need the same version for his UPQ work as well.

My guess is that the keys being used by our PMI library have changed in the latest and greatest, so the workaround no longer works...

Any idea as to how to fix this quickly?

@grondo
Contributor

grondo commented Jan 28, 2020

Not sure I'm the right person to help here, but can you try with PMI_DEBUG=1 set in the environment and see if any extra clue pops out?

@garlick
Member

garlick commented Jan 28, 2020

broker_pmi_get_params() passes in a fixed buffer of 64 bytes to PMI_KVS_Get_my_name(). Maybe it is failing because that's not enough? It might be worth changing the fixed size in src/broker/pmiutil.h to say 1024 and see what happens, e.g.

diff --git a/src/broker/pmiutil.h b/src/broker/pmiutil.h
index fe188d91d..d9cffe781 100644
--- a/src/broker/pmiutil.h
+++ b/src/broker/pmiutil.h
@@ -14,7 +14,7 @@
 struct pmi_params {
     int rank;
     int size;
-    char kvsname[64];
+    char kvsname[1024];
 };

 struct pmi_handle;

If that doesn't help and traces say that's where it's failing then we may need to make that code request the length from the server and dynamically allocate the buffer.

@dongahn
Member Author

dongahn commented Jan 28, 2020

Thanks @grondo and @garlick. I will take a look.

So the keys being used by our PMI library have not changed recently? If I remember right, @SteVwonder parsed these keys and fetched the rank fields to work around a PMIx bug within his patched pmi4pmix.

@garlick
Member

garlick commented Jan 28, 2020

So the keys being used by our PMI library have not changed recently?

The error you quoted above suggests the broker is failing in PMI_KVS_Get_my_name() with PMI_ERR_INVALID_LENGTH. This failure is early, before key exchange.

The PMI code in the broker was rewritten since 0.11. The old code called PMI_KVS_Get_name_length_max(), allocated a buffer of the returned size, and used that in PMI_KVS_Get_my_name(). The new code uses a 64 byte static buffer.

I'm assuming the problem is that the CORAL PMI wants to return a long KVS name and can't. Please try the quick fix above and if that works, then we can change the title of this bug to reflect the changes needed in the broker.

@dongahn
Member Author

dongahn commented Jan 29, 2020

I applied the patch from #2684 (comment) but this version of flux now hangs. Throwing in FLUX_PMI_DEBUG=1 doesn't print out much...

lassen617{dahn}25: env FLUX_PMI_DEBUG=1 PMI_LIBRARY=/usr/global/tools/pmi4pmix/blueos_3_ppc64le_ib/20191120/lib/libpmi.so jsrun -a 1 -c ALL_CPUS -g ALL_GPUS --bind=none -n 4 /usr/global/tools/flux/blueos_3_ppc64le_ib/flux-0.14.x-20200127/bin/flux start ~/ip.sh
flux-broker: dlopen /usr/global/tools/pmi4pmix/blueos_3_ppc64le_ib/20191120/lib/libpmi.so
flux-broker: using dlopen
flux-broker: dlopen /usr/global/tools/pmi4pmix/blueos_3_ppc64le_ib/20191120/lib/libpmi.so
flux-broker: using dlopen
flux-broker: dlopen /usr/global/tools/pmi4pmix/blueos_3_ppc64le_ib/20191120/lib/libpmi.so
flux-broker: using dlopen
flux-broker: dlopen /usr/global/tools/pmi4pmix/blueos_3_ppc64le_ib/20191120/lib/libpmi.so
flux-broker: using dlopen

@garlick
Member

garlick commented Jan 29, 2020

Sorry, the broker PMI code doesn't really have much instrumentation since it doesn't go through our PMI library anymore. You might try adding -o,-v to the flux-start command line. If the boot: rank=N size=M message is printed on stderr then we're past PMI. If not, then running the broker under ltrace may help determine where we got stuck.

This is the command line I used to trace PMI usage of an MPI hello world program under slurm:

$    srun --label -N2 ltrace -tt -l 'libpmi*' -F ltrace.conf t/mpi/hello

use this ltrace.conf.

Alternatively, there may be a PMIX_MCA_* environment variable that can be set to enable debugging from the PMIx library. See:

https://pmix.org/support/faq/debugging-pmix/

@garlick
Member

garlick commented Jan 29, 2020

I should have added: I think we are getting past broker_pmi_get_params() now, since all it does is get rank, size, and kvsname, and we were definitely getting stuck before due to the pmix PMI_KVS_Get_my_name implementation requiring the buffer size to be at least PMIX_MAX_NSLEN (255).
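
To make the failure mode concrete, here is a minimal Python model of it. Everything in it is a hypothetical stand-in (`FakePmi`, `InvalidLength`, and both helper functions are invented for illustration and are not the libpmi or pmi4pmix API). The only facts taken from this thread are the 64-byte fixed buffer, the 255-byte minimum, and the older approach of asking the server for the maximum name length before fetching the name.

```python
class InvalidLength(Exception):
    """Stand-in for a PMI_ERR_INVALID_LENGTH return code (hypothetical)."""


class FakePmi:
    """Toy model of a PMI server that enforces a minimum buffer size."""

    def __init__(self, kvsname, min_buflen):
        self.kvsname = kvsname
        self.min_buflen = min_buflen  # e.g. 255 for PMIX_MAX_NSLEN

    def kvs_get_name_length_max(self):
        # Models PMI_KVS_Get_name_length_max(): the size the caller should use.
        return self.min_buflen

    def kvs_get_my_name(self, buflen):
        # Models PMI_KVS_Get_my_name(): the buffer is rejected when it is
        # smaller than the minimum, even if the actual name would fit.
        if buflen < self.min_buflen:
            raise InvalidLength(f"buffer of {buflen} bytes < {self.min_buflen}")
        return self.kvsname


def get_kvsname_fixed(pmi, buflen=64):
    """Current broker behaviour as described above: one fixed-size buffer."""
    return pmi.kvs_get_my_name(buflen)


def get_kvsname_dynamic(pmi):
    """Pre-rewrite behaviour: query the maximum length, then size the buffer."""
    return pmi.kvs_get_my_name(pmi.kvs_get_name_length_max())
```

With `FakePmi("some-name", 255)`, `get_kvsname_fixed(pmi)` raises `InvalidLength`, while `get_kvsname_fixed(pmi, 1024)` and `get_kvsname_dynamic(pmi)` both return the name. That matches the observation that a short name can still fail with a 64-byte buffer.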

@grondo
Contributor

grondo commented Jan 29, 2020

This is the command line I used to trace PMI usage of an MPI hello world program under slurm:

That ltrace.conf is a nice feature. Assuming we might be hitting problems like this often, we should install that file somewhere and offer a canned script that could be used with e.g. flux start --wrap for easier debugging now that we don't have internal tracing of the broker PMI client.

@dongahn
Member Author

dongahn commented Jan 30, 2020

Here is the ltrace output leading to the hang. I just ran this with one node.

ltrace.out.zip

@dongahn
Member Author

dongahn commented Jan 30, 2020

gingerfoot:~/Desktop] ahn1% cat ltrace.out | grep PMIx
17:48:44.355325 libpmi.so->PMIx_Init(0x200000180140, 0, 0, 0 <unfinished ...>
17:48:46.099394 libpmix.so.2->PMIx_Get(0x7fffffff11a0, 0x200000efa7c0, 0x7fffffff0d40, 1 <unfinished ...>
17:48:46.100672 libpmix.so.2->PMIx_Get_nb(0x7fffffff11a0, 0x200000efa7c0, 0x7fffffff0d40, 1 <unfinished ...>
17:48:46.101199 <... PMIx_Get_nb resumed> )      = 0
17:48:46.103947 <... PMIx_Get resumed> )         = -46
17:48:46.103990 <... PMIx_Init resumed> )        = 0
17:48:46.104492 libpmi.so->PMIx_Get(0x7fffffff14b0, 0x20000016a91c, 0x7fffffff15c0, 1 <unfinished ...>
17:48:46.105580 libpmix.so.2->PMIx_Get_nb(0x7fffffff14b0, 0x20000016a91c, 0x7fffffff15c0, 1 <unfinished ...>
17:48:46.106097 <... PMIx_Get_nb resumed> )      = 0
17:48:46.108837 <... PMIx_Get resumed> )         = -46
17:48:46.109348 libpmi.so->PMIx_Get(0x7fffffff14d0, 0x20000016a8bc, 0x7fffffff15e0, 1 <unfinished ...>
17:48:46.113160 <... PMIx_Get resumed> )         = 0
17:48:46.113206 libpmi.so->PMIx_Get(0x7fffffff1700, 0x1002ef40, 0, 0 <unfinished ...>

@grondo
Contributor

grondo commented Jan 30, 2020

Instead of running ltrace on flux start, you may have to run it on the broker itself with flux start --wrap, e.g.

$ srun -l -N2 --mpi=none --mpibind=off flux start -v --wrap=ltrace,-tt,-l,\'libpmi*\',-F,src/common/libpmi/ltrace.conf sleep 1
1: flux-start: ltrace -tt -l 'libpmi*' -F src/common/libpmi/ltrace.conf /usr/libexec/flux/cmd/flux-broker sleep 1
0: flux-start: ltrace -tt -l 'libpmi*' -F src/common/libpmi/ltrace.conf /usr/libexec/flux/cmd/flux-broker sleep 1

However, no matter what I do I can't get any PMI_* trace out of the broker with ltrace. Maybe we're confusing it with the way we're using dlopen/dlsym? (ltrace does seem to support dlopen() though...)

@garlick
Member

garlick commented Jan 30, 2020

I had the same problem on fluke towards the end of the day and figured it must be me!

@dongahn: is the PMIx output from setting mca debug?

This is probably naive, but the last call is a PMIx_Get() that says "unfinished". Other "unfinished" calls have a corresponding "resumed". So are we waiting for a response from the pmix server perhaps?
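
The "unfinished without a matching resumed" check can be automated. The sketch below is a rough helper, not part of flux-core or ltrace: `pending_calls` and both regular expressions are hypothetical, and they assume a single-process trace in the `-tt` format shown earlier in this thread (timestamp first, no PID column).

```python
import re

# "<timestamp> [lib->]func(args... <unfinished ...>"
UNFINISHED = re.compile(r"^(\S+)\s+(?:\S+->)?(\w+)\(.*<unfinished \.\.\.>\s*$")
# "<timestamp> <... func resumed> ) = retval"
RESUMED = re.compile(r"^\S+\s+<\.\.\. (\w+) resumed>")


def pending_calls(lines):
    """Return (timestamp, function) pairs that never got a 'resumed' line."""
    open_calls = []
    for line in lines:
        m = UNFINISHED.match(line)
        if m:
            open_calls.append((m.group(1), m.group(2)))
            continue
        m = RESUMED.match(line)
        if m:
            name = m.group(1)
            # Calls nest, so close the most recent open call with this name.
            for i in range(len(open_calls) - 1, -1, -1):
                if open_calls[i][1] == name:
                    del open_calls[i]
                    break
    return open_calls
```

Calling `pending_calls(open("ltrace.out"))` on a trace like the one quoted above should leave only the calls that were still in flight when the process hung.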

@garlick
Member

garlick commented Jan 30, 2020

Following up on our discussion in today's meeting, the broker's PMI sequence is as follows:

  1. PMI_Init()
  2. PMI_Get_size(), PMI_Get_rank(), PMI_KVS_Get_my_name()
  3. PMI_KVS_Put() cmbd.<rank>.uri (all except leaf nodes)
  4. PMI_KVS_Commit(), PMI_Barrier()
  5. PMI_KVS_Get() cmbd.<parent_rank>.uri (all except TBON root)
  6. PMI_Barrier()
  7. PMI_Finalize()

So key names haven't changed since 0.11.

One thing that's changed though: in 0.11, all ranks put a uri; in current code, the TBON leaf nodes do not (since nothing connects to them).
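
As a rough illustration of which ranks put and get a key under the current scheme, here is a small sketch. It assumes a complete k-ary tree with the usual array numbering (parent of rank r is (r - 1) // k). That layout, the default `k=2`, and the function names are assumptions made for the example rather than something taken from the broker source; only the `cmbd.<rank>.uri` key format comes from the sequence above.

```python
def parent_of(rank, k=2):
    """Parent rank in a k-ary tree rooted at rank 0 (None for the root)."""
    return None if rank == 0 else (rank - 1) // k


def is_leaf(rank, size, k=2):
    """A rank is a leaf when its first child would fall outside the instance."""
    return k * rank + 1 >= size


def kvs_plan(size, k=2):
    """Map each rank to (key it puts, key it gets); None means no operation."""
    plan = {}
    for rank in range(size):
        # Leaves skip the put because nothing connects to them.
        puts = None if is_leaf(rank, size, k) else f"cmbd.{rank}.uri"
        # The root skips the get because it has no parent.
        parent = parent_of(rank, k)
        gets = None if parent is None else f"cmbd.{parent}.uri"
        plan[rank] = (puts, gets)
    return plan
```

For a 4-broker instance under these assumptions, ranks 0 and 1 put their URIs, ranks 2 and 3 put nothing, and every rank except 0 gets its parent's key. A PMI implementation that expects every rank to put something before the barrier would see a different pattern than it did with 0.11.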

@dongahn
Member Author

dongahn commented Jan 30, 2020

One thing that's changed though: in 0.11, all ranks put a uri; in current code, the TBON leaf nodes do not (since nothing connects to them).

What is the commit where this change was made? I can maybe try that version to see if it works.

@garlick
Member

garlick commented Jan 30, 2020

That was in 2e3a051 which was part of #2578, merged at f9780be.

@dongahn
Member Author

dongahn commented Feb 15, 2020

Good news: Adding PMIX_MCA_gds="^ds12,ds21" to use the basic hash within PMIx works around this issue.

@JaeseungYeom and @jameshcorbett: you can make some progress using the version I installed below.

rzansel62{dahn}25: env FLUX_PMI_DEBUG=1 PMIX_MCA_gds="^ds12,ds21" PMI_LIBRARY=/usr/global/tools/pmi4pmix/blueos_3_ppc64le_ib/20191120/lib/libpmi.so jsrun -a 1 -c ALL_CPUS -g ALL_GPUS --bind=none -n 4 /usr/global/tools/flux/blueos_3_ppc64le_ib/flux-0.14.x-20200127/bin/flux start ~/ip.sh
flux-broker: dlopen /usr/global/tools/pmi4pmix/blueos_3_ppc64le_ib/20191120/lib/libpmi.so
flux-broker: using dlopen
flux-broker: dlopen /usr/global/tools/pmi4pmix/blueos_3_ppc64le_ib/20191120/lib/libpmi.so
flux-broker: using dlopen
flux-broker: dlopen /usr/global/tools/pmi4pmix/blueos_3_ppc64le_ib/20191120/lib/libpmi.so
flux-broker: using dlopen
flux-broker: dlopen /usr/global/tools/pmi4pmix/blueos_3_ppc64le_ib/20191120/lib/libpmi.so
flux-broker: using dlopen
ssh://rzansel17/var/tmp/flux-eDbAFt/0

@dongahn
Member Author

dongahn commented Feb 15, 2020

But another problem is that I can't do flux-proxy with this version:

rzansel62{dahn}23: /usr/global/tools/flux/blueos_3_ppc64le_ib/flux-0.14.x-20200127/bin/flux proxy ssh://rzansel17/var/tmp/flux-eDbAFt/0
flux-relay: local:///var/tmp/flux-eDbAFt/0: Connection refused
flux-proxy: ssh://rzansel17/var/tmp/flux-eDbAFt/0: Connection reset by peer

@SteVwonder
Member

I believe the issue is that ip.sh is producing an out-of-date format for the remote URI.

Try appending local to the end:

herbein1@lassen709 ~
% /usr/global/tools/flux/blueos_3_ppc64le_ib/flux-0.14.x-20200127/bin/flux proxy ssh://lassen708//var/tmp/flux-dVFGMH/0/local flux hwloc info
1 Machine, 32 Cores, 128 PUs
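
Until ip.sh is updated, a wrapper could rewrite the old-style URI before handing it to flux proxy. This is only a sketch based on the example above: `add_local_suffix` is a hypothetical helper, and it assumes that appending `/local` to the path is the only difference between the two formats.

```python
def add_local_suffix(uri):
    """Append '/local' to an ssh:// URI unless it is already present."""
    if not uri.startswith("ssh://"):
        return uri  # leave local:// and other schemes untouched
    stripped = uri.rstrip("/")
    if stripped.endswith("/local"):
        return stripped
    return stripped + "/local"
```

For example, `add_local_suffix("ssh://rzansel17/var/tmp/flux-eDbAFt/0")` yields `ssh://rzansel17/var/tmp/flux-eDbAFt/0/local`, which is the form that worked with flux proxy here.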

@SteVwonder
Member

Good news: Adding PMIX_MCA_gds="^ds12,ds21" to use the basic hash within PMIx works around this issue.

PS - This is awesome news!

@dongahn
Member Author

dongahn commented Feb 15, 2020

Try appending local to the end:

Yep. That did the trick.

We need to update our man page then:
https://github.com/flux-framework/flux-core/blob/9391b2027a981f7185d9c9136fcdad43ca6e3ab5/doc/man1/flux-proxy.adoc

@dongahn
Member Author

dongahn commented Feb 15, 2020

Hopefully, combining all these tricks, our CORAL workflow users can now make progress. I will also create an issue ticket to increase the kvsname buffer size.
