Latest flux-core cannot be bootstrapped on CORAL #2684
Comments
Not sure I'm the right person to help here, but can you try with the following patch:
diff --git a/src/broker/pmiutil.h b/src/broker/pmiutil.h
index fe188d91d..d9cffe781 100644
--- a/src/broker/pmiutil.h
+++ b/src/broker/pmiutil.h
@@ -14,7 +14,7 @@
 struct pmi_params {
     int rank;
     int size;
-    char kvsname[64];
+    char kvsname[1024];
 };

 struct pmi_handle;
If that doesn't help, and traces show that this is where it's failing, then we may need to make that code request the length from the server and dynamically allocate the buffer. |
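A minimal sketch of the "request the length and allocate dynamically" idea, in case it helps the discussion. It assumes the PMI-1 prototypes `int PMI_KVS_Get_name_length_max (int *length)` and `int PMI_KVS_Get_my_name (char kvsname[], int length)`, passed in as function pointers the way a dlopen-based caller would hold them. `kvsname_alloc` and the two `stub_*` functions are hypothetical names made up for illustration, not flux-core code, and the stub stands in for a server that returns a 100-character KVS name.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define PMI_SUCCESS 0
#define PMI_FAIL (-1)

/* Function-pointer types mirroring the two PMI-1 calls (assumed prototypes). */
typedef int (*kvs_name_length_max_f) (int *length);
typedef int (*kvs_get_my_name_f) (char kvsname[], int length);

/* Hypothetical helper: ask the server for the maximum KVS name length,
 * then allocate a buffer of that size instead of using a fixed char[64].
 * Returns a malloc'ed string the caller must free, or NULL on failure.
 */
char *kvsname_alloc (kvs_name_length_max_f get_max, kvs_get_my_name_f get_name)
{
    int len = 0;
    char *name;

    assert (get_max != NULL && get_name != NULL);
    if (get_max (&len) != PMI_SUCCESS || len <= 0)
        return NULL;
    /* One extra byte so the result is always NUL-terminated. */
    if (!(name = calloc (1, (size_t)len + 1)))
        return NULL;
    if (get_name (name, len) != PMI_SUCCESS) {
        free (name);
        return NULL;
    }
    return name;
}

/* Stub server for illustration only: reports a 256-byte maximum and
 * returns a 100-character name, which cannot fit in a 64-byte buffer.
 */
int stub_length_max (int *length)
{
    *length = 256;
    return PMI_SUCCESS;
}

int stub_get_my_name (char kvsname[], int length)
{
    const int n = 100;

    if (length < n + 1)
        return PMI_FAIL;
    memset (kvsname, 'k', (size_t)n);
    kvsname[n] = '\0';
    return PMI_SUCCESS;
}
```

With the stubs, `kvsname_alloc (stub_length_max, stub_get_my_name)` returns the full 100-character name, while calling `stub_get_my_name` with a 64-byte buffer fails, which is the situation the fixed-size `kvsname[64]` field is suspected of hitting.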
Thanks @grondo and @garlick. I will take a look. So the keys being used by our PMI library have not changed recently? If I remember right, @SteVwonder parsed these keys and fetched the rank fields to work around a PMIx bug within his patched pmi4pmix. |
The error you quoted above suggests the broker is failing in The PMI code in the broker was rewritten since 0.11. The old code called I'm assuming the problem is that the CORAL PMI wants to return a long KVS name and can't. Please try the quick fix above and if that works, then we can change the title of this bug to reflect the changes needed in the broker. |
I applied the patch from #2684 (comment), but this version of flux now hangs. Throwing in
lassen617{dahn}25: env FLUX_PMI_DEBUG=1 PMI_LIBRARY=/usr/global/tools/pmi4pmix/blueos_3_ppc64le_ib/20191120/lib/libpmi.so jsrun -a 1 -c ALL_CPUS -g ALL_GPUS --bind=none -n 4 /usr/global/tools/flux/blueos_3_ppc64le_ib/flux-0.14.x-20200127/bin/flux start ~/ip.sh
flux-broker: dlopen /usr/global/tools/pmi4pmix/blueos_3_ppc64le_ib/20191120/lib/libpmi.so
flux-broker: using dlopen
flux-broker: dlopen /usr/global/tools/pmi4pmix/blueos_3_ppc64le_ib/20191120/lib/libpmi.so
flux-broker: using dlopen
flux-broker: dlopen /usr/global/tools/pmi4pmix/blueos_3_ppc64le_ib/20191120/lib/libpmi.so
flux-broker: using dlopen
flux-broker: dlopen /usr/global/tools/pmi4pmix/blueos_3_ppc64le_ib/20191120/lib/libpmi.so
flux-broker: using dlopen |
Sorry, the broker PMI code doesn't really have much instrumentation since it doesn't go through our PMI library anymore. You might try adding This is the command line I used to trace PMI usage of an MPI hello world program under slurm:
use this ltrace.conf. Alternatively, there may be a |
I should have added: I think we are getting past |
That ltrace.conf is a nice feature. Assuming we might be hitting problems like this often, we should install that file somewhere and offer a canned script that could be used with e.g. |
Here is the ltrace output leading to the hang. I just ran this with one node. |
|
instead of running
However, no matter what I do I can't get any |
I had the same problem on fluke towards the end of the day and figured it must be me! @dongahn: is the PMIx output from setting mca debug? This is probably naive, but the last call is a |
Following up on our discussion in today's meeting, the broker's PMI sequence is as follows:
So key names haven't changed since 0.11. One thing that's changed though: in 0.11, all ranks put a uri; in current code, the TBON leaf nodes do not (since nothing connects to them). |
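To make the "leaf nodes do not put a uri" point concrete, here is a small sketch. It assumes the TBON is a complete k-ary tree rooted at rank 0, where the children of rank r are k*r+1 through k*r+k. `rank_is_leaf` and `rank_puts_uri` are hypothetical helpers written for this illustration, not the broker's actual functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if 'rank' has no children in a complete k-ary tree of 'size'
 * ranks rooted at rank 0 (assumed layout: first child is k*rank + 1).
 */
bool rank_is_leaf (uint32_t rank, uint32_t size, uint32_t k)
{
    assert (k > 0 && rank < size);
    /* Widen to 64 bits so k * rank cannot overflow. */
    return (uint64_t)k * rank + 1 >= size;
}

/* Under the behavior described above, only ranks that something
 * connects to (i.e. non-leaf ranks) need to publish a uri.
 */
bool rank_puts_uri (uint32_t rank, uint32_t size, uint32_t k)
{
    return !rank_is_leaf (rank, size, k);
}
```

For a 4-broker instance like the `-n 4` runs in this thread, and assuming k=2, ranks 0 and 1 would put a uri and ranks 2 and 3 would not. Under the 0.11 behavior described above, all four ranks would have put one.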
What is the commit where this change was made? I can maybe try that version to see if it works. |
Good news: Adding @JaeseungYeom and @jameshcorbett: you can make some progress using the version I installed below.
rzansel62{dahn}25: env FLUX_PMI_DEBUG=1 PMIX_MCA_gds="^ds12,ds21" PMI_LIBRARY=/usr/global/tools/pmi4pmix/blueos_3_ppc64le_ib/20191120/lib/libpmi.so jsrun -a 1 -c ALL_CPUS -g ALL_GPUS --bind=none -n 4 /usr/global/tools/flux/blueos_3_ppc64le_ib/flux-0.14.x-20200127/bin/flux start ~/ip.sh
flux-broker: dlopen /usr/global/tools/pmi4pmix/blueos_3_ppc64le_ib/20191120/lib/libpmi.so
flux-broker: using dlopen
flux-broker: dlopen /usr/global/tools/pmi4pmix/blueos_3_ppc64le_ib/20191120/lib/libpmi.so
flux-broker: using dlopen
flux-broker: dlopen /usr/global/tools/pmi4pmix/blueos_3_ppc64le_ib/20191120/lib/libpmi.so
flux-broker: using dlopen
flux-broker: dlopen /usr/global/tools/pmi4pmix/blueos_3_ppc64le_ib/20191120/lib/libpmi.so
flux-broker: using dlopen
ssh://rzansel17/var/tmp/flux-eDbAFt/0 |
But another problem is that I can't do
rzansel62{dahn}23: /usr/global/tools/flux/blueos_3_ppc64le_ib/flux-0.14.x-20200127/bin/flux proxy ssh://rzansel17/var/tmp/flux-eDbAFt/0
flux-relay: local:///var/tmp/flux-eDbAFt/0: Connection refused
flux-proxy: ssh://rzansel17/var/tmp/flux-eDbAFt/0: Connection reset by peer |
I believe the issue is that Try appending
|
PS - This is awesome news! |
Yep. That did the trick. We need to update our man page then: |
Hopefully, by combining all these tricks, our CORAL workflow users can now make progress. I will also create an issue ticket to increase the kvsname buffer size. |
I am helping @JaeseungYeom start testing and exploring his flux-dyad on LBANN on Lassen. It seems the latest and greatest flux-core version fails to bootstrap when we use the PMIx workaround Stephen put together for other workflow teams like MuMMI.
@JaeseungYeom and I confirmed that the workaround still works with the older flux version that the MuMMI team is using (flux-0.11.x-20190425). But @JaeseungYeom needs the latest and greatest version to make progress. In fact, @jameshcorbett will need the same version for his UPQ work as well.
My guess is that the keys being used by our PMI library have changed in the latest and greatest, such that the workaround no longer works...
Any idea as to how to fix this quickly?