Linkers

Jeff Squyres edited this page Jun 29, 2017 · 16 revisions

How Open MPI Interacts with Linkers

Content of this wiki page was taken from a thread on the Users mailing list started in October of 2010:

  1. Start of thread on the users mailing list
  2. Post where these tables / wiki content came from

This first table represents what happens in the following scenarios:

  • compile an application against Open MPI's libmpi, or
  • compile an "application" DSO that is dlopen'ed with RTLD_GLOBAL, or
  • explicitly dlopen Open MPI's libmpi with RTLD_GLOBAL
                                            OMPI DSO
    App          libmpi        OMPI DSO     components
    linked       includes      components   depend on
    against      components?   available?   libmpi.so?   Result
    ----------   -----------   ----------   ----------   ----------
 1. libmpi.so        no           no            NA       won't run
 2. libmpi.so        no           yes           no       yes
 3. libmpi.so        no           yes           yes      yes (A)
 4. libmpi.so        yes          no            NA       yes
 5. libmpi.so        yes          yes           no       maybe (B)
 6. libmpi.so        yes          yes           yes      maybe (C)
    ----------   -----------   ----------   ----------   ----------
 7. libmpi.a         no           no            NA       won't run
 8. libmpi.a         no           yes           no       yes (D)
 9. libmpi.a         no           yes           yes      no (E)
10. libmpi.a         yes          no            NA       yes
11. libmpi.a         yes          yes           no       maybe (F)
12. libmpi.a         yes          yes           yes      no (G)
    ----------   -----------   ----------   ----------   ----------

All libmpi.a scenarios assume that libmpi.so is also available.

In the OMPI v1.2 series, most components link against libmpi.so, but some do not (it's our mistake for not being uniform).

A. As far as we know, this works on all platforms that have dlopen (i.e., almost everywhere). But we've only tested (recently) Linux, OSX, and Solaris. These 3 dynamic loaders are smart enough to realize that they only need to load libmpi.so once (i.e., that the implicit dependency of libmpi.so brought in by the components is the same libmpi.so that is already loaded), so everything works fine.

B. If the same component is both in libmpi and available as a DSO, the same symbols will be defined twice when the component is dlopen'ed and Badness will ensue. If the components are different, all platforms should be ok.

C. Same caveat as B if a component is both in libmpi and available as a DSO. Same as A regarding whether the dynamic loader loads libmpi.so more than once.

D. Only works if the application was compiled with the equivalent of the GNU linker's --whole-archive flag.

E. This does not work because libmpi.a will be loaded and libmpi.so will also be pulled in as a dependency of the components. As such, all the data structures in libmpi will [attempt to] be in the process twice: the "main libmpi" will have one set and the libmpi pulled in by the component dependencies will have a different set. Nothing good will come of that: possibly dynamic linker run-time symbol conflicts or possibly two separate copies of the symbols. Both possibilities are Bad.

F. Same caveat as B if a component is both in libmpi and available as a DSO.

G. Same problem as E.


This second table represents what happens in the following scenarios:

  • compile an "application" DSO that is dlopen'ed with RTLD_LOCAL, or
  • explicitly dlopen Open MPI's libmpi with RTLD_LOCAL
                                            OMPI DSO
    App          libmpi        OMPI DSO     components
    DSO linked   includes      components   depend on
    against      components?   available?   libmpi.so?   Result
    ----------   -----------   ----------   ----------   ----------
13. libmpi.so        no           no            NA       won't run
14. libmpi.so        no           yes           no       no (8)
15. libmpi.so        no           yes           yes      maybe (9)
16. libmpi.so        yes          no            NA       ok
17. libmpi.so        yes          yes           no       no (10)
18. libmpi.so        yes          yes           yes      maybe (11)
    ----------   -----------   ----------   ----------   ----------
19. libmpi.a         no           no            NA       won't run
20. libmpi.a         no           yes           no       no (12)
21. libmpi.a         no           yes           yes      no (13)
22. libmpi.a         yes          no            NA       ok
23. libmpi.a         yes          yes           no       no (14)
24. libmpi.a         yes          yes           yes      no (15)
    ----------   -----------   ----------   ----------   ----------

All libmpi.a scenarios assume that libmpi.so is also available.

(8) This does not work because the OMPI DSOs require symbols in libmpi that will not be able to be found because libmpi.so was not loaded in the global scope.

(9) This is a fun case: the Linux dynamic linker is smart enough to make it work, but others likely will not. What happens is that libmpi.so is loaded in a LOCAL scope, but then OMPI dlopens its own DSOs that require symbols from libmpi. The Linux linker figures this out and resolves the required symbols from the already-loaded LOCAL libmpi.so. Other linkers will fail to figure out that there is a libmpi.so already loaded in the process and will therefore load a 2nd copy. This results in the problems cited in (E).

(10) This does not work either a) because of the caveat stated in (B) or b) because of the unresolved symbol issue stated in (8).

(11) This may not work either because of the caveat stated in (B) or because of the duplicate libmpi.so issue cited in (9). If you are using the Linux linker, then (9) is not an issue, and it should work.

(12) Essentially the same as the unresolved symbol issue cited in (8), but with libmpi.a instead of libmpi.so.

(13) Worse than (9); the Linux linker will not figure this one out because the libmpi.a symbols are not part of a loaded "libmpi" library -- they are simply part of the application DSO, so there's no way for the linker to know that by loading libmpi.so, it's going to be loading a 2nd set of the same symbols that are already in the process. Hence, we devolve down to the duplicate symbol issue cited in (E).

(14) This does not work either a) because of the caveat stated in (B) or b) because of the unresolved symbol issue stated in (8).

(15) This may not work either because of the caveat stated in (B) or because of the duplicate libmpi.so issue cited in (13).
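
Several of the notes above hinge on whether a libmpi.so is already resident in the process. As a rough diagnostic, a plugin author could ask the loader directly. The sketch below is a minimal illustration, not part of Open MPI: the helper name `is_resident` is hypothetical, and it assumes a platform whose dlopen supports RTLD_NOLOAD (exposed in Python as `os.RTLD_NOLOAD`). It only detects a shared library loaded under the given name; it cannot see a libmpi.a that was linked statically into an application DSO, which is the situation in (13).

```python
import ctypes
import os


def is_resident(library_name):
    """Report whether the dynamic loader already has library_name loaded.

    Returns True or False, or None when RTLD_NOLOAD is not available on
    this platform (hypothetical helper, for illustration only).
    """
    noload = getattr(os, "RTLD_NOLOAD", None)
    if noload is None:
        return None
    try:
        # With RTLD_NOLOAD, dlopen does not load anything new; it only
        # succeeds if the library is already present in the process.
        ctypes.CDLL(library_name, mode=noload)
    except OSError:
        return False
    return True
```

For example, `is_resident("libmpi.so")` could be called from the embedding interpreter before and after the MPI plugin is imported; the library name is an assumption and may need a version suffix on a given system.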


In the OMPI v1.2 series, most OMPI configurations fall into scenarios 2 and 3 (as I mentioned above, we have some components that link against libmpi and others that don't -- our mistake for not being consistent).

The problematic scenario that the R and Python MPI plugins are running into is 14 because the osc_pt2pt component does not link against libmpi. Most of the rest of our components do link against libmpi, and therefore fall into scenario 15, and therefore work on Linux (but possibly not elsewhere).

With all this being said, if you are looking for a general solution for the Python and R plugins, dlopen() of libmpi with RTLD_GLOBAL before MPI_INIT seems to be the way to go. Specifically, even if we updated osc_pt2pt to link against libmpi, that will work on Linux, but not elsewhere. dlopen'ing libmpi with GLOBAL seems to be the most portable solution.

Indeed, table 1 also suggests that we should change our components (as Brian suggests) so that none of them link against libmpi, because then we'll gain the ability to work properly with a static libmpi.a. That would put OMPI's common usage into scenarios 2 and 8, which is better than the mix of scenarios 2, 3, 8, and 9 in use today (scenario 9 is why we don't work with libmpi.a).

...but I think that this would break the current R and Python plugins until they put in the explicit call to dlopen().
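
For a Python plugin, the explicit dlopen() with RTLD_GLOBAL discussed above might look roughly like the following. This is a minimal sketch under stated assumptions: the helper name `load_global` is hypothetical, and the library file names passed to it depend on the platform and the Open MPI installation.

```python
import ctypes


def load_global(names):
    """Try each library name in turn, loading it with RTLD_GLOBAL.

    Returns the ctypes handle of the first library that loads, or None
    if none of them could be loaded (hypothetical helper).
    """
    for name in names:
        try:
            # RTLD_GLOBAL places the library's symbols in the global scope,
            # so DSOs that are dlopen'ed later (such as Open MPI's
            # components) can resolve symbols against it.
            return ctypes.CDLL(name, mode=ctypes.RTLD_GLOBAL)
        except OSError:
            continue
    return None
```

A plugin would call something like `load_global(["libmpi.so", "libmpi.dylib"])` before loading its own MPI extension module and before MPI_INIT is invoked, and keep the returned handle alive for the life of the process. A return value of None means no candidate name could be loaded, and the caller has to decide how to proceed.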
