Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

reserveCells crashes mint mesh inside Umpire #1271

Merged
merged 21 commits into from
Oct 10, 2024

Conversation

gunney1
Copy link
Contributor

@gunney1 gunney1 commented Feb 6, 2024

Crash reserving space for UnstructuredMesh

  • This PR is a bug fix
  • It does the following:
    • Adds tests that verify the crash (which happens on our docker platforms)
    • Fix the bug.

The cause of the bug is the combination of static Umpire library with shared Axom library in Docker. Somehow, this resulted in 2 singleton ResourceManagers. The root cause of the crash in reserveCells was when the second singleton tried to reallocate memory allocated by the first. Fixed by building Umpire shared. We updated the Docker image to make Umpire shared.

This crash had been plaguing PR #1263. The MarchingCubes test crashed while asking mint::UnstructuredMesh to reserve space.

Two identical tests were added to monitor this behavior:

  • In src/axom/quest/tests/quest_initialize.cpp
  • In src/axom/mint/tests/mint_mesh.cpp

Interesting behaviors:

  • The test added to quest reproduced the crash, but the one added to mint did not.
  • The crashes only occurred in our docker builds.

Addresses issue #1287

@gunney1 gunney1 self-assigned this Feb 6, 2024
@gunney1 gunney1 added Sidre Issues related to Axom's 'sidre' component Quest Issues related to Axom's 'quest' component Mint Issues related to Axom's 'mint' component labels Feb 6, 2024
@gunney1 gunney1 force-pushed the bugfix/gunney/ug-immediate-reserve branch from 643c441 to f25bdc4 Compare February 7, 2024 00:36
@gunney1 gunney1 added the bug Something isn't working label Feb 7, 2024
@gunney1 gunney1 added this to the April 2024 Release milestone Feb 7, 2024
@gunney1
Copy link
Contributor Author

gunney1 commented Feb 22, 2024

Some new observations:

  1. Based on @white238's guess, I disabled MPI (and MFEM, which was configured with MPI) from the failing config. This fixed the failure. Not exactly sure what the mechanism of failure was.
  2. I tried to construct the UnstructuredMesh without using Sidre. This also gets past the crash. This provides a work-around for this bug, but it effectively disables some Sidre-dependent features when using the UnstructuredMesh.

@gunney1
Copy link
Contributor Author

gunney1 commented Jun 10, 2024

After merging in the latest develop branch, the failed test (quest_initialize) continues to fail. But if I remove the -DBUILD_SHARED_LIBS=ON from the configuration, the tests pass. @white238 What do you think of this clue?

I don't know the true cause of the crash or why it only exhibits on
our docker environments.  An ideal fix based on an understanding of
the bug should be pursued.

This commit also backs out work-arounds for avoiding trigging the crash.
@gunney1 gunney1 force-pushed the bugfix/gunney/ug-immediate-reserve branch from 0c089b5 to c96baed Compare October 6, 2024 13:49
@@ -9,8 +9,8 @@ variables:
DO_BUILD: 'yes'
DO_TEST: 'yes'
DO_CLEAN: 'no'
CLANG14_IMAGENAME: 'axom/tpls:clang-14_09-23-24_17h-07m'
GCC13_IMAGENAME: 'axom/tpls:gcc-13_09-23-24_17h-07m'
CLANG14_IMAGENAME: 'axom/tpls:clang-14_10-10-24_01h-31m'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this change required?

Copy link
Contributor Author

@gunney1 gunney1 Oct 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. The new images have Umpire shared libs. That's the key to the bug fix.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@white238 @kennyweiss are you OK with switching to Umpire shared libs for azure testing?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We (@kennyweiss , @gunney1 , @davidbeckingsale ) met about this last night. Basically a static umpire with a shared axom creates multiple singletons in Umpire. We need to not allow this to happen in our Spack recipe as well but this seems to be the only way around this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean add a constraint in the Axom Spack package that static Umpire conflicts with shared Axom?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes and/or a conflict

Copy link
Member

@kennyweiss kennyweiss Oct 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes -- the three of us had a meeting yesterday w/ @davidbeckingsale and concluded that the static umpire combined w/ a shared axom in the docker configuration was leading to more than one umpire instance.

There might be a solution to this w/ improved linking orders and/or linker flags, but it will be difficult to track down in the short term, and is not necessary for @gunney1's actual deliverable (i.e. the CI is testing a different configuration than the one we actually need). @gunney1 is planning to create an issue summarizing that discussion so we can look into it in the future.

It's also worth noting that we've encountered similar issues in the past when mixing libraries w/ singletons with upstream shared or Fortran libraries (e.g. #865 and #270).

Edit: My response took too long and @white238 beat me to the punch

@@ -165,7 +165,7 @@ spack:
scr:
require: "@3.0.1~shared~tests~examples"
umpire:
require: "@2024.07.0~shared~examples~werror"
require: "@2024.07.0+shared~examples~werror"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rhornung67 -- the change to the docker TPL configuration is actually in this line -- we are now forcing Umpire to build as a shared library for better compatibility w/ the shared axom config.

Copy link
Member

@kennyweiss kennyweiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @gunney1

Please also test this branch w/ AXOM_MINT_USE_SIDRE=OFF to ensure that all sidre usage w/ mint is properly guarded (see: #1435).

src/axom/quest/tests/quest_initialize.cpp Show resolved Hide resolved
src/axom/quest/tests/quest_initialize.cpp Show resolved Hide resolved
src/axom/quest/tests/quest_initialize.cpp Show resolved Hide resolved
@gunney1 gunney1 merged commit f596660 into develop Oct 10, 2024
13 checks passed
@gunney1 gunney1 deleted the bugfix/gunney/ug-immediate-reserve branch October 10, 2024 21:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Mint Issues related to Axom's 'mint' component Quest Issues related to Axom's 'quest' component Sidre Issues related to Axom's 'sidre' component
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants