-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
reserveCells
crashes mint
mesh inside Umpire
#1271
Conversation
643c441
to
f25bdc4
Compare
Some new observations:
|
After merging in the latest develop branch, the failed test ( |
I don't know the true cause of the crash or why it only exhibits on our docker environments. An ideal fix based on an understanding of the bug should be pursued. This commit also backs out work-arounds for avoiding trigging the crash.
0c089b5
to
c96baed
Compare
New tests reproduces the crash without involving mint and seems to implicate sidre::View memory management.
See if that fixes the crash. Our hypothesis is that static Umpire and shared Axom somehow leads to two ResourceManager singletons.
@@ -9,8 +9,8 @@ variables: | |||
DO_BUILD: 'yes' | |||
DO_TEST: 'yes' | |||
DO_CLEAN: 'no' | |||
CLANG14_IMAGENAME: 'axom/tpls:clang-14_09-23-24_17h-07m' | |||
GCC13_IMAGENAME: 'axom/tpls:gcc-13_09-23-24_17h-07m' | |||
CLANG14_IMAGENAME: 'axom/tpls:clang-14_10-10-24_01h-31m' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this change required?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. The new images have Umpire shared libs. That's the key to the bug fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@white238 @kennyweiss are you OK with switching to Umpire shared libs for azure testing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We (@kennyweiss , @gunney1 , @davidbeckingsale ) met about this last night. Basically a static umpire with a shared axom creates multiple singletons in Umpire. We need to not allow this to happen in our Spack recipe as well but this seems to be the only way around this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean add a constraint in the Axom Spack package that static Umpire conflicts with shared Axom?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes and/or a conflict
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes -- the three of us had a meeting yesterday w/ @davidbeckingsale and concluded that the static umpire combined w/ a shared axom in the docker configuration was leading to more than one umpire instance.
There might be a solution to this w/ improved linking orders and/or linker flags, but it will be difficult to track down in the short term, and is not necessary for @gunney1's actual deliverable (i.e. the CI is testing a different configuration than the one we actually need). @gunney1 is planning to create an issue summarizing that discussion so we can look into it in the future.
It's also worth noting that we've encountered similar issues in the past when mixing libraries w/ singletons with upstream shared or Fortran libraries (e.g. #865 and #270).
Edit: My response took too long and @white238 beat me to the punch
@@ -165,7 +165,7 @@ spack: | |||
scr: | |||
require: "@3.0.1~shared~tests~examples" | |||
umpire: | |||
require: "@2024.07.0~shared~examples~werror" | |||
require: "@2024.07.0+shared~examples~werror" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rhornung67 -- the change to the docker TPL configuration is actually in this line -- we are now forcing Umpire to build as a shared library for better compatibility w/ the shared axom config.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Crash reserving space for
UnstructuredMesh
The cause of the bug is the combination of static Umpire library with shared Axom library in Docker. Somehow, this resulted in 2 singleton
ResourceManagers
. The root cause of the crash inreserveCells
was when the second singleton tried to reallocate memory allocated by the first. Fixed by building Umpire shared. We updated the Docker image to make Umpire shared.This crash had been plaguing PR #1263. The
MarchingCubes
test crashed while askingmint::UnstructuredMesh
to reserve space.Two identical tests were added to monitor this behavior:
src/axom/quest/tests/quest_initialize.cpp
src/axom/mint/tests/mint_mesh.cpp
Interesting behaviors:
quest
reproduced the crash, but the one added tomint
did not.Addresses issue #1287