
STK::io::create_ioss_region() throw/error #13389

Closed
spdomin opened this issue Aug 26, 2024 · 6 comments
Labels
type: bug The primary issue is a bug in Trilinos code or tests

Comments

@spdomin
Contributor

spdomin commented Aug 26, 2024

I am hitting a throw in stk::io::InputFile::create_ioss_region() in InputFile.cpp:

```
terminate called after throwing an instance of 'std::runtime_error'
  what():  global_minmax - Attempting mpi while in barrier owned by 2
```

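For context, the message appears to be a deadlock guard used by serialized I/O: while one group of ranks owns the serialization barrier, a global MPI collective such as global_minmax cannot safely run, because the other groups are not participating. A minimal sketch of that guard pattern, using hypothetical names (check_no_active_barrier, barrier_owner) rather than the actual Ioss source:

```cpp
// Hypothetical illustration of the guard pattern -- not the actual Ioss source.
// While serialized I/O is active, only one group of ranks is inside the
// barrier-protected file-system section at a time. Starting a global MPI
// collective from there would deadlock (the other groups are still waiting
// their turn), so the utility throws instead.
#include <mpi.h>
#include <sstream>
#include <stdexcept>

namespace {
  int barrier_owner = -1; // group currently inside the serialized section; -1 if none
}

void check_no_active_barrier(const char *caller)
{
  if (barrier_owner >= 0) {
    std::ostringstream os;
    os << caller << " - Attempting mpi while in barrier owned by " << barrier_owner;
    throw std::runtime_error(os.str());
  }
}

void global_minmax(double &min_val, double &max_val, MPI_Comm comm)
{
  check_no_active_barrier("global_minmax"); // the throw seen in the output above
  MPI_Allreduce(MPI_IN_PLACE, &min_val, 1, MPI_DOUBLE, MPI_MIN, comm);
  MPI_Allreduce(MPI_IN_PLACE, &max_val, 1, MPI_DOUBLE, MPI_MAX, comm);
}
```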
This is occurring in a Nalu regression test that has two "realms" - one for the multiphysics solve and the other for IO output. Otherwise, there is nothing special about this test, and other tests use this same pattern.

The good-to-bad transition spans the recent fmt issue, so my bisect is problematic.

Good:
NaluCFD/Nalu SHA1: 1ef81b6de5bbf1964d8bec6b0b64810def33b123
Trilinos/develop SHA1: c8548cf7bdc5a50daff9fdf93d493228a74a3973

Bad:
NaluCFD/Nalu SHA1: 1ef81b6de5bbf1964d8bec6b0b64810def33b123
Trilinos/develop SHA1: 4b4c11941eb02f08372ba993bd4d54fcb0625ffa

Here are a few snapshots of the call stack:

[Attached call-stack screenshots: IossA, IossB, IossC, IossD]

@alanw0 or @gsjaardema - has anything in stk::io changed recently?

@spdomin added the type: bug label Aug 26, 2024
@spdomin
Contributor Author

spdomin commented Aug 26, 2024

Oh, this is a test that activates:

```yaml
output:
  serialized_io_group_size: 2
  output_data_base_name: heliumPlume.e
  output_frequency: 4
  output_node_set: no
  compression_level: 9
  compression_shuffle: yes
```

If I remove `serialized_io_group_size: 2`, all is well. I am not sure we ever run with this option in production simulations (I have only a vague recollection of why we added it).

Let me know how to proceed.

@alanw0
Contributor

alanw0 commented Aug 26, 2024

@spdomin @gsjaardema Tolu tells me that serialized_io_group_size is at the IOSS level.

@gsjaardema
Contributor

It is probably related to the latest SEACAS snapshot into Trilinos. I will see if I can reproduce and get in a fix...

The serialized I/O capability was added for one of the HPC systems that had an easily overloaded file system. It specifies the maximum number of ranks that should hit the file system (exodus reads/writes) at one time. There are a couple of tests in SEACAS that exercise this, but they obviously don't cover all the cases...
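For readers unfamiliar with the mechanism, here is a minimal sketch of group-serialized file access, with hypothetical names and structure rather than the actual Ioss implementation:

```cpp
// Hypothetical sketch of group-serialized file access -- not the actual Ioss code.
// With a group size of G and P ranks, the ranks form ceil(P/G) groups that
// take turns, so at most G ranks hit the file system at any one time.
#include <mpi.h>

void serialized_file_op(MPI_Comm comm, int group_size, void (*do_file_io)(int rank))
{
  int rank = 0, nprocs = 0;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &nprocs);

  const int num_groups = (nprocs + group_size - 1) / group_size;
  const int my_group   = rank / group_size;

  // Each group performs its I/O in turn; all other ranks wait at the barrier.
  for (int g = 0; g < num_groups; ++g) {
    if (g == my_group) {
      do_file_io(rank); // e.g., this rank's exodus read/write
    }
    MPI_Barrier(comm);
  }
}
```

With a group size of 2 and 8 ranks, groups {0,1}, {2,3}, {4,5}, {6,7} would touch the file system one after another; an MPI collective started by only some ranks while a group holds the barrier is exactly the deadlock that the guard in the error above protects against.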

@spdomin
Contributor Author

spdomin commented Aug 26, 2024

Dear @gsjaardema - sounds great. Let me know if you need more details on the Nalu test.

I pulled my Trinity 4000-node DNS input files and noted that we did not activate this option for our nearly 256K-MPI-rank sims... I wonder if we still need this option :)

@gsjaardema
Contributor

I haven't heard of anyone using it lately, but it (usually) doesn't cause much overhead in development, and there may be external customers using it that I am not aware of. I may look into deprecating it and see if I get any complaints/comments.

I think I know where I messed up and can trigger the error with just an io_shell run, so I should have a fix soon...

@spdomin
Contributor Author

spdomin commented Aug 27, 2024

The formerly failing Nalu tests now look clean.

@spdomin spdomin closed this as completed Aug 29, 2024