Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix DAG Generation and Enable Handling Initial Conditions #3101

Closed
wants to merge 5 commits into from

Conversation

mjs271
Copy link
Contributor

@mjs271 mjs271 commented Nov 7, 2024

This PR modifies the atmospheric DAG generator (atmosphere_process_dag.xpp) and atmosphere driver to get things working again. Namely,

  • There are now 4 checkpoints during AtmosphereDriver::initialize() at which we build the most complete DAG available and write it to file.
    • This is intended to serve as a diagnostic tool in case a build crashes, giving the user information about the created/initialized processes or fields.
    • Some examples are below that show the incremental increase in the displayed information.
    • These checkpoints occur at the end of the following functions, and the associated DAG is written out to a .dot file in the run directory:
      1. create_atm_processes()--scream_atm_createProc_dag.dot
      2. create_fields()--scream_atm_createField_dag.dot
      3. initialize_fields()--scream_atm_initField_dag.dot
      4. initialize_atm_procs()--scream_atm_initProc_dag.dot
  • I've added some code that enables the DAG to properly handle initial conditions and determine whether they are missing at the time when he DAG is generated.
    • If, at the time of generation, a required field (or group) is missing it is indicated on the corresponding node, as can be seen in the 2nd or 3rd image below.
    • Before the 4th and final DAG, (within initialize_atm_procs()), an extra box is added to the DAG with a disclaimer that any fields indicated as missing may be provided by the forthcoming initial condition file.
    • A node for the Initial Condition is now displayed on the DAG with edges leading to the
  • I've made some minor formatting changes for readability and to more clearly indicate the different types of objects represented on the DAG--e.g., the Begin/End of Atm. Timestep nodes are colored differently, and fields are printed in green when provided directly by the initial condition at initialization.
  • I've also fenced-off the write_dag() statements to ensure they are only printed by the main MPI thread, since I noticed occasional formatting errors due to a presumptive race condition.

createField_dag
initField_dag
initProc_dag

@bartgol
Copy link
Contributor

bartgol commented Nov 7, 2024

Before reviewing, a couple of comments:

  • what's the point of the 1st checkpoint? It seems to only print boxes for processes, without edges. I think we can get rid of that (unless I am missing something).
  • I don't see what fig 3 adds. It's an interpolation point between 2 and 4. I think we can do away with it. Maybe we can add a box message to fig 4 that says "green fields are read from IC file and never updated", if you really want to be verbose.

Bottom line, I think we can reduce to 2 dags: before and after reading the IC file.

@mjs271 mjs271 force-pushed the mjs/eamxx/rework-DAG branch 2 times, most recently from 94d0747 to 4bb84d0 Compare November 7, 2024 18:22
Copy link
Contributor

mergify bot commented Nov 7, 2024

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Enforce checks passing

This rule is failing.

Make sure that checks are not failing on the PR, and reviewers approved

  • #approved-reviews-by >= 1
  • any of:
    • check-skipped={% raw %}gcc-openmp / ${{ matrix.build_type }}{% endraw %}
    • all of:
      • check-success="gcc-openmp / dbg"
      • check-success="gcc-openmp / fpe"
      • check-success="gcc-openmp / opt"
      • check-success="gcc-openmp / sp"
  • any of:
    • check-skipped={% raw %}gcc-cuda / ${{ matrix.build_type }}{% endraw %}
    • all of:
      • check-success="gcc-cuda / dbg"
      • check-success="gcc-cuda / opt"
      • check-success="gcc-cuda / sp"
  • #changes-requested-reviews-by == 0
  • any of:
    • all of:
      • check-success="cpu-gcc / ERS_Ln22.ne4pg2_ne4pg2.F2010-SCREAMv1.scream-small_kernels--scream-output-preset-5"
      • check-success="cpu-gcc / ERS_Ln9.ne4_ne4.F2000-SCREAMv1-AQP1.scream-output-preset-2"
      • check-success="cpu-gcc / ERS_P16_Ln22.ne30pg2_ne30pg2.FIOP-SCREAMv1-DP.scream-dpxx-arm97"
      • check-success="cpu-gcc / SMS_D_Ln5.ne4pg2_oQU480.F2010-SCREAMv1-MPASSI.scream-mam4xx-all_mam4xx_procs"
    • check-skipped={% raw %}cpu-gcc / ${{ matrix.test.short_name }}{% endraw %}
  • any of:
    • check-skipped=cpu-gcc
    • check-success=cpu-gcc

@mjs271
Copy link
Contributor Author

mjs271 commented Nov 7, 2024

what's the point of the 1st checkpoint? It seems to only print boxes for processes, without edges. I think we can get rid of that (unless I am missing something).

I'm good with getting rid of if. My thought was to generate the DAG each time, during initialization, that more information is available that'd be reflected in the DAG, in case the driver crashes immediately after the point that the DAG is generated. For the first checkpoint, boxes with no edges, it at least reports what processes have been successfully initialized.

However, I do agree that it's a low-information figure, so I'm good to lose it--unless generating it and then overwriting with the next DAG iteration is preferred (the previous behavior of the DAG workflow).

I don't see what fig 3 adds. It's an interpolation point between 2 and 4. I think we can do away with it. Maybe we can add a box message to fig 4 that says "green fields are read from IC file and never updated", if you really want to be verbose.

Also good with losing fig 3 based on the above perspective.

I'll add that info box--do you have strong opinions on whether you'd like to control that behavior with verbosity level (more so than it already is)? @bartgol

@mjs271 mjs271 force-pushed the mjs/eamxx/rework-DAG branch from 13d622b to 942b07b Compare November 7, 2024 21:51
@bartgol
Copy link
Contributor

bartgol commented Nov 8, 2024

@mjs271 I think the real purpose of the DAG is to see what the dependencies are if something fails at runtime (and possibly to throw if deps are unmet during init). If the code crashes for whatever other reason during init, I doubt that inspecting the DAG would offer any insight.

If the DAG is needed for debug, we're likely past the point where they are all written, and since the user can ask for more/less detail with the dag verbosity param, I think we're covered there.

I'll add that info box--do you have strong opinions on whether you'd like to control that behavior with verbosity level (more so than it already is)? @bartgol
I think you can put that box in as soon as the verb level is such that fields names are written in the nodes

@bartgol bartgol force-pushed the master branch 2 times, most recently from bde4a4b to 035d8c2 Compare November 12, 2024 21:54
@bartgol
Copy link
Contributor

bartgol commented Nov 13, 2024

@mjs271 What's the status? Are you planning on pushing new changes, or are you waiting on feedback?

@mjs271
Copy link
Contributor Author

mjs271 commented Nov 13, 2024

@bartgol Just got back to this and have some changes pushed. I'm happy with it if others are

@mjs271
Copy link
Contributor Author

mjs271 commented Nov 13, 2024

Example of newly-added explanatory box:

scream_atm_initProc_dag

@bartgol bartgol force-pushed the master branch 2 times, most recently from 6318f41 to db5b35d Compare November 13, 2024 21:48
bartgol
bartgol previously approved these changes Nov 13, 2024
tcclevenger
tcclevenger previously approved these changes Nov 13, 2024
@AaronDonahue
Copy link
Contributor

AaronDonahue commented Nov 14, 2024

Will this address #1256 ?

@bartgol
Copy link
Contributor

bartgol commented Nov 15, 2024

Will this address #1256 ? @luca?

I think you got the wrong Luca there. :)

I think that issue is broader, but I also think we can close it. I don't think we want the IC to be its own atm process, considering that it would only run during init.

@mjs271
Copy link
Contributor Author

mjs271 commented Nov 19, 2024

@bartgol --in response to your comment, maybe best to merge this PR now/soon to avoid additional delay? Then I'll open a new one that addresses #3018, as well as:

  • runtime checks for missing/uninitialized fields and erroring out on failure
  • adding a WILL_FAIL test that confirms detection of missing/uninitialized fields

@mjs271 mjs271 force-pushed the mjs/eamxx/rework-DAG branch from bea7d37 to 6ad603e Compare November 20, 2024 18:00
@bartgol
Copy link
Contributor

bartgol commented Nov 20, 2024

@mjs271 We do get some SIGSEV error in the CUDA builds. Might be worth taking a look.

@mjs271
Copy link
Contributor Author

mjs271 commented Nov 20, 2024

@bartgol haha, thanks--rebased and reran to see if that could be the issue. Will knock that out shortly! 😅

@mjs271 mjs271 force-pushed the mjs/eamxx/rework-DAG branch from 6ad603e to cf476bb Compare November 21, 2024 00:32
@mergify mergify bot dismissed stale reviews from tcclevenger and bartgol November 21, 2024 00:33

Pull request has been modified.

@mahf708 mahf708 closed this Dec 19, 2024
oksanaguba pushed a commit that referenced this pull request Jan 18, 2025
Finishes PR #3101 originally submitted to the scream repo the atmospheric DAG generator (atmosphere_process_dag.xpp) and atmosphere driver to get things working again
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants