-
Notifications
You must be signed in to change notification settings - Fork 55
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Set rrtmgp default backend back to kokkos #3030
Conversation
Status Flag 'Pull Request AutoTester' - Testing Jenkins Projects: Pull Request Auto Testing STARTING (click to expand)Build InformationTest Name: SCREAM_PullRequest_Autotester_Weaver
Jenkins Parameters
Build InformationTest Name: SCREAM_PullRequest_Autotester_Mappy
Jenkins Parameters
Using Repos:
Pull Request Author: jgfouca |
Status Flag 'Pull Request AutoTester' - Jenkins Testing: 1 or more Jobs FAILED Note: Testing will normally be attempted again in approx. 2 Hrs. If a change to the PR source branch occurs, the testing will be attempted again on next available autotester run. Pull Request Auto Testing has FAILED (click to expand)Build InformationTest Name: SCREAM_PullRequest_Autotester_Weaver
Jenkins Parameters
Build InformationTest Name: SCREAM_PullRequest_Autotester_Mappy
Jenkins Parameters
SCREAM_PullRequest_Autotester_Weaver # 6114 FAILED (click to see last 100 lines of console output)
SCREAM_PullRequest_Autotester_Mappy # 5880 FAILED (click to see last 100 lines of console output)
|
I'm unable to build with this branch.
|
Status Flag 'Pull Request AutoTester' - Testing Jenkins Projects: Pull Request Auto Testing STARTING (click to expand)Build InformationTest Name: SCREAM_PullRequest_Autotester_Weaver
Jenkins Parameters
Build InformationTest Name: SCREAM_PullRequest_Autotester_Mappy
Jenkins Parameters
Using Repos:
Pull Request Author: jgfouca |
Status Flag 'Pull Request AutoTester' - Jenkins Testing: 1 or more Jobs FAILED Note: Testing will normally be attempted again in approx. 2 Hrs. If a change to the PR source branch occurs, the testing will be attempted again on next available autotester run. Pull Request Auto Testing has FAILED (click to expand)Build InformationTest Name: SCREAM_PullRequest_Autotester_Weaver
Jenkins Parameters
Build InformationTest Name: SCREAM_PullRequest_Autotester_Mappy
Jenkins Parameters
SCREAM_PullRequest_Autotester_Weaver # 6119 FAILED (click to see last 100 lines of console output)
SCREAM_PullRequest_Autotester_Mappy # 5885 FAILED (click to see last 100 lines of console output)
|
@jgfouca , it looks like all the AT fails are unit tests w/ rrtmgp in them. Are these expected fails? Can we merge? |
@AaronDonahue , not yet. I goofed pretty badly and did a lot of testing with Kokkos off (so yakl rrtmgp). I'm seeings loads of problems now that I corrected that. |
okay cool, just checking as I do my audit of the open PRs |
d3d549a
to
1a701dd
Compare
…rface_back_to_kokkos * origin/master: (75 commits) add EAMxx initial condition for ne4 with L128 change from scream-docs to eamxx-scripts EAMxx: flush output file whenever we write a rhist file EAMxx: add support for daily storage type in IO EAMxx: always update num snaps in file, even for Yearly/Monthly storage clarify P3Runtime struct defaults This commit removes a docker folder in eamxx no longer in use. restore BFB in order to merge PR Remind devs to load scream env Add some Kokkos dev doc EAMxx: work on atm process developer documentation EAMxx: removed undefined method in AbstractGrid Update EAMxx docs and improve help formatters Rename set_params -> setup_internals Improve OutputManager function descriptions Separate restart output manager from output manager list Only call set_logger() once Fix run_t0 and case_t0 checks fix incorrect inputs Move set_provenence_data call to in front of create_output_managers ...
OK, I think this is finally ready. |
Status Flag 'Pull Request AutoTester' - User Requested Retest - Label AT: RETEST will be reset after testing. |
Status Flag 'Pull Request AutoTester' - Testing Jenkins Projects: Pull Request Auto Testing STARTING (click to expand)Build InformationTest Name: SCREAM_PullRequest_Autotester_Weaver
Jenkins Parameters
Build InformationTest Name: SCREAM_PullRequest_Autotester_Mappy
Jenkins Parameters
Using Repos:
Pull Request Author: jgfouca |
Status Flag 'Pull Request AutoTester' - Jenkins Testing: 1 or more Jobs FAILED Note: Testing will normally be attempted again in approx. 2 Hrs. If a change to the PR source branch occurs, the testing will be attempted again on next available autotester run. Pull Request Auto Testing has FAILED (click to expand)Build InformationTest Name: SCREAM_PullRequest_Autotester_Weaver
Jenkins Parameters
Build InformationTest Name: SCREAM_PullRequest_Autotester_Mappy
Jenkins Parameters
SCREAM_PullRequest_Autotester_Weaver # 6202 FAILED (click to see last 100 lines of console output)
SCREAM_PullRequest_Autotester_Mappy # 5953 FAILED (click to see last 100 lines of console output)
|
@jgfouca I set WIP so the AT stops testing (it retests if master changes) until you decide on merging. Regarding the memory: how much more memory are we talking about? If it's not too much more, we may just merge. |
@jgfouca , i'm happy to leave the merge decision up to you. I'm happy to wait while you investigate w/ Noel, or merge it. |
@@ -240,7 +240,7 @@ static void rrtmgp_initialize( | |||
load_cld_lutcoeff(cloud_optics_lw_k, cloud_optics_file_lw); | |||
|
|||
// initialize kokkos rrtmgp pool allocator | |||
const size_t base_ref = 18000; | |||
const size_t base_ref = 4000; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's base_ref used for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit hacky. When I was doing the standalone rrtmgp work, I tried to come up with a simple function for the amount of pool memory needed. I found that comparing (ncol * nlay * nlev) to base_ref got me a reasonable estimate, but for some reason, way less is needed for full cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is just the "initial" guess for the pool, and then it can be expanded if needed, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what you mean. We can change this number in the source code (which reqs a rebuild to take effect). The pool cannot expand at runtime; so if we overflow it, the case will crash.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, ok. I thought it was like YAKL, which whenever the pool fills, opens another pool.
Is there any back of the envelope calculation to ensure this size is enough then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bartgol , I'm not sure YAKL works that way. There's been many times we've had to increase GATOR_INITIAL_MB
.
I arrived at this number through experimentation on rrtmgp standalone.
@ndkeen , thanks! Is there a way to get a line number or kernel name to see where these mallocs are happening? |
No, the profiler uses sampling. Previously, these CudaMallocs are from Kokkos views. |
@ndkeen , that is so weird. Can you confirm that you are on the correct rrtmgp sha? If you look at the rrtmgp file Within the main rrtmgp impl function, |
I did a fresh checkout of |
6318f41
to
db5b35d
Compare
…rface_back_to_kokkos * origin/master: (194 commits) Fix threads arg Fixes for p3 tests CMake settings Mergify: fix merge proteciton rule Fixup small kernel situation Quick fix for baseline generation Mergify: remove redundant merge protection condition in automerge rule Mergify: set commit message for automerge Mergify: enable pull request automerge Update .mergify.yml This commit fixes a bug in defining an int layout for hyai and hybi EAMxx - Replace 'verti' with 'elevated' in tag names; add missing lines. EAMxx - Rename tag names: 'elevated' replaces 'verti' EAMxx - Rename variables: 'elevated_' prefix replaces 'vert_'. EAMxx - Rename enum VERT_EMISSION to ELEVATED_EMISSIONS. Update based on better ekat flag handling EMAxx - Update comment in enum TracerFileType to avoid confusion. Workflows: fix logic to execute/skip eamxx testing workflows EAMxx: Fixes a comment in the newly added test EAMxx: Fixes a comment in the microphysics testmod file EAMxx: Fix a comment in the shell script ...
Merge ProtectionsYour pull request matches the following merge protections and will not be merged until they are valid. 🟠 Enforce checks passingWaiting checks:
|
7ceb74c
to
6ce5b69
Compare
Non-determinism and poor performance have been resolved.