Add the capability to use cached meshes and ICs #184

xylar · 2021-07-21T11:35:26Z

This merge adds a new capability for steps to have cached outputs in a compass_cache database. Files in this database have a directory structure similar to the work directory (but without the MPAS core subdirectory, which is redundant). The file names include a date stamp so that new revisions can be added without removing older ones, which older compass versions may still need.

As an example, here are cached files for the mesh and init test cases for the QU240 mesh:

compass_cache/global_ocean/QU240/mesh/mesh/critical_passages_mask_final.210727.nc
compass_cache/global_ocean/QU240/mesh/mesh/culled_graph.210727.info
compass_cache/global_ocean/QU240/mesh/mesh/culled_mesh.210727.nc
compass_cache/global_ocean/QU240/PHC/init/initial_state/init_mode_forcing_data.210727.nc
compass_cache/global_ocean/QU240/PHC/init/initial_state/initial_state.210727.nc

The mapping between outputs in the cached versions of steps and the files in the compass_cache database is maintained in a file in compass.<mpas_core> called cached_files.json. This file contains a Python dictionary that maps from the output files in the cached versions of each step (the symlinks) to those in the database (the targets). For example:

{
    "ocean/global_ocean/cached/QU240/mesh/mesh/culled_mesh.nc": "global_ocean/QU240/mesh/mesh/culled_mesh.210727.nc",
    "ocean/global_ocean/cached/QU240/mesh/mesh/culled_graph.info": "global_ocean/QU240/mesh/mesh/culled_graph.210727.info",
    "ocean/global_ocean/cached/QU240/mesh/mesh/critical_passages_mask_final.nc": "global_ocean/QU240/mesh/mesh/critical_passages_mask_final.210727.nc",
    "ocean/global_ocean/cached/QU240/PHC/init/initial_state/initial_state.nc": "global_ocean/QU240/PHC/init/initial_state/initial_state.210727.nc",
    "ocean/global_ocean/cached/QU240/PHC/init/initial_state/init_mode_forcing_data.nc": "global_ocean/QU240/PHC/init/initial_state/init_mode_forcing_data.210727.nc"
}
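As a minimal sketch of how such a mapping could be read, the snippet below looks up the database target for one output and splits off its date stamp. The function names (`split_date_stamp`, `resolve_cached_output`) are hypothetical and not part of compass; only the JSON layout and the `name.<stamp>.ext` pattern are taken from the example above, and the 6-digit stamp is an assumption based on those file names.

```python
import json
import re


def split_date_stamp(target):
    """Split 'dir/name.210727.nc' into ('dir/name.nc', '210727').

    Assumes a 6-digit date stamp just before the file extension,
    as in the example file names above.
    """
    match = re.fullmatch(r"(.+)\.(\d{6})(\.[^./]+)", target)
    if match is None:
        raise ValueError(f"no date stamp found in {target!r}")
    base, stamp, ext = match.groups()
    return base + ext, stamp


def resolve_cached_output(cache_map, output_path):
    """Return (database target, date stamp) for a cached output path."""
    target = cache_map[output_path]
    _, stamp = split_date_stamp(target)
    return target, stamp


# A one-entry mapping in the same shape as cached_files.json
example_json = """
{
    "ocean/global_ocean/cached/QU240/mesh/mesh/culled_mesh.nc":
        "global_ocean/QU240/mesh/mesh/culled_mesh.210727.nc"
}
"""
cache_map = json.loads(example_json)
target, stamp = resolve_cached_output(
    cache_map, "ocean/global_ocean/cached/QU240/mesh/mesh/culled_mesh.nc")
print(target, stamp)
```

Keeping the stamp inside the target name means an older cached_files.json keeps pointing at the older revision, even after a newer file is added to the database.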

A new command, compass cache, has been added to aid in updating cached_files.json. This command is only available on Anvil and Chrysalis, since these are the only two machines from which you can copy files from a compass work directory onto the LCRC server. You run compass cache from the base work directory, giving the relative paths to the step(s) whose output files you want to cache. For example:

compass cache -i ocean/global_ocean/QU240/mesh/mesh

This will:

copy the output files from the ocean/global_ocean/QU240/mesh/mesh step into the appropriate compass_cache location on the LCRC server and
add these files to a local ocean_cached_files.json that can then be copied to compass/<mpas_core>/cached_files.json in a local compass branch so it is ready for a PR.

If you want, you can provide several steps with the -i flag or just call compass cache several times. In either case, each call will update the local ocean_cached_files.json.
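The naming convention described above (drop the MPAS core subdirectory, add a date stamp before the extension) can be sketched as follows. This is an illustration only, not the code behind compass cache: `cache_target` and `add_to_cached_files` are hypothetical names, and the `YYMMDD` stamp format is an assumption inferred from the example file names.

```python
from datetime import date
import posixpath


def cache_target(work_relative_path, stamp_date):
    """Map a work-directory output path to (core, database path).

    The first path component is taken to be the MPAS core (e.g. 'ocean'),
    which is dropped from the database path. The date stamp is assumed
    to be YYMMDD, inserted before the file extension.
    """
    core, _, rest = work_relative_path.partition("/")
    base, ext = posixpath.splitext(rest)
    stamp = stamp_date.strftime("%y%m%d")
    return core, f"{base}.{stamp}{ext}"


def add_to_cached_files(cached_files, work_relative_path, stamp_date):
    """Record one output in a dict shaped like <core>_cached_files.json."""
    core, target = cache_target(work_relative_path, stamp_date)
    cached_files[work_relative_path] = target
    return core


cached_files = {}
core = add_to_cached_files(
    cached_files,
    "ocean/global_ocean/QU240/mesh/mesh/culled_mesh.nc",
    date(2021, 7, 27))
print(core, cached_files)
```

Calling the helper once per output file mirrors how repeated calls to compass cache keep adding entries to the same local JSON file.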

See the design doc and updated documentation for details about setting up test cases and suites with cached outputs.

closes #175

xylar · 2021-07-21T11:48:48Z

@mark-petersen, it seems that Chrysalis still isn't working (you can log in but there aren't any useful directories available). I'll upload QU240 and QUwISC240 cached files as soon as it comes back. Other resolutions are too big to upload from my laptop, so I'll create them on Chrysalis directly when I get a chance.

I tested the performance tests with QU240 and QUwISC240 and they worked as expected (the mesh and init test cases did nothing, as expected). I didn't test restart, decomposition or thread tests yet but I think they'll work fine if performance did.

Let me know if you have concerns.

Obviously, this needs to be added to the documentation. I'll do that when I get back before this gets merged.

xylar · 2021-07-21T11:56:42Z

If you want to test in the meantime, here's a file you can untar in your local mpas-ocean directory that should currently contain mesh_database, bathymetry_database and initial_condition_database: https://drive.google.com/file/d/1QmSO6Q_l8ngxV6FKkEPMGoHKk_FAgXST/view?usp=sharing

This will give you the QU240 and QUwISC240 cache files.

mark-petersen · 2021-07-22T16:52:06Z

@xylar, thanks, this is great! I'm chairing sessions at SIAM today and tomorrow, will work on this next week.

xylar · 2021-07-27T13:20:41Z

@mark-petersen, I'm working on this. I gave it some thought today. It needs to support date stamps for the cached files, which is not the case with the implementation here. It also could be made much more general, allowing any step from any core to have a cached version, without much trouble. So I will be working on that on the train today. It's about 1/2 done already.

xylar · 2021-07-27T20:28:04Z

Testing

I have tested all cached steps with 5 meshes on Chrysalis with Intel compilers and Intel MPI. All worked as expected (cached mesh, initial_state and ssh_adjustment steps did nothing, forward tests ran as expected, and verification was successful).

I will also test the QU*240 mesh tests under Linux to ensure that downloading of the files from the LCRC server (which doesn't happen on Chrysalis, since they are local) also works as expected. I will update this comment when I have done this testing.

xylar · 2021-07-29T10:55:51Z

@mark-petersen, this is ready to test for all but the SOwISC12to60 test case. I want to get #187 in first, then rebase this and produce the cache files for that mesh as well. But feel free to test and review in the meantime.

xylar · 2021-07-29T12:25:43Z

@mark-petersen, sorry for being a moving target, but in the process of writing the documentation for this PR, I realized I'm not happy with the current implementation either. It's very clumsy to have "normal" and "cached" versions of steps (and test cases). Instead, it would make sense for every step to be cacheable with the same workflow. There should be a relatively simple way to specify which steps are cached and which are not. So far, the only reasonable way I can come up with to do that is with different test suites.

I would benefit from having a brief chat whenever you're free. Based on that chat, it might be worth putting this PR on hold and making a proper design doc for this addition. That would give @matthewhoffman (and others who might be interested) a chance to weigh in at the expense of taking quite a bit longer to get finalized and merged. Given that this functionality could have a lot of different uses if it's done properly, I think it's likely worth doing a proper design doc.

pep8speaks · 2021-08-03T13:05:22Z

Hello @xylar! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-08-10 17:29:22 UTC

mark-petersen · 2021-08-09T19:39:57Z

OK. Thanks for all your work so far. This is a fantastic new feature.

xylar · 2021-08-09T21:11:10Z

for some reason this one dies in the performance forward step:
compass setup -w $n/210809_pr_cache_8 -n 40c 41c 42

@mark-petersen, it looks like you must have interrupted the downloading of the file so it's broken:

$ ls -lah /usr/projects/regionalclimate/COMMON_MPAS/ocean/grids/compass_cache/global_ocean/QU240/PHC/init/initial_state/initial_state.210803.nc
-rw-r--r-- 1 mpeterse mpeterse 2.7M Aug  9 08:31 /usr/projects/regionalclimate/COMMON_MPAS/ocean/grids/compass_cache/global_ocean/QU240/PHC/init/initial_state/initial_state.210803.nc

but the same file on the LCRC server is 23 MB:
https://web.lcrc.anl.gov/public/e3sm/mpas_standalonedata/mpas-ocean/compass_cache/global_ocean/QU240/PHC/init/initial_state/

There are 2 options to fix this. The first is just to delete the file (which only you have write permission to) and rerun the test case. The other is to make a user config file and set:

# Options related to downloading files
[download]

# whether to check the size of files that have been downloaded to make sure
# they are the right size
check_size = True

We don't have this as the default because it is very time consuming to check if each downloaded file is the right size each time a test case gets set up. But if you know you're downloading files and want to make sure they're complete, this is how.
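To make the trade-off concrete, here is a minimal sketch of how a `check_size` option like the one above might gate a re-download. It is not the compass implementation: `needs_download` is a hypothetical helper, and the expected size is assumed to be supplied by the caller (for example, from the server), which is the time-consuming part.

```python
import configparser
import os
import tempfile


def needs_download(path, expected_size, config):
    """Decide whether a file should be (re-)downloaded.

    A missing file is always downloaded. An existing file is only
    compared against expected_size when [download] check_size is True.
    """
    if not os.path.exists(path):
        return True
    check = config.getboolean("download", "check_size", fallback=False)
    if check and expected_size is not None:
        return os.path.getsize(path) != expected_size
    return False


user_config = configparser.ConfigParser()
user_config.read_string("""
# Options related to downloading files
[download]
check_size = True
""")

default_config = configparser.ConfigParser()

# Simulate a truncated download: 3 bytes on disk, 23 bytes expected
with tempfile.TemporaryDirectory() as tmp:
    partial = os.path.join(tmp, "initial_state.nc")
    with open(partial, "wb") as handle:
        handle.write(b"abc")
    with_check = needs_download(partial, 23, user_config)
    without_check = needs_download(partial, 23, default_config)

print(with_check, without_check)
```

With the check disabled, the truncated file is trusted simply because it exists, which matches the failure described above.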

By the way, you have also half-downloaded files on Cori in the past, causing us some hassle with the legacy COMPASS. If you're downloading files into a shared space, it's really important to let the downloads complete, or to carefully clean up if you don't. (I guess that's partly why you were trying to figure out where the files end up?)

xylar · 2021-08-09T21:25:13Z

@mark-petersen, it was pretty easy to implement what you wanted in terms of the graph.info in #203. I rebased this on that branch, created new cache files for all test cases except SOwISC12to60 (still running), and updated the list of cache files.

Please see if you can delete and re-download the cache files on Grizzly. See if you can run without the mesh steps this time. It worked for me.

I'll force-push to update my last commit once SOwISC12to60 results are there for caching.

compass/ocean/cached_files.json

xylar · 2021-08-10T08:39:44Z

@mark-petersen, this is ready for you to re-review along with #203

mark-petersen

I tested performance tests using cached init on Grizzly and Cori for QU240, QUwISC240, and EC60to30. Everything works great! See details here: #203 (review)

mark-petersen · 2021-08-10T13:33:00Z

Sorry about the half-downloads. I had no idea, but I'll watch for it in the future.

This generates an updated cached_files.json (locally named ocean_cached_files.json or landice_cached_files.json) that lists cached output files, and copies the cached files into the appropriate location on the LCRC server. It can only be run on Anvil or Chrysalis.

When constructing a step, if ``cached=True``, the outputs for this step will be downloaded to the appropriate local database and symlinked instead of being computed. Inputs (other than the cached outputs) and the run method will be skipped. Each MPAS core can optionally have a database (a Python dictionary in a JSON file called cached_files.json) that keeps track of which files are available in the cache and what date stamp is in the filename. When setting up test cases by number, a user can supply test-case numbers with a "c" suffix to indicate that they should be cached. When setting up test cases individually with a path, a user can supply a list of steps in the test case that should use cached outputs. A test suite can supply a line with "cached" to indicate that all steps in the test case should use cached outputs, or "cached: <step> <step>" to indicate that only the listed steps should.
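The two text conventions just described (a "c" suffix on test-case numbers, and "cached" lines in a test suite) can be illustrated with a small parsing sketch. The function names and return shapes here are hypothetical and chosen only for illustration; they are not taken from the compass source.

```python
def parse_test_numbers(tokens):
    """Turn tokens such as '40c' or '42' into (number, cached) pairs."""
    result = []
    for token in tokens:
        cached = token.endswith("c")
        number = int(token[:-1] if cached else token)
        result.append((number, cached))
    return result


def parse_cached_line(line):
    """Interpret one test-suite line.

    Returns (True, []) for a bare 'cached' line (all steps),
    (False, [steps...]) for 'cached: <step> <step>',
    and None if the line is not a cached directive.
    """
    text = line.strip()
    if text == "cached":
        return True, []
    if text.startswith("cached:"):
        return False, text[len("cached:"):].split()
    return None


print(parse_test_numbers(["40c", "41c", "42"]))
print(parse_cached_line("cached: mesh initial_state"))
```

In this sketch, the tokens `40c 41c 42` would select cached outputs for test cases 40 and 41 and a normal run for test case 42.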


xylar · 2021-08-10T17:29:49Z

@mark-petersen, thanks again. I rebased after merging #203 and will merge this once tests have passed.

xylar · 2021-08-10T17:31:10Z

Oh, sorry. @matthewhoffman, did you want to do a test run with this branch? It affects the general framework but I'm confident it shouldn't have any unexpected impacts on landice. Still, doesn't hurt to check...

matthewhoffman · 2021-08-11T14:50:34Z

@xylar, I will do that today - thanks.

matthewhoffman · 2021-08-12T14:21:34Z

@xylar, I got some unexpected differences against the baseline yesterday. I suspect they may be due to an outdated baseline and not a problem with this PR. I need to redo my test more carefully today (make sure my baseline is up to date with master, and then re-test this branch). I'm starting that now.

xylar · 2021-08-12T14:24:20Z

@matthewhoffman, thanks for the update. Sounds best to sort that out. Keep me posted.

matthewhoffman · 2021-08-12T16:44:34Z

My corrected test passed. Go ahead and merge.

xylar · 2021-08-13T06:08:09Z

Thanks very much, @matthewhoffman!

xylar requested a review from mark-petersen July 21, 2021 11:37

xylar self-assigned this Jul 21, 2021

xylar added the enhancement, ocean, and python package labels Jul 21, 2021

xylar marked this pull request as draft July 27, 2021 13:18

xylar added the in progress label Jul 27, 2021

xylar force-pushed the cached_init branch 11 times, most recently from d02325c to c9a3176 Compare July 27, 2021 19:26

xylar removed the ocean label Jul 27, 2021

xylar mentioned this pull request Jul 30, 2021

Add a design document for cached output files #189

Merged

xylar force-pushed the cached_init branch from c9a3176 to 9a4af41 Compare August 3, 2021 13:05

xylar force-pushed the cached_init branch from 71c8272 to ee96424 Compare August 9, 2021 19:47

xylar mentioned this pull request Aug 9, 2021

Switch global ocean forward steps to use graph.info from initial state, not mesh #203

Merged

xylar force-pushed the cached_init branch from ee96424 to e1337fc Compare August 9, 2021 21:22

xylar commented on compass/ocean/cached_files.json Aug 9, 2021 (outdated, resolved)

xylar force-pushed the cached_init branch from e1337fc to d617db3 Compare August 10, 2021 08:28

mark-petersen approved these changes Aug 10, 2021


xylar added 6 commits August 10, 2021 19:28

Add "database" of ocean cached outputs

986352c

So far, meshes and initial conditions for tests in the global ocean and global convergence test groups are included.

Add test suite for cosine bell with cached init

4435b27

Update docs to include cached outputs

f93a1f6

Update database of ocean cached files

0fa5a4c

The latest version includes graph.info files for global ocean test cases.

xylar force-pushed the cached_init branch from d617db3 to 0fa5a4c Compare August 10, 2021 17:29

matthewhoffman approved these changes Aug 12, 2021


xylar mentioned this pull request Aug 13, 2021

Remove critical_passages stream from initial_state in global_ocean test group #207

Merged

xylar merged commit 36166e7 into MPAS-Dev:master Aug 13, 2021

xylar deleted the cached_init branch August 13, 2021 06:31
