Add the capability to use cached meshes and ICs #184

Merged
merged 6 commits into from
Aug 13, 2021

xylar
Collaborator

@xylar xylar commented Jul 21, 2021

This merge adds a new capability for steps to have cached outputs in a compass_cache database. Files in this database have a directory structure similar to the work directory (but without the MPAS core subdirectory, which is redundant). The file names include a date stamp so that new revisions can be added without removing older ones, which older compass versions may still depend on.

As an example, here are cached files for the mesh and init test cases for the QU240 mesh:

compass_cache/global_ocean/QU240/mesh/mesh/critical_passages_mask_final.210727.nc
compass_cache/global_ocean/QU240/mesh/mesh/culled_graph.210727.info
compass_cache/global_ocean/QU240/mesh/mesh/culled_mesh.210727.nc
compass_cache/global_ocean/QU240/PHC/init/initial_state/init_mode_forcing_data.210727.nc
compass_cache/global_ocean/QU240/PHC/init/initial_state/initial_state.210727.nc
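
The naming convention in the listing above can be sketched in a few lines of Python. This is only an illustration of the pattern visible in the paths, not the code from this PR: the function name `cache_target` is hypothetical, and the sketch assumes the date stamp is a `yymmdd` string inserted before the file extension.

```python
from datetime import date
from pathlib import PurePosixPath


def cache_target(work_path: str, date_stamp: str) -> str:
    """Map a work-directory path to a date-stamped cache path (hypothetical helper)."""
    path = PurePosixPath(work_path)
    # Drop the leading MPAS core subdirectory (e.g. "ocean"), which is redundant.
    without_core = PurePosixPath(*path.parts[1:])
    # Insert the date stamp between the file stem and its extension.
    stamped = f"{without_core.stem}.{date_stamp}{without_core.suffix}"
    return str(without_core.with_name(stamped))


# A yymmdd stamp such as "210727" can be produced with strftime.
stamp = date(2021, 7, 27).strftime("%y%m%d")
mesh_target = cache_target("ocean/global_ocean/QU240/mesh/mesh/culled_mesh.nc", stamp)
graph_target = cache_target("ocean/global_ocean/QU240/mesh/mesh/culled_graph.info", stamp)
```

Keeping the stamp in the file name rather than in a directory name means several revisions of the same output can sit side by side in one directory.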

The mapping between outputs in the cached versions of steps and the files in the compass_cache database is maintained in a file in compass.<mpas_core> called cached_files.json. This file contains a Python dictionary that maps from the output files in the cached versions of each step (the symlinks) to those in the database (the targets). For example:

{
    "ocean/global_ocean/cached/QU240/mesh/mesh/culled_mesh.nc": "global_ocean/QU240/mesh/mesh/culled_mesh.210727.nc",
    "ocean/global_ocean/cached/QU240/mesh/mesh/culled_graph.info": "global_ocean/QU240/mesh/mesh/culled_graph.210727.info",
    "ocean/global_ocean/cached/QU240/mesh/mesh/critical_passages_mask_final.nc": "global_ocean/QU240/mesh/mesh/critical_passages_mask_final.210727.nc",
    "ocean/global_ocean/cached/QU240/PHC/init/initial_state/initial_state.nc": "global_ocean/QU240/PHC/init/initial_state/initial_state.210727.nc",
    "ocean/global_ocean/cached/QU240/PHC/init/initial_state/init_mode_forcing_data.nc": "global_ocean/QU240/PHC/init/initial_state/init_mode_forcing_data.210727.nc"
}
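
As a rough sketch of how such a mapping might be consulted, the snippet below loads one entry from the example and resolves a step output to a download URL. The helper `resolve_cached_output` is hypothetical, and the base URL is assumed from the LCRC link quoted later in this conversation; the real lookup in compass may differ.

```python
import json
import posixpath

# Assumed base URL, taken from the LCRC link that appears in this thread.
CACHE_URL = (
    "https://web.lcrc.anl.gov/public/e3sm/mpas_standalonedata/"
    "mpas-ocean/compass_cache"
)

# One entry copied from the example mapping above.
EXAMPLE_JSON = """
{
    "ocean/global_ocean/cached/QU240/mesh/mesh/culled_mesh.nc": "global_ocean/QU240/mesh/mesh/culled_mesh.210727.nc"
}
"""


def resolve_cached_output(cached_files: dict, step_path: str, filename: str) -> str:
    """Return the URL of the cached target for one step output (hypothetical helper)."""
    key = posixpath.join(step_path, filename)
    if key not in cached_files:
        raise KeyError(f"no cached version of {key}")
    return f"{CACHE_URL}/{cached_files[key]}"


cached_files = json.loads(EXAMPLE_JSON)
url = resolve_cached_output(
    cached_files, "ocean/global_ocean/cached/QU240/mesh/mesh", "culled_mesh.nc"
)
```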

A new command, compass cache, has been added to aid in updating cached_files.json. This command is only available on Anvil and Chrysalis, since you can only copy files from a compass work directory onto the LCRC server from these two machines. You run compass cache from the base work directory, giving the relative paths to the step(s) that you want to cache output files from. For example:

compass cache -i ocean/global_ocean/QU240/mesh/mesh 

This will:

  1. copy the output files from the ocean/global_ocean/QU240/mesh/mesh step into the appropriate compass_cache location on the LCRC server and
  2. add these files to a local ocean_cached_files.json that can then be copied to compass/<mpas_core>/cached_files.json in a local compass branch so it is ready for a PR.

If you want, you can provide several steps with the -i flag or just call compass cache several times. In either case, each call will update the local ocean_cached_files.json.
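
The "each call updates the local file" behavior can be pictured as a read-merge-write cycle on the JSON file. The sketch below is an assumption about how that could work, not the implementation of compass cache; `update_cached_files` is a hypothetical name, and the entries passed in are just illustrative key/target pairs.

```python
import json
import os
import tempfile


def update_cached_files(json_path: str, new_entries: dict) -> dict:
    """Merge new entries into a local cached-files JSON file (hypothetical helper)."""
    entries = {}
    # Start from whatever a previous call already wrote, if anything.
    if os.path.exists(json_path):
        with open(json_path) as handle:
            entries = json.load(handle)
    entries.update(new_entries)
    with open(json_path, "w") as handle:
        json.dump(entries, handle, indent=4)
    return entries


with tempfile.TemporaryDirectory() as tmp:
    local_json = os.path.join(tmp, "ocean_cached_files.json")
    # First call: cache one output of the mesh step.
    update_cached_files(local_json, {
        "ocean/global_ocean/QU240/mesh/mesh/culled_mesh.nc":
            "global_ocean/QU240/mesh/mesh/culled_mesh.210727.nc",
    })
    # Second call: add an output of another step without losing the first entry.
    merged = update_cached_files(local_json, {
        "ocean/global_ocean/QU240/PHC/init/initial_state/initial_state.nc":
            "global_ocean/QU240/PHC/init/initial_state/initial_state.210727.nc",
    })
```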

See the design doc and updated documentation for details on setting up test cases and suites with cached outputs.

closes #175

@xylar xylar requested a review from mark-petersen July 21, 2021 11:37
@xylar xylar self-assigned this Jul 21, 2021
@xylar
Collaborator Author

xylar commented Jul 21, 2021

@mark-petersen, it seems that Chrysalis still isn't working (you can log in but there aren't any useful directories available). I'll upload QU240 and QUwISC240 cached files as soon as it comes back. Other resolutions are too big to upload from my laptop so I'll create them on Chrysalis directly when I get a chance.

I ran the performance tests with QU240 and QUwISC240 and they worked as expected (the mesh and init test cases did nothing). I haven't tested the restart, decomposition or thread tests yet, but I think they'll work fine if performance did.

Let me know if you have concerns.

Obviously, this needs to be added to the documentation. I'll do that when I get back, before this gets merged.

@xylar xylar added enhancement New feature or request ocean python package DEPRECATED: PRs and Issues involving the python package (master branch) labels Jul 21, 2021
@xylar
Collaborator Author

xylar commented Jul 21, 2021

If you want to test in the meantime, here's a file you can untar in your local mpas-ocean directory that should currently contain mesh_database, bathymetry_database and initial_condition_database: https://drive.google.com/file/d/1QmSO6Q_l8ngxV6FKkEPMGoHKk_FAgXST/view?usp=sharing

This will give you the QU240 and QUwISC240 cache files.

@mark-petersen
Collaborator

@xylar, thanks, this is great! I'm chairing sessions at SIAM today and tomorrow, will work on this next week.

@xylar xylar marked this pull request as draft July 27, 2021 13:18
@xylar xylar added the in progress This PR is not ready for review or merging label Jul 27, 2021
@xylar
Collaborator Author

xylar commented Jul 27, 2021

@mark-petersen, I'm working on this. I gave it some thought today. It needs to support date stamps for the cached files, which is not the case with the implementation here. It also could be made much more general, allowing any step from any core to have a cached version, without much trouble. So I will be working on that on the train today. It's about 1/2 done already.

@xylar xylar force-pushed the cached_init branch 11 times, most recently from d02325c to c9a3176 Compare July 27, 2021 19:26
@xylar xylar removed the ocean label Jul 27, 2021
@xylar
Collaborator Author

xylar commented Jul 27, 2021

Testing

I have tested all cached steps with 5 meshes on Chrysalis with Intel compilers and Intel MPI. All worked as expected: the cached mesh, initial_state and ssh_adjustment steps did nothing, the forward tests ran, and verification was successful.

I will also test the QU*240 mesh tests under Linux to ensure that downloading of the files from the LCRC server (which doesn't happen on Chrysalis, since they are local) also works as expected. I will update this comment when I have done this testing.

@xylar
Collaborator Author

xylar commented Jul 29, 2021

@mark-petersen, this is ready to test for all but the SOwISC12to60 test case. I want to get #187 in first, then rebase this and produce the cache files for that mesh as well. But feel free to test and review in the meantime.

@xylar
Collaborator Author

xylar commented Jul 29, 2021

@mark-petersen, sorry for being a moving target, but in the process of writing the documentation for this PR, I realized I'm not happy with the current implementation either. It's very clumsy to have "normal" and "cached" versions of steps (and test cases). Instead, it would make sense for every step to be cached with the same workflow. There should be a relatively simple way to specify which steps are cached and which are not. So far, the only reasonable way I can come up with to do that is with different test suites.

I would benefit from having a brief chat whenever you're free. Based on that chat, it might be worth putting this PR on hold and making a proper design doc for this addition. That would give @matthewhoffman (and others who might be interested) a chance to weigh in at the expense of taking quite a bit longer to get finalized and merged. Given that this functionality could have a lot of different uses if it's done properly, I think it's likely worth doing a proper design doc.

@pep8speaks

pep8speaks commented Aug 3, 2021

Hello @xylar! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-08-10 17:29:22 UTC

@mark-petersen
Collaborator

OK. Thanks for all your work so far. This is a fantastic new feature.

@xylar
Collaborator Author

xylar commented Aug 9, 2021

for some reason this one dies in the performance forward step:
compass setup -w $n/210809_pr_cache_8 -n 40c 41c 42

@mark-petersen, it looks like you must have interrupted the downloading of the file so it's broken:

$ ls -lah /usr/projects/regionalclimate/COMMON_MPAS/ocean/grids/compass_cache/global_ocean/QU240/PHC/init/initial_state/initial_state.210803.nc
-rw-r--r-- 1 mpeterse mpeterse 2.7M Aug  9 08:31 /usr/projects/regionalclimate/COMMON_MPAS/ocean/grids/compass_cache/global_ocean/QU240/PHC/init/initial_state/initial_state.210803.nc

but the same file on the LCRC server is 23 MB:
https://web.lcrc.anl.gov/public/e3sm/mpas_standalonedata/mpas-ocean/compass_cache/global_ocean/QU240/PHC/init/initial_state/

There are 2 options to fix this. The first is just to delete the file (which only you have write permission to) and rerun the test case. The other is to make a user config file and set:

# Options related to downloading files
[download]

# whether to check the size of files that have been downloaded to make sure
# they are the right size
check_size = True

We don't have this as the default because it is very time consuming to check if each downloaded file is the right size each time a test case gets set up. But if you know you're downloading files and want to make sure they're complete, this is how.
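
To make the trade-off concrete, here is a small sketch of the kind of check that `check_size` enables. It is an assumption about the logic rather than the compass download code: `needs_download` is a hypothetical helper, and the expected size is passed in directly instead of being queried from the server.

```python
import configparser
import os
import tempfile

# The user config from the comment above, parsed with the standard library.
USER_CONFIG = """
[download]
check_size = True
"""


def needs_download(path: str, expected_size: int, check_size: bool) -> bool:
    """Decide whether a cached file must be (re)downloaded (hypothetical helper)."""
    if not os.path.exists(path):
        return True
    if check_size:
        # A size mismatch suggests an interrupted download.
        return os.path.getsize(path) != expected_size
    # Without the check, any existing file is trusted as complete.
    return False


config = configparser.ConfigParser()
config.read_string(USER_CONFIG)
check_size = config.getboolean("download", "check_size")

with tempfile.TemporaryDirectory() as tmp:
    partial = os.path.join(tmp, "initial_state.210803.nc")
    with open(partial, "wb") as handle:
        handle.write(b"\0" * 10)  # stand-in for a truncated file
    stale = needs_download(partial, expected_size=100, check_size=check_size)
    trusted = needs_download(partial, expected_size=100, check_size=False)
```

With the check off, the truncated file is accepted as-is, which matches the failure described above; with it on, every setup pays the cost of looking up the expected size.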

By the way, you have also half-downloaded files on Cori in the past, causing us some hassle with the legacy COMPASS. If you're downloading files into a shared space, it's really important to let the downloads complete, or to carefully clean up if you don't. (I guess that's partly why you were trying to figure out where the files end up?)

@xylar
Collaborator Author

xylar commented Aug 9, 2021

@mark-petersen, it was pretty easy to implement what you wanted in terms of the graph.info in #203. I rebased this on that branch, created new cache files for all test cases except SOwISC12to60 (still running), and updated the list of cache files.

Please see if you can delete and re-download the cache files on Grizzly. See if you can run without the mesh steps this time. It worked for me.

I'll force-push to update my last commit once SOwISC12to60 results are there for caching.

@xylar
Collaborator Author

xylar commented Aug 10, 2021

@mark-petersen, this is ready for you to re-review along with #203

Collaborator

@mark-petersen mark-petersen left a comment

I ran the performance tests using cached init on Grizzly and Cori for QU240, QUwISC240, and EC60to30. Everything works great! See details here: #203 (review)

@mark-petersen
Collaborator

Sorry about the half-downloads. I had no idea, but I'll watch for it in the future.

xylar added 6 commits August 10, 2021 19:28
This generates an updated cached_files.json (locally named
ocean_cached_files.json or landice_cached_files.json) that lists
cached output files, and copies the cached files into the appropriate
location on the LCRC server.  It can only be run on Anvil or
Chrysalis.

When constructing a step, if ``cached=True``, the outputs for
this step will be downloaded to the appropriate local database and
symlinked instead of being computed.  Inputs (other than the cached
outputs) and the run method will be skipped.

Each MPAS core can optionally have a database (a python dictionary
in a json file called cached_files.json) that keeps track of which
files are available in the cache and what date stamp is in the
filename.

When setting up test cases, a user can supply test-case numbers
with a "c" suffix to indicate that they should be cached.  When
setting up test cases individually with a path, a user can supply
a list of steps in the test case that should use cached outputs.
A test suite can supply a line with "cached" or
"cached: <step> <step>" to indicate either that all steps in the
test case or the listed steps should use cached outputs.

So far, meshes and initial conditions for tests in the global
ocean and global convergence test groups are included.

The latest version includes graph.info files for global ocean
test cases.
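
The "c" suffix and the "cached" suite lines described in the commit messages above could be interpreted along these lines. Both function names are hypothetical, and the exact syntax rules (whitespace, validation) are assumptions rather than the behavior of compass setup.

```python
def parse_test_numbers(args):
    """Split arguments such as "40c" into (number, cached) pairs (hypothetical helper)."""
    parsed = []
    for arg in args:
        cached = arg.endswith("c")
        number = int(arg[:-1] if cached else arg)
        parsed.append((number, cached))
    return parsed


def parse_suite_cached_line(line, all_steps):
    """Return the steps a suite line asks to cache (hypothetical helper).

    "cached" selects every step; "cached: <step> <step>" selects the listed
    steps; any other line selects none.
    """
    line = line.strip()
    if line == "cached":
        return list(all_steps)
    if line.startswith("cached:"):
        return line[len("cached:"):].split()
    return []


# Mirrors the arguments of "compass setup ... -n 40c 41c 42" quoted earlier in this thread.
numbers = parse_test_numbers(["40c", "41c", "42"])
steps = ["mesh", "initial_state", "forward"]
all_cached = parse_suite_cached_line("cached", steps)
some_cached = parse_suite_cached_line("cached: mesh initial_state", steps)
```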
@xylar
Collaborator Author

xylar commented Aug 10, 2021

@mark-petersen, thanks again. I rebased after merging #203 and will merge this once tests have passed.

@xylar
Collaborator Author

xylar commented Aug 10, 2021

Oh, sorry. @matthewhoffman, did you want to do a test run with this branch? It affects the general framework but I'm confident it shouldn't have any unexpected impacts on landice. Still, doesn't hurt to check...

@matthewhoffman
Member

@xylar, I will do that today - thanks.

@matthewhoffman
Member

@xylar, I got some unexpected differences against the baseline yesterday. I suspect they may be due to an outdated baseline and not a problem with this PR. I need to redo my test more carefully today (make sure my baseline is up to date with master, and then re-test this branch). I'm starting that now.

@xylar
Collaborator Author

xylar commented Aug 12, 2021

@matthewhoffman, thanks for the update. Sounds best to sort that out. Keep me posted.

@matthewhoffman
Member

My corrected test passed. Go ahead and merge.

@xylar
Collaborator Author

xylar commented Aug 13, 2021

Thanks very much, @matthewhoffman!

@xylar xylar merged commit 36166e7 into MPAS-Dev:master Aug 13, 2021
@xylar xylar deleted the cached_init branch August 13, 2021 06:31
Labels
enhancement New feature or request python package DEPRECATED: PRs and Issues involving the python package (master branch)
Development

Successfully merging this pull request may close these issues.

Add ability to run tests from cached initial conditions
4 participants