Add the capability to use cached meshes and ICs #184
@mark-petersen, it seems that Chrysalis still isn't working (you can log in but there aren't any useful directories available). I'll upload QU240 and QUwISC240 cached files as soon as it comes back. Other resolutions are too big to upload from my laptop, so I'll create them on Chrysalis directly when I get a chance. I tested the performance tests with QU240 and QUwISC240 and they worked as expected (the mesh and init test cases did nothing, as expected). I haven't tested the restart, decomposition or thread tests yet, but I think they'll work fine if performance did. Let me know if you have concerns. Obviously, this needs to be added to the documentation. I'll do that when I get back, before this gets merged.
If you want to test in the meantime, here's a file you can untar in your local This will give you the QU240 and QUwISC240 cache files.
@xylar, thanks, this is great! I'm chairing sessions at SIAM today and tomorrow, will work on this next week.
@mark-petersen, I'm working on this. I gave it some thought today. It needs to support date stamps for the cached files, which is not the case with the implementation here. It also could be made much more general, allowing any step from any core to have a cached version, without much trouble. So I will be working on that on the train today. It's about 1/2 done already.
Force-pushed from d02325c to c9a3176.
Testing
I have tested all cached steps with 5 meshes on Chrysalis with Intel compilers and Intel MPI. All worked as expected (cached I will also test the QU*240 mesh tests under Linux to ensure that downloading of the files from the LCRC server (which doesn't happen on Chrysalis, since they are local) also works as expected. I will update this comment when I have done this testing.
@mark-petersen, this is ready to test for all but the SOwISC12to60 test case. I want to get #187 in first, then rebase this and produce the cache files for that mesh as well. But feel free to test and review in the meantime.
@mark-petersen, sorry for being a moving target, but in the process of writing the documentation for this PR, I realized I'm not happy with the current implementation either. It's very clumsy to have "normal" and "cached" versions of steps (and test cases). Instead, it would make sense for every step to be cached with the same workflow. There should be a relatively simple way to specify which steps are cached and which are not. So far, the only reasonable way I can come up with to do that is with different test suites. I would benefit from having a brief chat whenever you're free. Based on that chat, it might be worth putting this PR on hold and making a proper design doc for this addition. That would give @matthewhoffman (and others who might be interested) a chance to weigh in, at the expense of taking quite a bit longer to get finalized and merged. Given that this functionality could have a lot of different uses if it's done properly, I think it's likely worth doing a proper design doc.
OK. Thanks for all your work so far. This is a fantastic new feature.
@mark-petersen, it looks like you must have interrupted the downloading of the file so it's broken:
but the same file on the LCRC server is 23 MB: There are 2 options to fix this. The first is just to delete the file (which only you have write permission to) and rerun the test case. The other is to make a user config file and set:
We don't have this as the default because it is very time-consuming to check whether each downloaded file is the right size each time a test case gets set up. But if you know you're downloading files and want to make sure they're complete, this is how. By the way, you have also half-downloaded files on Cori in the past, causing us some hassle with the legacy COMPASS. If you're downloading files into a shared space, it's really important to let the downloads complete, or to carefully clean up if you don't. (I guess that's partly why you were trying to figure out where the files end up?)
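For readers following the thread, here is a minimal sketch of the kind of size check being described: compare the size of the local file with the size reported for the copy on the server, and treat a mismatch as an interrupted download. The function name `download_is_complete` and the `expected_size` argument are hypothetical; this is not the compass implementation or its config option.

```python
import os


def download_is_complete(local_path, expected_size):
    """Return True if local_path exists and has exactly expected_size bytes.

    Hypothetical helper for illustration only. expected_size is assumed to
    come from the server (for example, from a Content-Length header).
    """
    if not os.path.exists(local_path):
        # nothing was downloaded at all
        return False
    # a partially downloaded file is smaller than the server's copy
    return os.path.getsize(local_path) == expected_size
```

A caller could delete and re-download any file for which this returns `False`. The cost mentioned above comes from having to ask the server for the expected size of every file each time a test case is set up.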
@mark-petersen, it was pretty easy to implement what you wanted in terms of the Please see if you can delete and re-download the cache files on Grizzly. See if you can run without the I'll force-push to update my last commit once
@mark-petersen, this is ready for you to re-review along with #203
I tested performance tests using cached init on Grizzly and Cori for QU240, QUwISC240, and EC60to30. Everything works great! See details here: #203 (review)
Sorry about the half-downloads. I had no idea, but I'll watch for it in the future.
This generates an updated cached_files.json (locally named ocean_cached_files.json or landice_cached_files.json) that lists cached output files, and copies the cached files into the appropriate location on the LCRC server. It can only be run on Anvil or Chrysalis.
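As a rough illustration of the "updated cached_files.json" part of this commit message, the sketch below merges new entries into a local JSON database file. The function name `update_cache_database`, the keys and the date-stamp format are all made up for the example; the actual `compass cache` command also copies files to the LCRC server, which is not shown.

```python
import json
import os


def update_cache_database(json_path, new_entries):
    """Merge new_entries into the JSON dictionary stored at json_path.

    Hypothetical helper for illustration only. Keys are assumed to be step
    outputs and values the date-stamped files in the cache.
    """
    if os.path.exists(json_path):
        with open(json_path) as handle:
            database = json.load(handle)
    else:
        # first call: start from an empty database
        database = {}
    database.update(new_entries)
    with open(json_path, 'w') as handle:
        json.dump(database, handle, indent=4, sort_keys=True)
    return database
```

Because each call reads the existing file before writing, calling it repeatedly accumulates entries rather than overwriting earlier ones.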
When constructing a step, if ``cached=True``, the outputs for this step will be downloaded to the appropriate local database and symlinked instead of being computed. Inputs (other than the cached outputs) and the run method will be skipped.
Each MPAS core can optionally have a database (a Python dictionary in a JSON file called cached_files.json) that keeps track of which files are available in the cache and what date stamp is in the filename.
When setting up test cases, a user can supply test-case numbers with a "c" suffix to indicate that they should be cached. When setting up test cases individually with a path, a user can supply a list of steps in the test case that should use cached outputs. A test suite can supply a line with "cached" or "cached: <step> <step>" to indicate that either all steps in the test case or only the listed steps should use cached outputs.
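To make the test-suite syntax concrete, here is a minimal sketch of how such lines could be interpreted. It assumes a layout in which each test-case path sits on its own line and an optional "cached" line follows it; that layout, the function name `parse_suite` and the `'_all'` marker are assumptions for the example, not the compass parser.

```python
def parse_suite(text):
    """Map each test-case path to the steps that should use cached outputs.

    Hypothetical parser for illustration only. The value is [] when nothing
    is cached, ['_all'] for a bare "cached" line, or the listed step names
    for "cached: <step> <step>".
    """
    cached = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line == 'cached' or line.startswith('cached:'):
            if current is None:
                raise ValueError('"cached" line before any test case')
            if line == 'cached':
                cached[current] = ['_all']
            else:
                # everything after the colon is a list of step names
                cached[current] = line[len('cached:'):].split()
        else:
            # any other line is taken to be a test-case path
            current = line
            cached[current] = []
    return cached
```

With this sketch, a path followed by `cached: initial_state` would map to `['initial_state']`, and a path with no following "cached" line would map to an empty list. The step name `initial_state` is only an example.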
So far, meshes and initial conditions for tests in the global ocean and global convergence test groups are included.
The latest version includes graph.info files for global ocean test cases.
@mark-petersen, thanks again. I rebased after merging #203 and will merge this once tests have passed.
Oh, sorry. @matthewhoffman, did you want to do a test run with this branch? It affects the general framework but I'm confident it shouldn't have any unexpected impacts on
@xylar, I will do that today - thanks.
@xylar, I got some unexpected differences against the baseline yesterday. I suspect they may be due to an outdated baseline and not a problem with this PR. I need to redo my test more carefully today (make sure my baseline is up to date with master, and then re-test this branch). I'm starting that now.
@matthewhoffman, thanks for the update. Sounds best to sort that out. Keep me posted.
My corrected test passed. Go ahead and merge.
Thanks very much, @matthewhoffman!
This merge adds a new capability for steps to have cached outputs in a `compass_cache` database. Files in this database have a directory structure similar to the work directory (but without the MPAS core subdirectory, which is redundant). The files include a date stamp so that new revisions can be added without removing older ones (supported by older `compass` versions).
As an example, here are cached files for the `mesh` and `init` test cases for the `QU240` mesh:
The mapping between outputs in the cached versions of steps and the files in the `compass_cache` database is maintained in a file in `compass.<mpas_core>` called `cached_files.json`. This file contains a Python dictionary that maps from the output files in the cached versions of each step (the symlinks) to those in the database (the targets). For example:
A new command, `compass cache`, has been added to aid in updating `cached_files.json`. This command is only available on Anvil and Chrysalis, since you can only copy files from a compass work directory onto the LCRC server from these two machines. You run `compass cache` from the base work directory, giving the relative paths to the step(s) that you want to cache output files from. For example:
This will:
- copy the output files from the `ocean/global_ocean/QU240/mesh/mesh` step into the appropriate `compass_cache` location on the LCRC server and
- generate an updated local `ocean_cached_files.json` that can then be copied to `compass/<mpas_core>/cached_files.json` in a local compass branch so it is ready for a PR.
If you want, you can provide several steps with the `-i` flag or just call `compass cache` several times. In either case, each call will update the local `ocean_cached_files.json`.
See the design doc and updated documentation for details about setting up test cases and suites with cached outputs.
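As a hedged sketch of how the mapping from step outputs (the symlinks) to database files (the targets) might be consulted, the function below looks up each output of a step and returns the path it would link to. The name `cached_targets`, the key format and the `cache_root` argument are assumptions for illustration and are not taken from the compass source.

```python
def cached_targets(database, step_path, outputs, cache_root):
    """Return {output: target path in the cache} for one cached step.

    Hypothetical helper for illustration only. database is the dictionary
    loaded from cached_files.json, with keys assumed to have the form
    '<step_path>/<output>' and values relative to cache_root.
    """
    targets = {}
    for output in outputs:
        key = f'{step_path}/{output}'
        if key not in database:
            raise ValueError(f'no cached file is listed for {key}')
        # the stored value carries the date stamp in its filename
        targets[output] = f'{cache_root}/{database[key]}'
    return targets
```

Failing loudly on a missing key seems preferable to silently recomputing, since a step marked as cached skips its run method and would otherwise be left without outputs.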
closes #175