Features/cache archives #579
Conversation
…hStrategy check the cache and skip the download if the source is available there.
…ake sure that an archive without a checksum isn't placed there (this wouldn't cause an error but does waste space and might be confusing)
@@ -46,6 +46,8 @@
 stage_path = join_path(var_path, "stage")
 repos_path = join_path(var_path, "repos")
 share_path = join_path(spack_root, "share", "spack")
+cache_path = join_path(spack_root, "var", "cache")
Can you change this to var/spack/cache?
done
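For reference, a minimal sketch of the requested change, assuming var_path already points at var/spack (as the stage_path and repos_path lines above suggest); this is illustrative, not necessarily the exact line in the branch:

```python
# Hypothetical revision of the line under review: put the cache under
# var/spack/cache by reusing var_path instead of building from spack_root.
cache_path = join_path(var_path, "cache")
```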
I'll continue along the lines suggested in the comments (assuming that is of interest). I won't get a chance to keep going with it until Monday. Thanks!
Thanks! This actually looks simpler than I thought it was going to be. The mirror/stage/fetch logic is kind of convoluted b/c it started out as one class. But thanks for hacking on it!
…as a mirror. this should support both URLFetchStrategy as well as VCSFetchStrategy (the previous strategy supported only the former). this won't work until URLFetchStrategy.archive is updated
…ll causes a failure on the initial download)
I'm curious if it is acceptable to change the expectations for the FetchStrategy.archive method: in particular, that it copy resources vs. move them. The latest commits are an attempt at the above suggestions (they depend on the suggested change, and there is a TODO added there to that effect).
@scheibelp: I'm ok with making it copy instead of move. I think that would be fine. I guess one question would be whether it makes sense to have the URLFetchStrategy fetch straight into the cache and untar from the cache. I think that might be good and would help LC save some space. That might take more work, though, so just copying into the cache would be a good start.
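To make the copy-vs.-move question concrete, here is a minimal sketch of the archiving step with copy semantics; the helper name and error handling are illustrative, not the branch's actual code:

```python
import shutil

def archive_to_mirror(archive_file, destination):
    """Copy (rather than move) a fetched archive into a mirror/cache path.

    Copying leaves the staged archive in place, so the stage can still
    expand and install it after the file has been cached.
    """
    if archive_file is None:
        raise ValueError("tried to archive a resource that was never fetched")
    shutil.copyfile(archive_file, destination)
```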
…tests (just for git tests although the same concept applies to the other unit tests which are failing - namely those for svn and hg)
I'll stick with that for now. I need to find a cleaner way to fix the failing unit tests (currently git/hg/svn_fetch, although the last commit handles git_fetch); perhaps move the cleanup I added in git_fetch.teardown to MockPackagesTest? Tests which don't inherit from that and directly involve the fetching/repo-management logic would be fragile.
…got a better way to avoid test fragility but I'll add this for now)
…ject. tests may assign a mock cache but by default it is None (this will avoid any implicit caching behavior confusing unit tests)
OK, a bit of cleanup is needed, but I think the latest updates reduce test fragility: same approach as before, but cache users now refer to it as an object, so the test code can replace the cache with a mock.
…xtra work is required in MockPackagesTest) (2) removing outdated logic (which originated in this branch) and minor cleanup
@tgamblin you asked if the implementation would support caching for ResourceStage: I added cache_local as a method to Stage and StageComposite (using the composite decorator), so this should be the case. DIY stage resources are not cached (I think that is possible but didn't think it was needed). The latest commits improve the robustness of the tests (as of yesterday, tests attempted to load from the filesystem cache, but now the cache provides a fetcher which can be mocked).
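To illustrate the cache-as-an-object idea (all names here are assumptions for the sketch, not necessarily what the branch uses): production code goes through a module-level cache reference, and test setup swaps in a mock:

```python
import os
import shutil

class FsCache(object):
    """Filesystem-backed cache of downloaded archives."""

    def __init__(self, root):
        self.root = root

    def store(self, path, relative_dest):
        dest = os.path.join(self.root, relative_dest)
        parent = os.path.dirname(dest)
        if not os.path.isdir(parent):
            os.makedirs(parent)
        shutil.copyfile(path, dest)


class MockCache(object):
    """Drop-in replacement for unit tests: never stores anything."""

    def store(self, path, relative_dest):
        pass


# Default is None so no implicit caching confuses unit tests (as the commit
# above describes); startup would assign an FsCache, and tests a MockCache.
fetch_cache = None
```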
I think this will be a really nice feature, with very little downside. HOWEVER... it struck me that we cannot trust that upstream authors won't change their tarballs while keeping the same name (I know, bad practice). Or they might rename the tarball while keeping the file contents the same. Another problem: what if two websites distribute different tarballs under the same name? Or if one website distributes its tarballs with a really generic name, like 'download.tar.gz'? For all those reasons... it seems that the "right" way to store things in the cache is by hashcode. I know it's cryptic, so put a user-readable part into the filename as well. For example, if a website provides mylib-1.2.1.tar.gz, then it would go into the cache under a filename like:
mylib-1.2.1-&lt;hashcode&gt;.tar.gz
If by hash you mean the digest in a package.py version directive, then I think I understand how this would fix things. @tgamblin I think the appropriate way to implement that is modifying mirror_archive_filename to include the digest - does that sound agreeable? In more detail, taking the scenarios above in order (see the sketch after this list):

1. Upstream changes a tarball while keeping the same name: this is the case I imagine would be frustrating with the caching logic I've implemented so far. It would successfully download and then the install would fail, complaining about mismatched checksums.
2. The tarball is renamed while the file contents stay the same: this would potentially cause an unnecessary extra download but wouldn't behave incorrectly. I don't see how your proposal addresses it, although IMO it is OK not to handle it.
3. Two websites distribute different tarballs under the same name: the current mirror logic creates a separate directory for each package, so these don't collide.
4. One website uses a really generic name like 'download.tar.gz': this would result in the same error as the first case mentioned.

Thanks!
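A sketch of how mirror_archive_filename might incorporate the digest; the signature and naming scheme here are assumptions for illustration:

```python
def mirror_archive_filename(name, version, extension, digest=None):
    """Build a mirror filename, embedding the digest when one is known.

    e.g. mylib-1.2.1.tar.gz becomes mylib-1.2.1-<md5>.tar.gz, so a stale
    cached copy can never be confused with a re-released upstream tarball.
    """
    if digest:
        return "%s-%s-%s.%s" % (name, version, digest, extension)
    return "%s-%s.%s" % (name, version, extension)
```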
…g. if resource maintainer uses same name for each new version of a package)
Oops, wrong button -- accidentally closed. Reopening.
…names from run to run (as long as the available digests do not change)
This is a side effect: my primary goal is to avoid the cache erroneously succeeding with an old checksum (since that short-circuits the fetching logic). A couple of other possibilities to achieve this:
The first alternative avoids preserving archives that will likely never be used.
This doesn't interfere with checksum verification (moreover, the archive is not cached unless the verification succeeds).
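Sketched out, the "don't cache what can't be verified" rule from the earlier commit might look like this; the fetcher attributes are assumptions:

```python
def cache_if_verifiable(fetcher, cache, relative_dest):
    """Cache a fetched archive only when it carries a checksum.

    Without a digest a cached copy could never be validated later, so it
    would only waste space (per the earlier commit message).
    """
    if getattr(fetcher, 'digest', None) is None:
        return
    fetcher.check()  # assumed to raise on checksum mismatch
    cache.store(fetcher.archive_file, relative_dest)  # only verified files land here
```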
done
done (added 'digests' property to Package, removed 'file_hash' property from FetchStrategy)
This PR is of interest to me. Currently, I need to install 24 different variants for each version of HDF5, so not having to re-download it would be great. There are a few points that I am confused on. Does this just create a default mirror within Spack, so that whenever you download something, it first checks the mirror, and if it's not there it downloads it from the URL and stores it in the mirror? If so, that would be amazing. I'm currently working on a few packages like PGI and CUDA that need to be downloaded by hand. If I could run a one-line Spack command to add them to this default mirror, it would save me a lot of time. Of course, one of the problems with these packages is that they have different files for each OS/arch, each with different names and md5s. This may be outside the scope of this PR, but it should be something to keep in mind.
Yes
That would be cool, but I'd prefer to do it in a separate PR. I'm thinking a command could take a spec and a file and place that file in the default mirror. Actually, come to think of it, all you should have to do is provide the file and the package name (vs. the full spec with version etc.), since Spack could automatically match the file hash to the version. Does that sound reasonable?
Perhaps this could be encoded into the version?
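The "match the file hash to the version" idea could work roughly like this, using the digests property added earlier in this branch (its exact shape here is an assumption):

```python
import hashlib

def version_for_file(pkg, path):
    """Return the version of pkg whose recorded md5 matches the file at path."""
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            md5.update(chunk)
    digest = md5.hexdigest()
    # pkg.digests is assumed to map each known version to its md5 checksum.
    for version, checksum in pkg.digests.items():
        if checksum == digest:
            return version
    return None
```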
@scheibelp That sounds good to me. I'll save these ideas for a later PR.
-- Elizabeth
I should provide an update on this: at the 3/31 Spack meeting I was expanding on my above response to Todd:
To elaborate on the issue I was attempting to address (by adding hashes to the mirror archive filename), I'll provide a concrete scenario:
1. A package version is installed; the downloaded archive is stored in the mirror under its usual name.
2. Upstream later replaces the tarball, keeping the same filename, and the digest in package.py is updated to match.
3. The package is installed again.

Without using the digest to create the mirror archive filename, the last step will retrieve the out-of-date file from the mirror and then fail the installation with a checksum error. Using the hash as part of the filename has other implications, though, like persisting the archives with the old hash; this implies they would be useful at some point in the future, which IMO is not the case (and if that were true, they likely ought to be encoded as versions in the package.py file). I proposed switching to an approach I discussed earlier: calculate the checksum of the cached archive when retrieving it, and discard it if it does not match the current digest in package.py.
Todd mentioned he wanted to think on that, so I've held off on further development in the meantime. FWIW I think this will work and that there are no other issues in the way of merging this PR.
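For concreteness, the check-the-checksum-on-retrieval approach (which a later commit describes as "calculate and compare the checksum") might look roughly like this; the helper is illustrative:

```python
import hashlib
import os

def fetch_from_cache(cached_path, expected_md5):
    """Return the cached archive path only if its checksum still matches.

    If upstream replaced the tarball and package.py now records a new
    digest, the stale cached file fails this check and is removed, so a
    fresh download happens instead of an install-time checksum error.
    """
    if not os.path.exists(cached_path):
        return None
    md5 = hashlib.md5()
    with open(cached_path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            md5.update(chunk)
    if md5.hexdigest() != expected_md5:
        os.remove(cached_path)  # stale entry: discard rather than preserve
        return None
    return cached_path
```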
Any progress on this? The server for MUMPS went down today, I don't know for how long. I found another URL for the same tarball and kept going by changing the MUMPS package.py. But this feature, turned on routinely, would have saved me a bit of grief.
No, sorry. Let me check in w/ Todd. It slipped my mind to ask him about it the last couple of times we've talked.
…lculate and compare the checksum. This achieves the original goal of discarding stale cache files without preserving multiple files for the same version.
… [1] to conditionally cache resource: only save it if there is a feature which identifies it uniquely (for example do not cache a repository if it pulls the latest state vs. a particular tag/commit)
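The "only cache what is uniquely identified" rule in the commit above could be expressed as a predicate on the fetcher, sketched here for git (the class shape and attribute names are assumptions):

```python
class GitFetchStrategy(object):
    """Sketch of a VCS fetcher with an explicit cachability test."""

    def __init__(self, url, tag=None, commit=None, branch=None):
        self.url = url
        self.tag = tag
        self.commit = commit
        self.branch = branch

    @property
    def cachable(self):
        # A tag or commit pins exact content, so a cached copy stays valid;
        # a branch (or bare HEAD) moves over time, so caching it could
        # silently serve stale sources.
        return bool(self.tag or self.commit)
```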
…used to manage all mirror URLs - just the cache (the specific behavior that a URL may refer to a stale resource doesn't necessarily apply to mirrors)
This creates a cache directory for source archives under var/cache/ and avoids re-downloading source archives for package installs that use the same source (normally, e.g., installing the same package with different build options would trigger a re-download). This is likely to be primarily useful to people testing several different build configurations of the same package.
This does not deal with caching for source control repositories (i.e. does not modify VCSFetchStrategy).
EDIT (3/23): when this PR was started it did not handle caching for VCSFetchStrategy, but it has since been updated to support it.