Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Features/cache archives #579

Merged
merged 28 commits into from
Jun 29, 2016
Merged

Conversation

scheibelp
Copy link
Member

This creates a cache directory for source archives under var/cache/ and avoids re-downloading source archives for package installs that use the same source (normally e.g. installing the same package with different build options would trigger a re-download). This is likely to primarily be useful to people testing several different build configurations of the same package.

This does not deal with caching for source control repositories (i.e. does not modify VCSFetchStrategy).

EDIT (3/23): when this PR was started it did not handle caching VCSFetchingStrategy but has since been edited to support them.

…hStrategy check the cache and skip the download if the source is available there.
…ake sure that an archive without a checksum isnt placed there (this wouldn't cause an error but does waste space and might be confusing)
@@ -46,6 +46,8 @@
stage_path = join_path(var_path, "stage")
repos_path = join_path(var_path, "repos")
share_path = join_path(spack_root, "share", "spack")
cache_path = join_path(spack_root, "var", "cache")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you change this to var/spack/cache?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@scheibelp
Copy link
Member Author

I'll continue along the line suggested in the comments (assuming that is of interest). I won't get a chance to keep going with it until Monday.

Thanks!

@tgamblin
Copy link
Member

Thanks! This actually looks simpler than I thought it was going to be. The mirror/stage/fetch logic is kind of convoluted b/c it started out as one class. But thanks for hacking on it!

…as a mirror. this should support both URLFetchStrategy as well as VCSFetchStrategy (the previous strategy supported only the former). this won't work until URLFetchStrategy.archive is updated
…ll causes a failure on the initial download)
@scheibelp
Copy link
Member Author

I'm curious if it is acceptable to change the expectations for the FetchStrategy.archive method: in particular that it copy resources vs. move them. Latest commits are an attempt at the above suggestions (it depends on the suggested change and there is a TODO added there to that effect).

@tgamblin
Copy link
Member

@scheibelp: I'm ok with making it copy instead of move. I think that would be fine.

I guess one question would be whether it makes sense to have the URLFetchStrategy fetch straight into the cache and untar from the cache. I think that might be good and would help LC save some space. That might take more work, though, so just copying into the cache would be a good start.

…tests (just for git tests although the same concept applies to the other unit tests which are failing - namely those for svn and hg)
@scheibelp
Copy link
Member Author

just copying into the cache would be a good start.

I'll stick with that for now

I need to find a cleaner way to fix the failing unit tests (currently git/hg/svn_fetch although the last commit handles git_fetch) - perhaps move the cleanup I added in git_fetch.teardown to MockPackagesTest? Tests which don't inherit that and directly involve the fetching/repo management logic would be fragile.

…got a better way to avoid test fragility but Ill add this for now)
…ject. tests may assign a mock cache but by default it is None (this will avoid any implicit caching behavior confusing unit tests)
@scheibelp
Copy link
Member Author

OK a bit of cleanup is needed but I think the latest updates reduce test fragility: same approach as before but cache users now refer to it as an object, so the test code can replace the cache with a mock.

@scheibelp
Copy link
Member Author

@tgamblin you asked if the implementation would support caching for ResourceStage: I added cache_local as a method to Stage and StageComposite (using the composite decorator) so this should be the case.

DIY stage resources are not cached (I think that is possible but didn't think it was needed)

Latest commits improve robustness of tests (as of yesterday tests attempted to load from filesystem cache but now the cache provides a fetcher which can be mocked)

@citibeth
Copy link
Member

I think this will be a really nice feature, with very little downside.

HOWEVER... it struck me that we cannot trust that upstream authors won't change their tarballs while keeping the same name (I know, bad practice). Or they might rename the tarball while keeping the file contents the same.

Another problem is, what if two websites share different tarballs but named the same? Or if one website distributes their tarballs with a really generic name, like 'download.tar.gz'?

For all those reasons... it seems that the "right" way to store things in the cache is by hashcode. I know it's cryptic, so put a user-readable part into the filename as well.

For example... if a website provides mylib-1.2.1.tar.gz, then it would go into the cache under a filename like:

14af3293442fb99-mylib-1.2.1.tar.gz

@scheibelp
Copy link
Member Author

If by hash you mean the digest in a package.py version directive then I think I understand how this would fix things.

@tgamblin I think the appropriate way to implement that is modifying mirror_archive_filename to include the digest - does that sound agreeable?

In more detail:

it struck me that we cannot trust that upstream authors won't change their tarballs while keeping the same name (I know, bad practice).

This would be the case I imagine would be frustrating with the caching logic that I've implemented so far: it would successfully download and then the install would fail complaining about mismatched checksums.

Or they might rename the tarball while keeping the file contents the same.

This would potentially cause an unnecessary extra download but wouldn't behave incorrectly. I don't see how your proposal addresses it although IMO it is OK not to handle it.

Another problem is, what if two websites share different tarballs but named the same?

The current mirror logic creates a separate directory for each package

Or if one website distributes their tarballs with a really generic name, like 'download.tar.gz'?

This would result in the same error as the first case mentioned.

Thanks!

…g. if resource maintainer uses same name for each new version of a package)
@tgamblin
Copy link
Member

ops wrong button -- accidentally closed. reopening.

@scheibelp
Copy link
Member Author

So the intent here is to be able to store multiple versions of an archive in a mirror when the same version has been released multiple times?

This is a side effect: My primary goal is to avoid the cache erroneously succeeding with an old checksum (since that short-circuits the fetching logic). A couple other possibilities to achieve this:

  • Mirrors could use a subclass of URLFetchStrategy that does checksum verification as part of the download
  • It could instead be expected that the package.py maintainer add new versions to identify the hash changes (in which case the latest commits would not be required).

The first alternative avoids preserving archives that will likely never be used.

Obviously, the hash should still be verified, though.

This doesnt interfere with checksum verification (moreover the archive is not cached unless the verification succeeds)

it would be nice if the human-readable part of the archive came first

done

I think it would also be good if the cipher name were included in the filename

done (added 'digests' property to Package, removed 'file_hash' property from FetchStrategy)

@adamjstewart
Copy link
Member

This PR is of interest to me. Currently, I need to install 24 different variants for each version of HDF5 so not having to re-download it would be great.

There are a few points that I am confused on. Does this just create a default mirror within Spack, so that whenever you download something, it first checks the mirror, and if it's not their it downloads it from the url and stores it in the mirror? If so that would be amazing.

I'm currently working on a few packages like PGI and CUDA that need to be downloaded by hand. If I could run a one-line Spack command to add them to this default repo it would save me a lot of time. Of course, one of the problems with these packages is that they have different files for each OS/arch, each with different names and md5s. This may be outside the scope of this PR, but it should be something to keep in mind.

@scheibelp
Copy link
Member Author

Does this just create a default mirror within Spack, so that whenever you download something, it first checks the mirror, and if it's not their it downloads it from the url and stores it in the mirror?

Yes

I'm currently working on a few packages like PGI and CUDA that need to be downloaded by hand. If I could run a one-line Spack command to add them to this default repo it would save me a lot of time.

That would be cool but I'd prefer to do it to a separate PR. I'm thinking a command could take a spec and a file and place that file in the default mirror. Actually come to think of it all you should have to do is provide the file and the package name (vs. full spec with version etc.) since Spack could automatically match the file hash to the version. Does that sound reasonable?

Of course, one of the problems with these packages is that they have different files for each OS/arch, each with different names and md5s.

Perhaps this could be encoded into the version?

@adamjstewart
Copy link
Member

@scheibelp That sounds good to me. I'll save these ideas for a later PR.

@citibeth
Copy link
Member

citibeth commented Apr 1, 2016

I'm currently working on a few packages like PGI and CUDA that need to be
downloaded by hand. If I could run a one-line Spack command to add them to
this default repo it would save me a lot of time.

That would be cool but I'd prefer to do it to a separate PR. I'm thinking
a command could take a spec and a file and place that file in the default
mirror. Actually come to think of it all you should have to do is provide
the file and the package name (vs. full spec with version etc.) since Spack
could automatically match the file hash to the version. Does that sound
reasonable?

Sounds good to me. However... we will want to keep Spack-downloaded files
separate from these handcrafted-downloaded files. The first we'll be
willing to blow away from time to time, and the second we'll want to keep
around and backed up.

-- Elizabeth

@scheibelp
Copy link
Member Author

I should provide an update on this: at the 3/31 Spack meeting I was expanding on my above response to Todd:

So the intent here is to be able to store multiple versions of an archive in a mirror when the same version has been released multiple times?

This is a side effect: My primary goal is to avoid the cache erroneously succeeding with an old checksum (since that short-circuits the fetching logic). A couple other possibilities to achieve this:

To elaborate on my issue I was attempting to address (by adding hashes to the mirror archive filename) I'll provide a concrete scenario:

  • User installs package X with archive file named Y and contents Z
  • Archive Y changes to contents Z' with a new hash (i.e. no change in version or archive name)
  • User updates package.py for X with new digest (replacing old digest with new)
  • User tries installing package X

Without using the digest to create the mirror archive filename, the last step will retrieve the out-of-date file from the mirror and then fail the installation with a checksum error. Using the hash as part of the filename has other implications though like persisting the archives with the old hash; this implies they would be useful at some point in the future, which IMO is not the case (and if that were true they likely ought to be encoded as versions in the package.py file). I promoted switching to an approach I discussed earlier:

Mirrors could use a subclass of URLFetchStrategy that does checksum verification as part of the download

Todd mentioned he wanted to think on that so I've held off on further development in the meantime. FWIW I think this will work and that there are no other issues in the way of merging this PR.

@citibeth
Copy link
Member

Any progress on this? The server for MUMPS went down today, I don't know for how long. I found another URL for the same tarball and kept going by changing the MUMPS package.py. But this feature, turned on routinely, would have saved me a bit of grief.

@scheibelp
Copy link
Member Author

No: sorry. Let me check in w/ Todd. It slipped my mind to ask him about it the last couple times we've talked.

@davydden
Copy link
Member

davydden commented May 9, 2016

and the trilinos website is down today, so at the moment one needs to hack another download source and a different hash. So the feature is indeed very welcome 😄 👍

scheibelp added 3 commits June 6, 2016 12:26
…lculate and compare the checksum. This achieves the original goal of discarding stale cache files without preserving multiple files for the same version.
… [1] to conditionally cache resource: only save it if there is a feature which identifies it uniquely (for example do not cache a repository if it pulls the latest state vs. a particular tag/commit)
…used to manage all mirror URLs - just the cache (the specific behavior that a URL may refer to a stale resource doesn't necessarily apply to mirrors)
@tgamblin tgamblin merged commit 1dc62e8 into spack:develop Jun 29, 2016
matz-e pushed a commit to matz-e/spack that referenced this pull request Apr 27, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants