-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use pooch
for download_example_data
#446
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #446 +/- ##
==========================================
+ Coverage 75.50% 75.58% +0.08%
==========================================
Files 64 64
Lines 7833 7824 -9
==========================================
Hits 5914 5914
+ Misses 1919 1910 -9 ☔ View full report in Codecov by Sentry. |
I suppose the idea behind pooch is to put the hash in the code (and thereby in the git commit hash etc.), so you know what file you are going to get for reproducibility (and perhaps security). If we don't care about reproducibility or security (and perhaps we don't) but only data integrity, then we don't need this extra hash but can instead rely on the checksums that could/should be in the header of the FITS files (e.g. So I don't see a good reason for having the hashes in a separate file. It doesn't really cause any problems either, so whatever. |
But using pooch is obviously better than what we had, since it actually works, so don't pay too much attention to my comment about the hashes. |
It seems the download function isn't actually covered by our tests. We could add a test file on the server and cache it to a temp dir in a test (indeed a |
Should we nevertheless add |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent! Thank you @teutoburg for making our lives a bit nicer once more.
I should read before I type, I thought the hashes were on the server, but they are in the repository, great!
About pooch, I think it would be better to add it as a dependency directly, but this should be fine too. Maybe some IDE could complain?
Apparently this function is only used in the irdb? Still fine to have it here though. Would be better to have a test here too, but yeah, it is tested through the IRDB already. I suppose that we 'should' add a unittest and mock the server, so we get coverage without it being a web test, but whatever. It gets tested when the irdb notebooks are ran, which we do every night, which is good enough for me for now.
Co-authored-by: Hugo Buddelmeijer <[email protected]>
Goes for me as well it seems. Pooch already was in our Anyway, after reading through their changelog, I think it's worth upgrading to the latest version, so I did. |
Oops, now the METIS notebooks cannot find the downloaded files anymore:
Apparently the notebooks expects the downloaded file to be present in the current working directory. Should the notebooks be fixed and use the returned paths, or should download function place copies/symlinks in the current working directory? I think we should update the notebooks. |
Did this perhaps break the windows update-to-latest-dependencies test? https://github.com/AstarVienna/ScopeSim/actions/runs/10233695289/job/28316098713 |
Running However, on my Win10 machine that I recently upgraded from Py3.9 to 3.12, I get weird errors:
|
This should be solved by #450 |
TLDRSummaryThe astropy-based download and cache functionality previously used here led to many failed runs in ScopeSim_Data. This should hopefully improve now with
pooch
, which among other applications is specifically advertised for use on "example data". It doesn't really add an additional dependency, because we're already using pooch as part ofspextra
, and (at least in testing) also through SciPy (the raccoon face image).On the
download_example_data
functionThere is an argument to be made about removing the optional
url
argument, because we can change the server URL through the cmds anyway (hopefully we don't have to...) and it's unlikely anyone would ever use this function to download different data from somewhere else. But I'm also fine with keeping it.I changed the default cache location to
~.astar/scopesim/
, in line with e.g.skycalc_ipy
. Caching should also work better now withpooch
, see below.On hashes and caches
Paraphrasing from their documentation,
pooch
uses hashes to determine if the requested file on the server is different from the one stored in the cache and thus whether it needs to be downloaded. In all setups that are not "quick and dirty script mode", it refuses to download files without a known hash provided. In theory,pooch
also supports (and encourages the use of) versioning, but I think that only works when the data is stored on a remote repo, e.g. GitHub, which we're explicitly avoiding because of file sizes involved.Storing the hashes in a separate file is described in their docs, although I suppose we're still well below having a "large number of files", so we might as well put the hashes in a dict. I don't think it'll make a practical difference. If we modify or add files in the future, we can simply change or add the corresponding hash and everything should work just fine.
Now I'm not sure how the caching behavior affects ScopeSim_Data, but we'll see...