Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use pooch for download_example_data #446

Merged
merged 5 commits into from
Aug 3, 2024
Merged

Use pooch for download_example_data #446

merged 5 commits into from
Aug 3, 2024

Conversation

teutoburg
Copy link
Contributor

@teutoburg teutoburg commented Aug 1, 2024

TLDR Summary

The astropy-based download and cache functionality previously used here led to many failed runs in ScopeSim_Data. This should hopefully improve now with pooch, which among other applications is specifically advertised for use on "example data". It doesn't really add an additional dependency, because we're already using pooch as part of spextra, and (at least in testing) also through SciPy (the raccoon face image).

On the download_example_data function

There is an argument to be made about removing the optional url argument, because we can change the server URL through the cmds anyway (hopefully we don't have to...) and it's unlikely anyone would ever use this function to download different data from somewhere else. But I'm also fine with keeping it.

I changed the default cache location to ~.astar/scopesim/, in line with e.g. skycalc_ipy. Caching should also work better now with pooch, see below.

On hashes and caches

Paraphrasing from their documentation, pooch uses hashes to determine if the requested file on the server is different from the one stored in the cache and thus whether it needs to be downloaded. In all setups that are not "quick and dirty script mode", it refuses to download files without a known hash provided. In theory, pooch also supports (and encourages the use of) versioning, but I think that only works when the data is stored on a remote repo, e.g. GitHub, which we're explicitly avoiding because of file sizes involved.

Storing the hashes in a separate file is described in their docs, although I suppose we're still well below having a "large number of files", so we might as well put the hashes in a dict. I don't think it'll make a practical difference. If we modify or add files in the future, we can simply change or add the corresponding hash and everything should work just fine.

Now I'm not sure how the caching behavior affects ScopeSim_Data, but we'll see...

@teutoburg teutoburg self-assigned this Aug 1, 2024
Copy link

codecov bot commented Aug 1, 2024

Codecov Report

Attention: Patch coverage is 18.75000% with 13 lines in your changes missing coverage. Please review.

Project coverage is 75.58%. Comparing base (6b2a07c) to head (e5fc68c).

Files Patch % Lines
scopesim/server/example_data_utils.py 18.75% 13 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #446      +/-   ##
==========================================
+ Coverage   75.50%   75.58%   +0.08%     
==========================================
  Files          64       64              
  Lines        7833     7824       -9     
==========================================
  Hits         5914     5914              
+ Misses       1919     1910       -9     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@teutoburg teutoburg changed the title Fh/dwnld Use pooch for download_example_data Aug 1, 2024
@hugobuddel
Copy link
Collaborator

I suppose the idea behind pooch is to put the hash in the code (and thereby in the git commit hash etc.), so you know what file you are going to get for reproducibility (and perhaps security).

If we don't care about reproducibility or security (and perhaps we don't) but only data integrity, then we don't need this extra hash but can instead rely on the checksums that could/should be in the header of the FITS files (e.g. DATAMD5,CHECKSUM,DATASUM).

So I don't see a good reason for having the hashes in a separate file. It doesn't really cause any problems either, so whatever.

@hugobuddel
Copy link
Collaborator

But using pooch is obviously better than what we had, since it actually works, so don't pay too much attention to my comment about the hashes.

@teutoburg teutoburg marked this pull request as ready for review August 2, 2024 23:23
@teutoburg
Copy link
Contributor Author

It seems the download function isn't actually covered by our tests. We could add a test file on the server and cache it to a temp dir in a test (indeed a webtest), that way it would be downloaded each time...

@teutoburg
Copy link
Contributor Author

Should we nevertheless add pooch to the dependencies in the pyproject.toml file?

Copy link
Collaborator

@hugobuddel hugobuddel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent! Thank you @teutoburg for making our lives a bit nicer once more.

I should read before I type, I thought the hashes were on the server, but they are in the repository, great!

About pooch, I think it would be better to add it as a dependency directly, but this should be fine too. Maybe some IDE could complain?

Apparently this function is only used in the irdb? Still fine to have it here though. Would be better to have a test here too, but yeah, it is tested through the IRDB already. I suppose that we 'should' add a unittest and mock the server, so we get coverage without it being a web test, but whatever. It gets tested when the irdb notebooks are ran, which we do every night, which is good enough for me for now.

scopesim/server/example_data_utils.py Outdated Show resolved Hide resolved
scopesim/server/example_data_utils.py Show resolved Hide resolved
teutoburg and others added 2 commits August 3, 2024 13:51
@teutoburg
Copy link
Contributor Author

I should read before I type

Goes for me as well it seems. Pooch already was in our pyproject.toml dependencies, probably for the raccoon face thing. Oops.

Anyway, after reading through their changelog, I think it's worth upgrading to the latest version, so I did.

@teutoburg teutoburg merged commit 7689b87 into main Aug 3, 2024
24 of 25 checks passed
@teutoburg teutoburg deleted the fh/dwnld branch August 3, 2024 19:21
@teutoburg teutoburg added the dependencies Related to or updating any dependencies label Aug 3, 2024
@hugobuddel
Copy link
Collaborator

Oops, now the METIS notebooks cannot find the downloaded files anymore:
https://github.com/AstarVienna/ScopeSim_Data/actions/runs/10233670841/job/28312246596

FileNotFoundError                         Traceback (most recent call last)
Cell In[6], line 1
----> 1 input_hdul = fits.open("HL_Tau_prep_for_Scopesim.fits")

Apparently the notebooks expects the downloaded file to be present in the current working directory.

Should the notebooks be fixed and use the returned paths, or should download function place copies/symlinks in the current working directory?

I think we should update the notebooks.

@hugobuddel
Copy link
Collaborator

I should read before I type

Goes for me as well it seems. Pooch already was in our pyproject.toml dependencies, probably for the raccoon face thing. Oops.

Anyway, after reading through their changelog, I think it's worth upgrading to the latest version, so I did.

Did this perhaps break the windows update-to-latest-dependencies test? https://github.com/AstarVienna/ScopeSim/actions/runs/10233695289/job/28316098713

@teutoburg
Copy link
Contributor Author

Did this perhaps break the windows update-to-latest-dependencies test? https://github.com/AstarVienna/ScopeSim/actions/runs/10233695289/job/28316098713

Running poetry update locally on Win11 Py3.11 works fine (from current main).

However, on my Win10 machine that I recently upgraded from Py3.9 to 3.12, I get weird errors:

  • Simple poetry install (from lock file) fails around pyerfa. Both on main and on 6b2a07c, before the recent changes. Seems to be a problem on that machine caused by the 3.9->3.12 update...
  • Running poetry update solves the pyerfa, but I get an error from matplotlib instead. Attempting to just update this pyerfa, it now fails on synphot. I think I broke my installation here 😑

@teutoburg
Copy link
Contributor Author

However, on my Win10 machine that I recently upgraded from Py3.9 to 3.12, I get weird errors:

  • Simple poetry install (from lock file) fails around pyerfa. Both on main and on 6b2a07c, before the recent changes. Seems to be a problem on that machine caused by the 3.9->3.12 update...
  • Running poetry update solves the pyerfa, but I get an error from matplotlib instead. Attempting to just update this pyerfa, it now fails on synphot. I think I broke my installation here 😑

This should be solved by #450

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dependencies Related to or updating any dependencies
Projects
Status: ✅ Done
Development

Successfully merging this pull request may close these issues.

2 participants