-
Notifications
You must be signed in to change notification settings - Fork 57
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat/reuse blobs #22
base: main
Are you sure you want to change the base?
feat/reuse blobs #22
Conversation
We cannot move things like this. The name of the method is
|
Edit: Sorry, misunderstood which get you meant for a second, more detail in the second comment. |
As a reference the python library huggingface hub does it here:
in hf_hub_download from line 1371. Although I realized I misunderstood which get you meant. I could move the check to the Api's get method, we would already have to query the metadata then however, and, if we do not want to requery it in download, it would have to be passed along changing download's signature. |
Having etag check + download if they do not match we can add (basically free updating).
Architecturally speaking, copying
The current goals are:
Do you mind sharing a bit more the problem ? I don't think this is about @Wauplin FYI |
@Narsil What this PR is not: If the file is already in the cache (snapshot), we do exactly the same thing that we did before the PR. We do NOT check if there is a newer version. Or to be more precise this PR does not have anything to do with detecting updates. What this PR is about:Here is an example, let's say I have my dataset on branch main, my commit is |
Illuminating ! Then perfect. |
Yes, I will write one. This PR currently would create a difference between a download and a force_download. If we want to keep download as force_download, we could introduce a new internal download method, that additionally takes what we need of the metadata as input (e.g. blob path, total length). get would thus no longer use download, but instead the new internal download function and we could retain downloads signature and force_download behaviour without compromising to doing an unecessary and additional head fetch for the metadata when using get. |
Testing and differentiating between get and download relating to blob corruption and reuse. Unix only
I think it's ok to expose this. -> I'm very open to other naming ideas if you feel there are alternatives which makes the distinctions clearer. (For instance There are things to be done around |
API ChangesI have implemented the functions, only for the sync version for now. TestingI have also created a unit test that tests the behavior of get and download (unix only, since on windows there are no symlinks, so this entire PR becomes rather meaningless). NamingThoughts on a proper name for create_blob_pointer and why I did not choose any that I could think of:
Basically also just get without the snapshot cache lookup first. Could be thus named something alluding to that or could also perhaps be just integrated into the get function itself. Potential Issue that existed before this PR, but is relatedConsidering corruption. |
Current behavior
When the file/symlink is not found in snapshots the blob is downloaded regardless of whether or not it is already present.
New behavior
When the file/symlink is not found in snapshots the etag is used to check for the corresponding blob. It is used if already present.
Potential Issues