
feat/reuse blobs #22

Open · wants to merge 5 commits into main
Conversation

@benedikt-schaber commented

Current behavior

When the file/symlink is not found in snapshots, the blob is downloaded regardless of whether it is already present in the blobs directory.

New behavior

When the file/symlink is not found in snapshots, the etag is used to check for the corresponding blob; if it is already present, the existing blob is reused instead of being re-downloaded.
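A minimal sketch of that check, assuming blobs are stored under their etag (the helper name and layout here are illustrative, not the PR's actual code):

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// Given the etag from the HEAD request, return the path of an
/// already-downloaded blob, if one exists (hypothetical helper).
fn existing_blob(blobs_dir: &Path, etag: &str) -> Option<PathBuf> {
    let candidate = blobs_dir.join(etag);
    if candidate.is_file() {
        Some(candidate)
    } else {
        None
    }
}

fn main() {
    let blobs_dir = std::env::temp_dir().join("hf-hub-blob-check-demo");
    fs::create_dir_all(&blobs_dir).unwrap();
    fs::write(blobs_dir.join("abc123"), b"blob data").unwrap();

    // Blob already present: reuse it instead of re-downloading.
    assert!(existing_blob(&blobs_dir, "abc123").is_some());
    // Unknown etag: this is the case where a download would run.
    assert!(existing_blob(&blobs_dir, "ffff00").is_none());
}
```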

Potential Issues

  • The Python library gives users the option to force the download of the file. If this request is accepted, we should probably do the same; I am, however, unsure how such options should be introduced in this repository.
  • Probably not really an issue, and it existed before this change: if symlinking generally works on Windows but fails once, we move the file and end up with invalid symlinks (if I am not mistaken; not 100% sure because it is Windows, but that is my understanding of CreateSymbolicLinkW and std::os::windows::fs::symlink_file).

@Narsil (Collaborator) commented Sep 8, 2023

We cannot move things like this.

The name of the method is download. It should therefore always download something.

get will already use the cache when possible and might be what you are looking for.

@benedikt-schaber (Author) commented Sep 8, 2023

Edit: sorry, I misunderstood which get you meant for a second; more detail in the second comment.
@Narsil get will not, and with its current arguments cannot, check whether the blob is already present, since we need the etag for that, which we receive from our HEAD request.
Further, download states in its doc string:
/// Downloads a remote file (if not already present) into the cache directory
This PR thus adheres to that guideline.

@benedikt-schaber (Author) commented Sep 8, 2023

As a reference, the Python library huggingface_hub does it in hf_hub_download, from line 1371:

```python
if os.path.exists(blob_path) and not force_download:
    # we have the blob already, but not the pointer
    if local_dir is not None:  # to local dir
        return _to_local_dir(blob_path, local_dir, relative_filename, use_symlinks=local_dir_use_symlinks)
    else:  # or in snapshot cache
        _create_symlink(blob_path, pointer_path, new_blob=False)
        return pointer_path
```

Although I realize I misunderstood which get you meant. I could move the check to the Api's get method; however, we would already have to query the metadata there, and, if we do not want to re-query it in download, it would have to be passed along, changing download's signature.

@Narsil (Collaborator) commented Sep 8, 2023

@benedikt-schaber

An etag check + download if they do not match is something we can add (basically free updating).
There are a few considerations I've been thinking about, though:

  1. Usually models don't change, and forcing network calls on each model load is excruciatingly slow (especially if the Hub has any form of hiccup, or if you're anywhere outside the US). This is far from ideal when most of the time it doesn't do anything.
    For instance, currently with candle we're able to load entire CPU models in ~10 ms, while a single call from Europe has a 50 ms ping. For serverless/lazy local inference that's definitely not great.

  2. When models do update, you may want to update instantly, but you may be fine just starting the update on a different thread without blocking the cached load, and only using the new model later once the cache has been fully updated.

Architecturally speaking, copying huggingface_hub throughout is not desirable. There's a lot of legacy in huggingface_hub that we can avoid here (the design space is clearer now, and we don't depend on old behavior). For instance, force_download is imho something we shouldn't have; different functions should do what you want.

download is basically force_download.
get is basically an optional download.

The current goals are:

  • Don't, under any circumstances, mess up the existing cache.
  • Reuse existing cache whenever possible.
  • Focus on a minimum subset, not rebuilding the entire thing.

Do you mind sharing a bit more about the problem?

I don't think this is about incremental downloads, as the original issue states. And maybe think a bit about 1. and 2. and how your current setup fits in.

@Wauplin FYI

@benedikt-schaber (Author) commented

@Narsil
Ah, I think I should have made it clearer what this PR does and provided an example, since there seems to be a misunderstanding:

What this PR is not: if the file is already in the cache (snapshot), we do exactly what we did before the PR. We do NOT check whether there is a newer version. To be more precise, this PR has nothing to do with detecting updates.

What this PR is about:

Here is an example. Let's say I have my dataset on branch main, my commit is aaaaaa, and I pull my dataset from there. Sadly, a couple of labels in our dataset were wrong; I fix this and create a new commit bbbbbb. Out of 10s of GB, only a single file changed, the one containing the labels. I decide I want to use commit bbbbbb instead of aaaaaa in my code.

Currently, without the PR, this happens: no snapshot exists for bbbbbb yet, since it is the first time I download the commit. We just check whether the files are in the snapshot, and if not, download them (overwriting the exact same data that was already in the blobs folder). Only 1 file changed, but I now sit idle waiting to download 10s of GBs I already had.

With this PR: we do the same snapshot check, which fails. But now, for every file, we also check whether it is already in the blobs folder (by etag, i.e. its hash). If we realize that we have already downloaded the file (because aaaaaa already used it, in our case), we just create another symlink to it instead of first overwriting it and then creating the symlink.
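The reuse path described above can be sketched like this (a standalone sketch; the cache layout and the helper name are illustrative, not the PR's actual code):

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Sketch of the new behavior: if the blob (named by its etag) already
/// exists, only create a pointer to it; otherwise the download would run.
/// Returns true when the existing blob was reused.
fn link_or_download(blob_path: &Path, pointer: &Path) -> io::Result<bool> {
    if blob_path.exists() {
        #[cfg(unix)]
        if !pointer.exists() {
            std::os::unix::fs::symlink(blob_path, pointer)?;
        }
        Ok(true) // reused: no network traffic at all
    } else {
        // download_blob(blob_path)? would run here instead
        Ok(false)
    }
}

fn main() -> io::Result<()> {
    let root = std::env::temp_dir().join("hf-hub-reuse-demo");
    let blobs = root.join("blobs");
    let snapshot_b = root.join("snapshots/bbbbbb");
    fs::create_dir_all(&blobs)?;
    fs::create_dir_all(&snapshot_b)?;

    // The blob was already downloaded for commit aaaaaa ...
    let blob = blobs.join("etag-of-unchanged-file");
    fs::write(&blob, b"unchanged data")?;

    // ... so commit bbbbbb, which has no snapshot yet, reuses it.
    assert!(link_or_download(&blob, &snapshot_b.join("data.bin"))?);
    Ok(())
}
```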

@benedikt-schaber (Author) commented

@Narsil Oh, one more important note (which I realize probably led to the misunderstanding): this is not the PR I wanted to create for #18; I have not yet gotten around to creating that PR.

@Narsil (Collaborator) commented Sep 8, 2023

Illuminating!

Then perfect. Do you think we could have a smallish test for this behavior?

@benedikt-schaber (Author) commented

Yes, I will write one.
Coming back to download/force_download:

This PR currently would create a difference between download and force_download. In case a user's cache gets corrupted, we should probably give them the ability to redownload everything, i.e. overwrite a file in blobs even though one with the correct etag is already there.

If we want to keep download as force_download, we could introduce a new internal download method that additionally takes what we need of the metadata as input (e.g. blob path, total length). We could then fetch the metadata in get (if necessary) and use the internal download method from there. Similarly, we could fetch the metadata in download, but then just pass it along to the internal download method without checking.

get would thus no longer use download, but the new internal download function instead, and we could retain download's signature and force_download behaviour without compromising with an unnecessary additional HEAD fetch for the metadata when using get.
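A sketch of that split, assuming a hypothetical download_internal that receives already-fetched metadata (all names here are illustrative, and the network fetch is simulated with a counter to show that only one HEAD request is needed):

```rust
use std::cell::Cell;
use std::path::{Path, PathBuf};

/// Metadata normally obtained via the HEAD request (illustrative fields).
struct Metadata {
    etag: String,
}

/// Hypothetical internal download that takes already-fetched metadata,
/// so callers never trigger a second HEAD request.
fn download_internal(meta: &Metadata, blobs_dir: &Path) -> PathBuf {
    blobs_dir.join(&meta.etag) // real code would stream the file here
}

fn main() {
    let head_requests = Cell::new(0u32);
    let fetch_metadata = || {
        head_requests.set(head_requests.get() + 1);
        Metadata { etag: "abc123".into() }
    };

    let blobs = PathBuf::from("/tmp/blobs");

    // `get`: one metadata fetch, then the internal download if needed.
    let meta = fetch_metadata();
    let blob = download_internal(&meta, &blobs);
    assert_eq!(blob, blobs.join("abc123"));
    // Only a single HEAD request was issued for the whole `get` call.
    assert_eq!(head_requests.get(), 1);
}
```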

Testing and differentiating between get and download relating to blob corruption and reuse.
Unix only
@Narsil (Collaborator) commented Sep 8, 2023

> If we want to keep download as force_download, we could introduce a new internal download method that additionally takes what we need of the metadata as input (e.g. blob path, total length).

I think it's ok to expose this.

  • Cache.get: use the cache, no network.
  • Api.get: should end up becoming: use the cache first, update in a separate thread if new data is available.
  • Api.get_latest: the current huggingface_hub default, just like you suggest (check the sha, download iff the sha has changed, but be sure to end up running the latest file).
  • Api.download: force download, disregarding the etag (current behavior).
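A rough Rust sketch of that proposed surface (stub bodies only, to pin down the intended semantics; none of this is the actual implementation):

```rust
#![allow(dead_code)]
use std::path::PathBuf;

struct Cache;
struct Api;

impl Cache {
    /// Cache-only lookup: never touches the network.
    fn get(&self, _filename: &str) -> Option<PathBuf> {
        None // stub: real code would resolve the snapshot symlink
    }
}

impl Api {
    /// Use the cache first; refresh in a background thread if newer
    /// data is available.
    fn get(&self, filename: &str) -> PathBuf {
        PathBuf::from(filename) // stub
    }
    /// Check the sha/etag, download iff it changed, and always return
    /// the latest file (current huggingface_hub default).
    fn get_latest(&self, filename: &str) -> PathBuf {
        PathBuf::from(filename) // stub
    }
    /// Force download, disregarding the etag (current behavior).
    fn download(&self, filename: &str) -> PathBuf {
        PathBuf::from(filename) // stub
    }
}

fn main() {
    // Cold cache: the cache-only lookup finds nothing, no network call.
    assert!(Cache.get("model.safetensors").is_none());
    // The forced path always yields a (stubbed) download result.
    assert_eq!(Api.download("model.safetensors"), PathBuf::from("model.safetensors"));
}
```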

I'm very open to other naming ideas if you feel there are alternatives that make the distinctions clearer.
Let's feel free to make breaking changes here too; the lib is still very young, and it's better to break now than to create legacy too fast.

(For instance, there are things to be done around Api.repo(xxx) owning a CacheRepo vs what currently exists. I just haven't gotten around to it, since it would require splitting the Api struct into the pure client aspect and the cache aspect, so that ApiRepo could own ONLY a CacheRepo, and not a Cache, thereby removing the potential for out-of-sync data.)

@benedikt-schaber (Author) commented

API Changes

I have implemented the functions, only for the sync version for now.
I will add the tokio equivalents once we have gotten these to an acceptable state.
They are named:

  • download_blob (currently internal, can expose it): used by the other functions; downloads/overwrites a blob. Now used by download.
  • create_blob_pointer (currently internal, temporary name, looking for feedback here): similar to download, but only downloads the blob (using download_blob) if necessary; otherwise it only creates the pointer to it. get now uses this function if it cannot find the symlink.

Testing

I have also created a unit test that tests the behavior of get and download (Unix only, since on Windows there are no symlinks, so this entire PR becomes rather meaningless there).

Naming

Thoughts on a proper name for create_blob_pointer, and why I did not choose any of the alternatives I could think of:

  • get_blob: it not only gets the blob, but also creates the symlink to it/moves it.
  • get_blob_pointer: sounds like it either gets the blob pointer or, if not present, creates it. But it always creates the pointer (more or less) and only sometimes downloads the blob.

It is basically get without the snapshot cache lookup first. It could thus be named something alluding to that, or perhaps just be integrated into the get function itself.

Potential Issue that existed before this PR, but is related

Considering corruption: in symlink_or_rename we do nothing if the file is already present at the destination. This works fine on Linux. But on Windows, if download truly is supposed to force the download, then, if we have a corrupted file, would it not already be present at the destination? So even though we would have just gone through the effort of downloading an uncorrupted version, in the end the corrupted version would still be used?
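One possible direction, sketched with a hypothetical helper (not the actual symlink_or_rename): remove an existing destination first, so a forced download actually replaces a corrupted file:

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical rename fallback that, unlike the behavior described
/// above, replaces an existing (possibly corrupted) destination
/// instead of silently keeping it.
fn rename_replacing(src: &Path, dst: &Path) -> io::Result<()> {
    if dst.exists() {
        // A force download should win over whatever is already there.
        fs::remove_file(dst)?;
    }
    fs::rename(src, dst)
}

fn main() -> io::Result<()> {
    let dir = std::env::temp_dir().join("hf-hub-rename-demo");
    fs::create_dir_all(&dir)?;
    let src = dir.join("fresh");
    let dst = dir.join("dest");
    fs::write(&src, b"fresh download")?;
    fs::write(&dst, b"corrupted!!")?; // pre-existing corrupted file

    rename_replacing(&src, &dst)?;
    // The fresh download replaced the corrupted destination.
    assert_eq!(fs::read(&dst)?, b"fresh download".to_vec());
    Ok(())
}
```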
