
feat/reuse blobs #22

Open · wants to merge 5 commits into main
Conversation

@benedikt-schaber commented

Current behavior

When the file/symlink is not found in snapshots, the blob is downloaded regardless of whether it is already present in the blobs directory.

New behavior

When the file/symlink is not found in snapshots, the etag is used to check for the corresponding blob; if it is already present, the existing blob is reused instead of being re-downloaded.
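A minimal sketch of that check, assuming blobs are stored under their etag (the helper name and layout here are illustrative, not the PR's actual code):

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// Given the etag from the HEAD request, return the path of an
/// already-downloaded blob, if one exists (hypothetical helper).
fn existing_blob(blobs_dir: &Path, etag: &str) -> Option<PathBuf> {
    let candidate = blobs_dir.join(etag);
    if candidate.is_file() {
        Some(candidate)
    } else {
        None
    }
}

fn main() {
    let blobs_dir = std::env::temp_dir().join("hf-hub-blob-check-demo");
    fs::create_dir_all(&blobs_dir).unwrap();
    fs::write(blobs_dir.join("abc123"), b"blob data").unwrap();

    // Blob already present: reuse it instead of re-downloading.
    assert!(existing_blob(&blobs_dir, "abc123").is_some());
    // Unknown etag: this is the case where a download would run.
    assert!(existing_blob(&blobs_dir, "ffff00").is_none());
}
```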

Potential Issues

  • The Python library gives users the option to force the download of the file. If this request is accepted, we should probably do the same; I am, however, unsure how such options should be introduced in this repository.
  • Probably not really an issue, and it existed before this change: if symlinking generally works on Windows but fails once, we move the file and end up with invalid symlinks (if I am not mistaken; not 100% sure because it is Windows, but that is my understanding of CreateSymbolicLinkW and std::os::windows::fs::symlink_file).

@Narsil (Collaborator) commented Sep 8, 2023

We cannot move things like this.

The name of the method is download. It should therefore always download something.

get will already use the cache when possible and might be what you are looking for.

@benedikt-schaber (Author) commented Sep 8, 2023

Edit: sorry, I misunderstood which get you meant for a second; more detail in the second comment.
@Narsil get will not, and with its current arguments cannot, check whether the blob is already present, since we need the etag for that, which we receive from our HEAD request.
Further, download states in its doc string:
/// Downloads a remote file (if not already present) into the cache directory
This PR thus adheres to that guideline.

@benedikt-schaber (Author) commented Sep 8, 2023

As a reference, the Python library huggingface_hub does it in hf_hub_download, from line 1371:

```python
if os.path.exists(blob_path) and not force_download:
    # we have the blob already, but not the pointer
    if local_dir is not None:  # to local dir
        return _to_local_dir(blob_path, local_dir, relative_filename, use_symlinks=local_dir_use_symlinks)
    else:  # or in snapshot cache
        _create_symlink(blob_path, pointer_path, new_blob=False)
        return pointer_path
```

Although I realize I misunderstood which get you meant. I could move the check to the Api's get method; however, we would already have to query the metadata there, and, if we do not want to re-query it in download, it would have to be passed along, changing download's signature.

@Narsil (Collaborator) commented Sep 8, 2023

@benedikt-schaber

An etag check + download if they do not match is something we can add (basically free updating).
There are a few considerations I've been thinking about, though:

  1. Usually models don't change, and forcing network calls on each model load is excruciatingly slow (especially if the Hub has any form of hiccup, or if you're anywhere outside the US). This is far from ideal when most of the time it doesn't do anything.
    For instance, currently with candle we're able to load entire CPU models in ~10 ms, while a single call from Europe has a 50 ms ping. For serverless/lazy local inference that's definitely not great.

  2. When models do update, you may want to update instantly, but you may be fine just starting the update on a different thread without blocking the cached load, and only using the new model later once the cache has been fully updated.

Architecturally speaking, copying huggingface_hub throughout is not desirable. There's a lot of legacy in huggingface_hub that we can avoid here (the design space is clearer now, and we don't depend on old behavior). For instance, force_download is imho something we shouldn't have; different functions should do what you want.

download is basically force_download.
get is basically an optional download.

The current goals are:

  • Don't, under any circumstances, mess up the existing cache.
  • Reuse existing cache whenever possible.
  • Focus on a minimum subset, not rebuilding the entire thing.

Do you mind sharing a bit more about the problem?

I don't think this is about incremental downloads, as the original issue states. And maybe think a bit about 1. and 2. and how your current setup fits in.

@Wauplin FYI

@benedikt-schaber (Author) commented

@Narsil
Ah, I think I should have made it clearer what this PR does and provided an example, since there seems to be a misunderstanding:

What this PR is not: if the file is already in the cache (snapshot), we do exactly what we did before the PR. We do NOT check whether there is a newer version. To be more precise, this PR has nothing to do with detecting updates.

What this PR is about:

Here is an example. Let's say I have my dataset on branch main, my commit is aaaaaa, and I pull my dataset from there. Sadly, a couple of labels in our dataset were wrong; I fix this and create a new commit bbbbbb. Out of 10s of GB, only a single file changed, the one containing the labels. I decide I want to use commit bbbbbb instead of aaaaaa in my code.

Currently, without the PR, this happens: no snapshot exists for bbbbbb yet, since it is the first time I download the commit. We just check whether the files are in the snapshot, and if not, download them (overwriting the exact same data that was already in the blobs folder). Only 1 file changed, but I now sit idle waiting to download 10s of GBs I already had.

With this PR: we do the same snapshot check, which fails. But now, for every file, we also check whether it is already in the blobs folder (by etag, i.e. its hash). If we realize that we have already downloaded the file (because aaaaaa already used it, in our case), we just create another symlink to it instead of first overwriting it and then creating the symlink.
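The reuse path described above can be sketched like this (a standalone sketch; the cache layout and the helper name are illustrative, not the PR's actual code):

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Sketch of the new behavior: if the blob (named by its etag) already
/// exists, only create a pointer to it; otherwise the download would run.
/// Returns true when the existing blob was reused.
fn link_or_download(blob_path: &Path, pointer: &Path) -> io::Result<bool> {
    if blob_path.exists() {
        #[cfg(unix)]
        if !pointer.exists() {
            std::os::unix::fs::symlink(blob_path, pointer)?;
        }
        Ok(true) // reused: no network traffic at all
    } else {
        // download_blob(blob_path)? would run here instead
        Ok(false)
    }
}

fn main() -> io::Result<()> {
    let root = std::env::temp_dir().join("hf-hub-reuse-demo");
    let blobs = root.join("blobs");
    let snapshot_b = root.join("snapshots/bbbbbb");
    fs::create_dir_all(&blobs)?;
    fs::create_dir_all(&snapshot_b)?;

    // The blob was already downloaded for commit aaaaaa ...
    let blob = blobs.join("etag-of-unchanged-file");
    fs::write(&blob, b"unchanged data")?;

    // ... so commit bbbbbb, which has no snapshot yet, reuses it.
    assert!(link_or_download(&blob, &snapshot_b.join("data.bin"))?);
    Ok(())
}
```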

@benedikt-schaber (Author) commented

@Narsil Oh, one more important note (which I realize probably led to the misunderstanding): this is not the PR I wanted to create for #18; I have not yet gotten around to creating that PR.

@Narsil (Collaborator) commented Sep 8, 2023

Illuminating!

Then perfect. Do you think we could have a smallish test for this behavior?

@benedikt-schaber (Author) commented

Yes, I will write one.
Coming back to download/force_download:

This PR currently would create a difference between download and force_download. In case a user's cache gets corrupted, we should probably give them the ability to redownload everything, i.e. overwrite a file in blobs even though one with the correct etag is already there.

If we want to keep download as force_download, we could introduce a new internal download method that additionally takes what we need of the metadata as input (e.g. blob path, total length). We could then fetch the metadata in get (if necessary) and use the internal download method from there. Similarly, we could fetch the metadata in download, but then just pass it along to the internal download method without checking.

get would thus no longer use download, but the new internal download function instead, and we could retain download's signature and force_download behaviour without compromising with an unnecessary additional HEAD fetch for the metadata when using get.
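A sketch of that split, assuming a hypothetical download_internal that receives already-fetched metadata (all names here are illustrative, and the network fetch is simulated with a counter to show that only one HEAD request is needed):

```rust
use std::cell::Cell;
use std::path::{Path, PathBuf};

/// Metadata normally obtained via the HEAD request (illustrative fields).
struct Metadata {
    etag: String,
}

/// Hypothetical internal download that takes already-fetched metadata,
/// so callers never trigger a second HEAD request.
fn download_internal(meta: &Metadata, blobs_dir: &Path) -> PathBuf {
    blobs_dir.join(&meta.etag) // real code would stream the file here
}

fn main() {
    let head_requests = Cell::new(0u32);
    let fetch_metadata = || {
        head_requests.set(head_requests.get() + 1);
        Metadata { etag: "abc123".into() }
    };

    let blobs = PathBuf::from("/tmp/blobs");

    // `get`: one metadata fetch, then the internal download if needed.
    let meta = fetch_metadata();
    let blob = download_internal(&meta, &blobs);
    assert_eq!(blob, blobs.join("abc123"));
    // Only a single HEAD request was issued for the whole `get` call.
    assert_eq!(head_requests.get(), 1);
}
```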

Testing and differentiating between get and download relating to blob corruption and reuse.
Unix only
@Narsil (Collaborator) commented Sep 8, 2023

> If we want to keep download as force_download, we could introduce a new internal download method that additionally takes what we need of the metadata as input (e.g. blob path, total length).

I think it's ok to expose this.

  • Cache.get: use the cache, no network.
  • Api.get: should end up becoming: use the cache first, update in a separate thread if new data is available.
  • Api.get_latest: the current huggingface_hub default, just like you suggest (check the sha, download iff the sha has changed, but be sure to end up running the latest file).
  • Api.download: force download, disregarding the etag (current behavior).
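A rough Rust sketch of that proposed surface (stub bodies only, to pin down the intended semantics; none of this is the actual implementation):

```rust
#![allow(dead_code)]
use std::path::PathBuf;

struct Cache;
struct Api;

impl Cache {
    /// Cache-only lookup: never touches the network.
    fn get(&self, _filename: &str) -> Option<PathBuf> {
        None // stub: real code would resolve the snapshot symlink
    }
}

impl Api {
    /// Use the cache first; refresh in a background thread if newer
    /// data is available.
    fn get(&self, filename: &str) -> PathBuf {
        PathBuf::from(filename) // stub
    }
    /// Check the sha/etag, download iff it changed, and always return
    /// the latest file (current huggingface_hub default).
    fn get_latest(&self, filename: &str) -> PathBuf {
        PathBuf::from(filename) // stub
    }
    /// Force download, disregarding the etag (current behavior).
    fn download(&self, filename: &str) -> PathBuf {
        PathBuf::from(filename) // stub
    }
}

fn main() {
    // Cold cache: the cache-only lookup finds nothing, no network call.
    assert!(Cache.get("model.safetensors").is_none());
    // The forced path always yields a (stubbed) download result.
    assert_eq!(Api.download("model.safetensors"), PathBuf::from("model.safetensors"));
}
```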

I'm very open to other naming ideas if you feel there are alternatives that make the distinctions clearer.
Let's feel free to make breaking changes here too; the lib is still very young, and it's better to break now than to create legacy too fast.

(For instance, there are things to be done around Api.repo(xxx) owning a CacheRepo vs what currently exists. I just haven't gotten around to it, since it would require splitting the Api struct into the pure client aspect and the cache aspect, so that ApiRepo could own ONLY a CacheRepo, and not a Cache, thereby removing the potential for out-of-sync data.)

@benedikt-schaber (Author) commented

API Changes

I have implemented the functions, only for the sync version for now.
I will add the tokio equivalents once we have gotten these to an acceptable state.
They are named:

  • download_blob (currently internal, can expose it): used by the other functions; downloads/overwrites a blob. Now used by download.
  • create_blob_pointer (currently internal, temporary name, looking for feedback here): similar to download, but only downloads the blob (using download_blob) if necessary; otherwise it only creates the pointer to it. get now uses this function if it cannot find the symlink.

Testing

I have also created a unit test that tests the behavior of get and download (Unix only, since on Windows there are no symlinks, so this entire PR becomes rather meaningless there).

Naming

Thoughts on a proper name for create_blob_pointer, and why I did not choose any of the alternatives I could think of:

  • get_blob: it not only gets the blob, but also creates the symlink to it/moves it.
  • get_blob_pointer: sounds like it either gets the blob pointer or, if not present, creates it. But it always creates the pointer (more or less) and only sometimes downloads the blob.

It is basically get without the snapshot cache lookup first. It could thus be named something alluding to that, or perhaps just be integrated into the get function itself.

Potential Issue that existed before this PR, but is related

Considering corruption: in symlink_or_rename we do nothing if the file is already present at the destination. This works fine on Linux. But on Windows, if download truly is supposed to force the download, then, if we have a corrupted file, would it not already be present at the destination? So even though we would have just gone through the effort of downloading an uncorrupted version, in the end the corrupted version would still be used?
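One possible direction, sketched with a hypothetical helper (not the actual symlink_or_rename): remove an existing destination first, so a forced download actually replaces a corrupted file:

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical rename fallback that, unlike the behavior described
/// above, replaces an existing (possibly corrupted) destination
/// instead of silently keeping it.
fn rename_replacing(src: &Path, dst: &Path) -> io::Result<()> {
    if dst.exists() {
        // A force download should win over whatever is already there.
        fs::remove_file(dst)?;
    }
    fs::rename(src, dst)
}

fn main() -> io::Result<()> {
    let dir = std::env::temp_dir().join("hf-hub-rename-demo");
    fs::create_dir_all(&dir)?;
    let src = dir.join("fresh");
    let dst = dir.join("dest");
    fs::write(&src, b"fresh download")?;
    fs::write(&dst, b"corrupted!!")?; // pre-existing corrupted file

    rename_replacing(&src, &dst)?;
    // The fresh download replaced the corrupted destination.
    assert_eq!(fs::read(&dst)?, b"fresh download".to_vec());
    Ok(())
}
```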
