Optimize `HfFileSystem.find` #1443

mariosasko · 2023-04-18T17:22:48Z

Similarly to S3FileSystem.find, optimize HfFileSystem.find to make it faster for Hub repositories with many files. The speed-up is achieved by fetching all the "tree" entries simultaneously instead of recursively for each directory, as done in the default implementation.
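To make the difference concrete, here is a minimal sketch that counts requests against an in-memory stand-in for the `/tree` endpoint. It is not the code from this PR: `FakeTreeApi`, `find_per_directory`, and `find_flat` are hypothetical names, and pagination is ignored.

```python
from typing import Dict, List


class FakeTreeApi:
    """Hypothetical stand-in for the /tree endpoint that counts requests."""

    def __init__(self, files: List[str]) -> None:
        self.files = sorted(files)
        self.calls = 0

    def list_tree(self, path: str = "", recursive: bool = False) -> List[Dict[str, str]]:
        self.calls += 1
        prefix = f"{path}/" if path else ""
        entries: Dict[str, str] = {}
        for file_path in self.files:
            if not file_path.startswith(prefix):
                continue
            parts = file_path[len(prefix):].split("/")
            if recursive:
                # Emit every intermediate directory, then the file itself.
                for i in range(1, len(parts)):
                    entries[prefix + "/".join(parts[:i])] = "directory"
                entries[file_path] = "file"
            else:
                # Emit only the direct child of `path`.
                child = prefix + parts[0]
                entries[child] = "file" if len(parts) == 1 else "directory"
        return [{"path": p, "type": t} for p, t in sorted(entries.items())]


def find_per_directory(api: FakeTreeApi, path: str = "") -> List[str]:
    """Default-style walk: one request per directory."""
    found: List[str] = []
    for entry in api.list_tree(path):
        if entry["type"] == "directory":
            found.extend(find_per_directory(api, entry["path"]))
        else:
            found.append(entry["path"])
    return sorted(found)


def find_flat(api: FakeTreeApi, path: str = "") -> List[str]:
    """Optimized variant: one recursive request, then keep only the files."""
    entries = api.list_tree(path, recursive=True)
    return sorted(e["path"] for e in entries if e["type"] == "file")


files = ["README.md", "data/python/a.parquet", "data/python/b.parquet", "data/rust/a.parquet"]
walk_api, flat_api = FakeTreeApi(files), FakeTreeApi(files)
walk_result = find_per_directory(walk_api)
flat_result = find_flat(flat_api)
print(walk_api.calls, flat_api.calls)
```

Both strategies return the same files; the per-directory walk issues one request for each directory it visits, while the flat variant issues a single recursive request.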

These changes make fetching the datasets/bigcode/the-stack-dedup data files 4x faster (30 sec. vs. 120 sec.) and the difference should become even bigger when we increase the number of returned files per page when paginating the /tree endpoint's response.

(Also important for huggingface/datasets#5537)

HuggingFaceDocBuilderDev · 2023-04-18T17:26:48Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

codecov · 2023-04-18T17:39:24Z

Codecov Report

Patch coverage: 83.78% and project coverage change: +3.42 🎉

Comparison is base (66c3ff1) 78.93% compared to head (c0e9b6e) 82.36%.

❗ Current head c0e9b6e differs from pull request most recent head 61aa5ef. Consider uploading reports for the commit 61aa5ef to get more accurate results

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1443      +/-   ##
==========================================
+ Coverage   78.93%   82.36%   +3.42%     
==========================================
  Files          53       53              
  Lines        5641     5643       +2     
==========================================
+ Hits         4453     4648     +195     
+ Misses       1188      995     -193

Impacted Files	Coverage Δ
src/huggingface_hub/hf_file_system.py	`88.12% <83.78%> (-2.43%)`	⬇️

... and 8 files with indirect coverage changes

src/huggingface_hub/hf_file_system.py

mariosasko · 2023-05-09T18:04:31Z

Let's continue this discussion (internal) here. As explained in the follow-up comment, we cannot use list_files_info as we also need directories, and we don't always fetch the files recursively, so inferring them from the returned file paths is also not possible. So adding a new method to HfApi seems the best option (e.g., get_tree). @Wauplin WDYT?
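A small sketch of why directories cannot be inferred in the non-recursive case, assuming a listing that returns file paths only. `infer_directories` and the sample paths are hypothetical and are not part of `huggingface_hub`.

```python
from pathlib import PurePosixPath
from typing import List, Set


def infer_directories(file_paths: List[str]) -> Set[str]:
    """Collect every parent directory that appears in the given file paths."""
    dirs: Set[str] = set()
    for path in file_paths:
        for parent in PurePosixPath(path).parents:
            if str(parent) != ".":
                dirs.add(str(parent))
    return dirs


all_files = ["data/README.md", "data/python/a.parquet", "data/rust/b.parquet"]

# Recursive file listing: every directory shows up as a parent of some file.
recursive_dirs = infer_directories(all_files)

# Non-recursive file listing of "data": only files directly under "data" are
# returned, so the subdirectories "data/python" and "data/rust" are invisible.
top_level_files = [f for f in all_files if str(PurePosixPath(f).parent) == "data"]
non_recursive_dirs = infer_directories(top_level_files)

print(sorted(recursive_dirs), sorted(non_recursive_dirs))
```

A listing that returns directory entries explicitly avoids this gap in both modes.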

Wauplin · 2023-05-16T14:24:58Z

OK, sorry for the long delay in answering. Yes, indeed, I get the point of having directories in addition to the files. What do you think of something like this?

from dataclasses import dataclass
from typing import Iterable, Literal, Optional

@dataclass
class TreeEntry:
    path: str
    oid: str
    type: Literal["file", "folder"]
    # The API returns `0` for folders, but I feel that's misleading.
    # What about `None`, so that users don't think a folder is empty when size is 0?
    size: Optional[int]

# Proposed as a new `HfApi` method, hence the `self` parameter.
def get_repo_tree(
    self,
    path_in_repo: str,
    *,
    recursive: bool = False,
    repo_id: str,
    repo_type: Optional[str] = None,
    revision: Optional[str] = None,
    token: Optional[str] = None,
) -> Iterable[TreeEntry]:
    ...

Just brainstorming here, so happy to change. Once we have decided, it should be straightforward to implement. I wouldn't add too many details to the TreeEntry object, as the aim is really to list the folders/files. If a user really needs more details (last commit, last modified, security scan, ...), list_repo_files is more appropriate. WDYT @mariosasko?

mariosasko · 2023-05-16T19:10:18Z

Regarding the directory size, I prefer the value of 0 to be consistent with fsspec (and the API). I agree with the rest of the design.

Wauplin · 2023-05-17T05:09:10Z

Fine with setting the directory size to 0 then :)
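As a rough illustration of the agreed design (folders report a size of 0), the sketch below maps tree entries to fsspec-style info dicts with `name`, `size`, and `type` keys. It is not the implementation from this PR or from #1809: `to_info`, this standalone `find`, the `root` prefix, and the sample `oid` values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, List, Literal, Union


@dataclass
class TreeEntry:
    path: str
    oid: str
    type: Literal["file", "folder"]
    size: int  # 0 for folders, consistent with the API and fsspec


def to_info(entry: TreeEntry, root: str) -> Dict[str, Any]:
    """Map a tree entry to an fsspec-style info dict."""
    return {
        "name": f"{root}/{entry.path}",
        "size": entry.size,
        "type": "directory" if entry.type == "folder" else "file",
    }


def find(
    entries: Iterable[TreeEntry],
    root: str,
    withdirs: bool = False,
    detail: bool = False,
) -> Union[List[str], Dict[str, Dict[str, Any]]]:
    """Return sorted names, or a name -> info mapping when detail=True."""
    infos: Dict[str, Dict[str, Any]] = {}
    for entry in entries:
        if entry.type == "folder" and not withdirs:
            continue  # directories are only reported on request
        info = to_info(entry, root)
        infos[info["name"]] = info
    names = sorted(infos)
    return {name: infos[name] for name in names} if detail else names


# Hypothetical sample data; the oids are placeholders.
entries = [
    TreeEntry("data", "oid-1", "folder", 0),
    TreeEntry("data/train.parquet", "oid-2", "file", 1024),
    TreeEntry("README.md", "oid-3", "file", 12),
]
root = "datasets/user/repo"
file_names = find(entries, root)
detailed = find(entries, root, withdirs=True, detail=True)
print(file_names)
print(detailed[f"{root}/data"])
```

Keeping the folder size at 0 means the entry can be passed through unchanged, without special-casing `None` when building the info dict.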

mariosasko · 2023-11-08T19:23:21Z

This PR is pretty old, so I'm closing it in favor of #1809

mariosasko added 8 commits April 7, 2023 14:22

Initial commit

a143c84

Merge branch 'main' of github.com:huggingface/huggingface_hub into hf…

fb1fbfa

…fs-find

Finish find imlementation

70276d8

Merge branch 'main' of github.com:huggingface/huggingface_hub into hf…

085241c

…fs-find

Add type hints to find

2082382

Add comment

9b07b1a

Tests

8f8b9fc

Style

c0e9b6e

mariosasko requested review from Wauplin and lhoestq April 18, 2023 17:22

lhoestq reviewed Apr 19, 2023

src/huggingface_hub/hf_file_system.py (outdated)

mariosasko added 4 commits May 2, 2023 19:09

Merge branch 'main' of github.com:huggingface/huggingface_hub into hf…

b7f0b30

…fs-find

Add comment

61aa5ef

Minor improvement

334509e

Fix

a77fbaa

mariosasko mentioned this pull request May 12, 2023

load_dataset('bigcode/the-stack-dedup', streaming=True) very slow! huggingface/datasets#5846

Closed

mariosasko mentioned this pull request Sep 1, 2023

Don't call the Hub datasets /tree endpoint with expand=True huggingface/dataset-viewer#1748

Closed

Wauplin added this to the in next release? milestone Oct 23, 2023

mariosasko mentioned this pull request Oct 30, 2023

Replace ReprMixin with dataclasses #1788

Merged

mariosasko mentioned this pull request Nov 8, 2023

Faster HfFileSystem.find #1809

Merged

mariosasko closed this Nov 8, 2023

mariosasko deleted the hffs-find branch November 8, 2023 19:23
