Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize HfFileSystem.find #1443

Closed
wants to merge 12 commits into from
Closed

Optimize HfFileSystem.find #1443

wants to merge 12 commits into from

Conversation

mariosasko
Copy link
Contributor

Similarly to S3FileSystem.find, optimize HfFileSystem.find to make it faster for Hub repositories with many files. The speed-up is achieved by fetching all the "tree" entries simultaneously instead of recursively for each directory, as done in the default implementation.

These changes make fetching the datasets/bigcode/the-stack-dedup data files 4x faster (30 sec. vs. 120 sec.) and the difference should become even bigger when we increase the number of returned files per page when paginating the /tree endpoint's response.

(Also important for huggingface/datasets#5537)

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@codecov
Copy link

codecov bot commented Apr 18, 2023

Codecov Report

Patch coverage: 83.78% and project coverage change: +3.42 🎉

Comparison is base (66c3ff1) 78.93% compared to head (c0e9b6e) 82.36%.

❗ Current head c0e9b6e differs from pull request most recent head 61aa5ef. Consider uploading reports for the commit 61aa5ef to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1443      +/-   ##
==========================================
+ Coverage   78.93%   82.36%   +3.42%     
==========================================
  Files          53       53              
  Lines        5641     5643       +2     
==========================================
+ Hits         4453     4648     +195     
+ Misses       1188      995     -193     
Impacted Files Coverage Δ
src/huggingface_hub/hf_file_system.py 88.12% <83.78%> (-2.43%) ⬇️

... and 8 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@mariosasko
Copy link
Contributor Author

Let's continue this discussion (internal) here. As explained in the follow-up comment, we cannot use list_files_info as we also need directories, and we don't always fetch the files recursively, so inferring them from the returned file paths is also not possible. So adding a new method to HfApi seems the best option (e.g., get_tree). @Wauplin WDYT?

@Wauplin
Copy link
Contributor

Wauplin commented May 16, 2023

Ok, sorry for the long delay to answer. Yes indeed I get the point of having directories in addition to the files. What do you think of something like?

@dataclass
class TreeEntry:
    path: str
    oid: str
    type: Literal["file", "folder"]
    # API returns `0` for folders but I feel it's misleading.
    # What about `None` so that user don't think a folder is empty when size is 0?
    size: Optional[int]

def get_repo_tree(
    self,
    path_in_repo: str,
    *,
    recursive: bool=False,
    repo_id: str,
    repo_type: Optional[str] = None,
    revision: Optional[str] = None,
    token: Optional[str] = None,
) -> Iterable[TreeEntry]:
    ...

Just brainstorming here so happy to change. Once we are decided, it should be straightforward to implement. I wouldn't add too much details in the TreeEntry object as the aim is really to list of the folders/files. If a user really needs more details (last commit, last modified, security scan,...), list_repo_files is more appropriate. WDYT @mariosasko ?

@mariosasko
Copy link
Contributor Author

Regarding the directory size, I prefer the value of 0 to be consistent with fsspec (and the API). I agree with the rest of the design.

@Wauplin
Copy link
Contributor

Wauplin commented May 17, 2023

Fine with directory size to 0 then :)

@mariosasko
Copy link
Contributor Author

This PR is pretty old, so I'm closing it in favor of #1809

@mariosasko mariosasko closed this Nov 8, 2023
@mariosasko mariosasko deleted the hffs-find branch November 8, 2023 19:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants