-
Notifications
You must be signed in to change notification settings - Fork 556
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize HfFileSystem.find
#1443
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. |
Codecov ReportPatch coverage:
Additional details and impacted files@@ Coverage Diff @@
## main #1443 +/- ##
==========================================
+ Coverage 78.93% 82.36% +3.42%
==========================================
Files 53 53
Lines 5641 5643 +2
==========================================
+ Hits 4453 4648 +195
+ Misses 1188 995 -193
☔ View full report in Codecov by Sentry. |
Let's continue this discussion (internal) here. As explained in the follow-up comment, we cannot use |
Ok, sorry for the long delay to answer. Yes indeed I get the point of having directories in addition to the files. What do you think of something like? @dataclass
class TreeEntry:
path: str
oid: str
type: Literal["file", "folder"]
# API returns `0` for folders but I feel it's misleading.
# What about `None` so that user don't think a folder is empty when size is 0?
size: Optional[int]
def get_repo_tree(
self,
path_in_repo: str,
*,
recursive: bool=False,
repo_id: str,
repo_type: Optional[str] = None,
revision: Optional[str] = None,
token: Optional[str] = None,
) -> Iterable[TreeEntry]:
... Just brainstorming here so happy to change. Once we are decided, it should be straightforward to implement. I wouldn't add too much details in the |
Regarding the directory size, I prefer the value of 0 to be consistent with |
Fine with directory size to 0 then :) |
This PR is pretty old, so I'm closing it in favor of #1809 |
Similarly to
S3FileSystem.find
, optimizeHfFileSystem.find
to make it faster for Hub repositories with many files. The speed-up is achieved by fetching all the "tree" entries simultaneously instead of recursively for each directory, as done in the default implementation.These changes make fetching the
datasets/bigcode/the-stack-dedup
data files 4x faster (30 sec. vs. 120 sec.) and the difference should become even bigger when we increase the number of returned files per page when paginating the/tree
endpoint's response.(Also important for huggingface/datasets#5537)