Faster `HfFileSystem.find` #1809

mariosasko · 2023-11-08T19:22:43Z

Supersedes #1443

Makes the HfFileSystem.find method faster by making it "one-shot" instead of calling ls on each subdirectory. It also makes HfFileSystem.info faster by replacing the default implementation that fetches the parent path's /tree with a call to the /paths-info endpoint.

To implement this, this PR adds the following methods to the HfApi:

list_repo_tree (to optimize HfFileSystem.find)
get_paths_info (to optimize HfFileSystem.info)
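
To illustrate why a "one-shot" listing helps, here is a minimal sketch that counts simulated requests against a toy in-memory repo. `FakeRepo`, `ls`, `tree`, `find_per_directory` and `find_one_shot` are hypothetical names made up for this example; they are not the huggingface_hub API and only mimic the per-directory versus recursive listing pattern described above.

```python
from typing import List, Set


class FakeRepo:
    """Hypothetical in-memory stand-in for a repo; counts simulated requests."""

    def __init__(self, files: List[str]) -> None:
        self.files = sorted(files)
        self.requests = 0

    def ls(self, directory: str) -> List[str]:
        """List the direct children of `directory` (one simulated request)."""
        self.requests += 1
        prefix = directory.rstrip("/") + "/" if directory else ""
        children: Set[str] = set()
        for f in self.files:
            if f.startswith(prefix):
                head, sep, _ = f[len(prefix):].partition("/")
                children.add(prefix + head + sep)  # a trailing "/" marks a directory
        return sorted(children)

    def tree(self, directory: str) -> List[str]:
        """List every file below `directory` (one simulated request)."""
        self.requests += 1
        prefix = directory.rstrip("/") + "/" if directory else ""
        return [f for f in self.files if f.startswith(prefix)]


def find_per_directory(repo: FakeRepo, root: str = "") -> List[str]:
    """Walk the tree by calling `ls` once per directory."""
    found: List[str] = []
    stack = [root]
    while stack:
        for entry in repo.ls(stack.pop()):
            if entry.endswith("/"):
                stack.append(entry)
            else:
                found.append(entry)
    return sorted(found)


def find_one_shot(repo: FakeRepo, root: str = "") -> List[str]:
    """Fetch the whole subtree with a single recursive listing."""
    return repo.tree(root)


files = ["README.md", "data/train/a.csv", "data/train/b.csv", "data/test/c.csv"]
slow, fast = FakeRepo(files), FakeRepo(files)
assert find_per_directory(slow) == find_one_shot(fast)
print(slow.requests, fast.requests)  # 4 1
```

The per-directory walk costs one request per directory (root, data, data/train, data/test), while the recursive listing costs one in total, so the gap grows with the number of subdirectories.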

Additional improvements:

Adds extra fields (last_commit, security, etc.) to the output when detail=True to make it more consistent with HfApi's RepoFile (and RepoFolder)
Reverts "Fix fsspec tests in ci" (#1635), since NotImplementedError means "does not exist" as per "AbstractFileSystem.exists() catches every exception" (fsspec/filesystem_spec#1379 (comment)), and updates the test (it's more consistent with glob.glob now)

HuggingFaceDocBuilderDev · 2023-11-08T19:29:57Z

The documentation is not available anymore as the PR was closed or merged.

mariosasko · 2023-11-08T19:56:55Z

A HfFileSystem.find speed comparison on datasets/bigcode/the-stack-dedup between this PR's branch and main:

faster-hffs-find:
  hffs.find("datasets/bigcode/the-stack-dedup", detail=False) = 1.63 s
  hffs.find("datasets/bigcode/the-stack-dedup", detail=True) = 24.2 s

main:
  hffs.find("datasets/bigcode/the-stack-dedup", detail=False) = 46.2 s
  hffs.find("datasets/bigcode/the-stack-dedup", detail=True) = 47.3 s
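
For reference, the speedups implied by these timings can be computed directly. This snippet only does arithmetic on the numbers reported above; the dictionary layout is an arbitrary choice for the example.

```python
# Timings in seconds, copied from the comparison above.
timings = {
    "detail=False": {"main": 46.2, "faster-hffs-find": 1.63},
    "detail=True": {"main": 47.3, "faster-hffs-find": 24.2},
}

# Speedup = time on main / time on the PR branch.
speedups = {
    mode: result["main"] / result["faster-hffs-find"]
    for mode, result in timings.items()
}

for mode, speedup in speedups.items():
    print(f"{mode}: {speedup:.1f}x faster")
# detail=False: 28.3x faster
# detail=True: 2.0x faster
```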

Wauplin

Thanks @mariosasko for the PR and all the tests. I reviewed the logic and it looks good to me but I'd prefer a second pair of eyes on this (@lhoestq?). I only spotted 1 potential issue when detecting the common path of a list of dirs.

Otherwise, I like the list_repo_tree and get_paths_info that are more consistent with the server-side API. I was wondering if we could get rid of list_paths_info that is a mix of both, without any optimization. I saw it is used in datasets but list_repo_tree can be a drop-in replacement for it. In any case let's not do it in this PR that is already quite big.

Wauplin · 2023-11-09T08:19:42Z

src/huggingface_hub/hf_api.py

@@ -226,18 +226,21 @@ def repo_type_and_id_from_hf_id(hf_id: str, hub_url: Optional[str] = None) -> Tu
    return repo_type, namespace, repo_id


+class LastCommitInfo(TypedDict, total=False):


I know it is not your fault but I'm not a fan of those TypedDict... 😕

I think we should open a separate PR to switch all of them (BlobLfsInfo, LastCommitInfo, BlobSecurityInfo, TransformersInfo, SafeTensorsInfo) to dataclasses, in a backward compatible way (and with a deprecation warning when dict-only method is used). This way we'll finally have a single type to represent data returned by the server (now that you've removed ReprMixin).

Makes sense! Maybe this can be a "good first issue"?
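
A rough sketch of what a backward-compatible switch from a TypedDict to a dataclass could look like. `LastCommitInfoSketch`, its fields, and the warning text are assumptions made for illustration only; this is not the code that landed in huggingface_hub.

```python
import warnings
from dataclasses import dataclass, fields
from datetime import datetime
from typing import Any


@dataclass
class LastCommitInfoSketch:
    """Hypothetical dataclass replacement for a TypedDict (illustrative fields)."""

    oid: str
    title: str
    date: datetime

    def __getitem__(self, key: str) -> Any:
        # Keep `info["oid"]` working for existing callers, but warn so they
        # can migrate to attribute access (`info.oid`).
        if key not in {f.name for f in fields(self)}:
            raise KeyError(key)
        warnings.warn(
            f"Dict-style access is deprecated; use `.{key}` instead.",
            FutureWarning,
            stacklevel=2,
        )
        return getattr(self, key)


info = LastCommitInfoSketch(oid="abc123", title="Initial commit", date=datetime(2023, 11, 9))
print(info.oid)  # attribute access, no warning
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    print(info["title"])  # dict-style access still works
```

Attribute access becomes the primary interface, while `__getitem__` keeps old dict-based call sites running during a deprecation period.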

Wauplin · 2023-11-09T09:00:07Z

src/huggingface_hub/hf_file_system.py

+            if (recursive and dirs_not_in_dircache) or (detail and dirs_not_expanded):
+                # If the dircache is incomplete, find the common path of the missing and non-expanded entries
+                # and extend the output with the result of `_ls_tree(common_path, recursive=True)`
+                common_path = os.path.commonprefix(dirs_not_in_dircache + dirs_not_expanded).rstrip("/")


According to Python docs, common_path is not guaranteed to be a valid path. For example,

>>> import os
>>> os.path.commonprefix(["data/train/", "data/test/"])
"data/t"

which can lead to an unexpected result in self._ls_tree(common_path, ...) just below.

I saw there is os.path.commonpath as well but I'm afraid it will not work on Windows. What we can do is to use PurePosixPath like this:

from pathlib import PurePosixPath
from typing import List

def unix_commonpath(paths: List[str]) -> str:
    """Return the longest common subpath of a list of Unix paths."""
    common_parents = set.intersection(
        *[set(parent.as_posix() for parent in PurePosixPath(path).parents) for path in paths]
    )
    return max(common_parents, key=lambda p: len(p))

>>> unix_commonpath(["data/train/", "data/test/"])
"data"

Let me know if you think of a simpler solution. If we go with this one, could you add a few test cases to ensure it works cross-platform?

Also feel free to add "/" at the end of the common path, otherwise the subsequent .startswith(common_path) calls will match directories like "data_with_a_suffix"
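
A small standalone demonstration of both pitfalls raised in this thread. It uses `posixpath.commonpath` from the standard library as one possible component-wise alternative; this is an assumption for illustration and not necessarily the fix that was adopted in the PR.

```python
import os
import posixpath

dirs = ["data/train/", "data/test/"]

# Character-wise prefix: the result is not necessarily a real directory.
print(os.path.commonprefix(dirs))  # data/t

# Component-wise common path. `posixpath` always treats "/" as the
# separator, whatever the host OS is, which suits Hub repo paths.
common_path = posixpath.commonpath(dirs)
print(common_path)  # data

# Without a trailing "/", a prefix check also matches sibling directories.
sibling = "data_with_a_suffix/file.txt"
print(sibling.startswith(common_path))        # True
print(sibling.startswith(common_path + "/"))  # False
```

Note that `posixpath.commonpath` raises `ValueError` for an empty sequence or a mix of absolute and relative paths, so a caller would need to handle those cases.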

Good catch!

There is a simpler fix to this. I added a test to ensure it works.

Wauplin · 2023-11-09T09:13:19Z

src/huggingface_hub/hf_file_system.py

+                "name": path,
+                "size": 0,
+                "type": "directory",
+                "tree_id": None,  # TODO: tree_id of the root directory?


Good question. Not sure there's one.

lhoestq

Nice ! 🔥 🔥 🔥

lhoestq · 2023-11-09T11:56:58Z

src/huggingface_hub/hf_file_system.py

+            if (recursive and dirs_not_in_dircache) or (detail and dirs_not_expanded):
+                # If the dircache is incomplete, find the common path of the missing and non-expanded entries
+                # and extend the output with the result of `_ls_tree(common_path, recursive=True)`
+                common_path = os.path.commonprefix(dirs_not_in_dircache + dirs_not_expanded).rstrip("/")


Also feel free to add "/" at the end of the common path, otherwise the subsequent .startswith(common_path) calls will match directories like "data_with_a_suffix"

lhoestq · 2023-11-09T12:09:05Z

src/huggingface_hub/hf_file_system.py


-    def info(self, path: str, **kwargs) -> Dict[str, Any]:
-        resolved_path = self.resolve_path(path)
+    def info(self, path: str, refresh: bool = False, revision: Optional[str] = None, **kwargs) -> Dict[str, Any]:


What if we want the info with detail=False ?

The spec doesn't expose the detail parameter in .info: https://github.com/fsspec/filesystem_spec/blob/master/fsspec/spec.py#L639.

IMO, not exposing this parameter makes sense, as the .info method should return all the existing info about a file.

This would make calls to isdir pretty slow no ? We use it a lot in datasets

Ideally isdir calls info() with detail=False and info() doesn't need to call expand=True in this case, which is slow on the Hub side.

It's a bit out of scope of this PR though, we can see later

Made the change to allow calling .info with detail=False (through the kwargs) to support efficient exists/isdir/isfile, but detail is not exposed as a method parameter, to follow the spec.
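
The pattern discussed in this thread, sketched on a toy class. `SketchFileSystem`, its `expanded_lookups` counter, and the entry layout are hypothetical and only illustrate reading `detail` from kwargs so that `isdir` can skip the expensive lookup; this is not the actual HfFileSystem code.

```python
from typing import Any, Dict


class SketchFileSystem:
    """Hypothetical toy filesystem; not the HfFileSystem implementation."""

    def __init__(self, entries: Dict[str, Dict[str, Any]]) -> None:
        self._entries = entries
        self.expanded_lookups = 0  # counts the simulated "expensive" calls

    def info(self, path: str, **kwargs: Any) -> Dict[str, Any]:
        # `detail` is read from kwargs, so the explicit signature does not
        # gain a `detail` parameter.
        detail = kwargs.get("detail", True)
        if path not in self._entries:
            raise FileNotFoundError(path)
        entry = self._entries[path]
        basic = {"name": path, "type": entry["type"], "size": entry["size"]}
        if not detail:
            return basic
        self.expanded_lookups += 1  # stands in for the slower expanded request
        return {**basic, "last_commit": entry.get("last_commit")}

    def isdir(self, path: str) -> bool:
        # Only the entry type is needed, so skip the expensive details.
        try:
            return self.info(path, detail=False)["type"] == "directory"
        except FileNotFoundError:
            return False


fs = SketchFileSystem(
    {
        "data": {"type": "directory", "size": 0},
        "data/a.csv": {"type": "file", "size": 3, "last_commit": "abc123"},
    }
)
print(fs.isdir("data"), fs.expanded_lookups)  # True 0
```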

Co-authored-by: Lucain <[email protected]>

mariosasko · 2023-11-09T15:53:07Z

@Wauplin

Otherwise, I like the list_repo_tree and get_paths_info that are more consistent with the server-side API. I was wondering if we could get rid of list_paths_info that is a mix of both, without any optimization. I saw it is used in datasets but list_repo_tree can be a drop-in replacement for it. In any case let's not do it in this PR that is already quite big.

list_paths_info is not in the API. Did you mean list_files_info?

Wauplin · 2023-11-09T15:53:49Z

list_paths_info is not in the API. Did you mean list_files_info?

Ah yes, I meant list_files_info

lhoestq

LGTM !

lhoestq · 2023-11-10T10:44:44Z

I also opened #1815 (continuation of this PR) to apply this optimization to glob and introduce the expand_info argument. Let me know what you think :)

Wauplin

LGTM! Thanks for making the changes!

mariosasko added 9 commits November 3, 2023 16:00

Remove HfFileSystem.exists override

440158d

Add list_repo_tree to the API

c1fb3d8

Rename BlobLastCommitInfo to LastCommitInfo

7a4a659

Merge branch 'main' of github.com:huggingface/huggingface_hub into op…

49b7099

…timize-hffs-find

Add get_paths_info to the API

d0a3c8f

Improve HfFileSystem tests

2c18e26

Merge branch 'main' of github.com:huggingface/huggingface_hub into op…

b71b5e9

…timize-hffs-find

Faster HfFileSystem.find

c0cd365

Tests

7602d24

mariosasko mentioned this pull request Nov 8, 2023

Optimize HfFileSystem.find #1443

Closed

Merge branch 'main' of github.com:huggingface/huggingface_hub into fa…

8943634

…ster-hffs-find

mariosasko marked this pull request as ready for review November 8, 2023 20:08

mariosasko requested review from Wauplin and lhoestq November 8, 2023 20:08

Wauplin reviewed Nov 9, 2023

lhoestq reviewed Nov 9, 2023

Update src/huggingface_hub/hf_api.py

fd1b752

Co-authored-by: Lucain <[email protected]>

mariosasko added 2 commits November 9, 2023 18:13

Address the rest of comments

e23dbbc

Fix comment

452b21e

mariosasko requested review from lhoestq and Wauplin November 9, 2023 17:48

lhoestq approved these changes Nov 10, 2023

lhoestq mentioned this pull request Nov 10, 2023

Faster HfFileSystem.glob #1815

Merged

Wauplin approved these changes Nov 10, 2023

mariosasko merged commit 7297957 into main Nov 10, 2023
12 of 16 checks passed

mariosasko deleted the faster-hffs-find branch November 10, 2023 14:23

This was referenced Nov 15, 2023

feat: 🎸 disable two datasets to avoid load on the Hub huggingface/dataset-viewer#2120

Merged

Unblock two datasets huggingface/dataset-viewer#2125

Closed

lhoestq mentioned this pull request Nov 21, 2023

Fix common path in _ls_tree #1850

Merged

Wauplin mentioned this pull request Nov 22, 2023

Uncaught KeyError: 'lastCommit' #1853

Closed

This was referenced Dec 14, 2023

Deprecate HfApi.list_files_info #1910

Merged

Use dataclasses for all objects returned by HfApi #1911

Closed
