Don't call the Hub datasets /tree endpoint with expand=True #1748

Closed
lhoestq opened this issue Sep 1, 2023 · 6 comments
Labels: blocked-by-upstream (The issue must be fixed in a dependency), improvement / optimization, P1 (Not as needed as P0, but still important/wanted)

Comments

lhoestq (Member) commented Sep 1, 2023

This puts a lot of pressure on the Hub, and can even break it for big datasets.

This is because the Hub computes the lastCommit for each file, and the implementation is apparently roughly O(n²) in the number of files.

Calling /tree with expand=True can happen in this cascade of events involving hffs (aka huggingface_hub.hf_file_system.HfFileSystem); a minimal reproduction sketch follows the list:

  1. a dataset without script is updated
  2. the config-parquet-and-info job starts
  3. data files are resolved using hffs.glob
    3a. fsspec calls hffs.find and then hffs.ls with detail=True (even if glob is called with detail=False)
    3b. hffs calls /tree with pagination and expand=True
  4. caching information like ETag are obtained using hffs.info in datasets.data_files._get_single_origin_metadata
    4a. hffs calls hffs.ls with detail=True on the parent directory
    4b. hffs calls /tree with pagination and expand=True
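
A minimal reproduction sketch of steps 3 and 4 (the repo id is a placeholder):

```python
# Sketch: fs.glob goes through fsspec's find/ls with detail=True,
# which is what triggers the paginated /tree?expand=True calls.
from huggingface_hub import HfFileSystem

fs = HfFileSystem()

# Step 3: resolve the data files, as the config-parquet-and-info job does.
files = fs.glob("datasets/user/some_dataset/**/*.parquet")

# Step 4: fetch per-file metadata (the ETag), as
# datasets.data_files._get_single_origin_metadata does.
metadata = [fs.info(path) for path in files]
```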

One way to stop using expand=True in datasets-server altogether is to mock hffs.ls so that it never requests the expanded file information.
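
A minimal sketch of that mocking approach, assuming hffs.ls accepts an expand_info-style flag (huggingface_hub later shipped one; the name is an assumption here):

```python
# Sketch: monkeypatch HfFileSystem.ls so every call (including the ones
# fsspec makes internally during glob/find) skips the lastCommit expansion.
import functools
from huggingface_hub import HfFileSystem

_original_ls = HfFileSystem.ls

@functools.wraps(_original_ls)
def _ls_without_expand(self, path, *args, **kwargs):
    kwargs["expand_info"] = False  # assumed flag; avoids /tree?expand=True
    return _original_ls(self, path, *args, **kwargs)

HfFileSystem.ls = _ls_without_expand
```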

We also need hffs.info to stop relying on the file information currently obtained with expand=True. It is only used to get the ETag, which tells us whether a dataset has changed between two commits so that generated parquet files can possibly be reused, so this should be fine: we could use the oid instead.
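
A sketch of the oid idea, assuming HfApi.dataset_info with files_metadata=True (which lists each file's blob oid); the repo id and revisions are placeholders:

```python
# Sketch: compare blob oids between two commits instead of ETags.
from huggingface_hub import HfApi

api = HfApi()

def file_oids(repo_id: str, revision: str) -> dict:
    info = api.dataset_info(repo_id, revision=revision, files_metadata=True)
    return {sibling.rfilename: sibling.blob_id for sibling in info.siblings}

# A blob oid only changes when the file content changes, so equal oids
# mean the previously generated parquet files can be reused.
unchanged = file_oids("user/some_dataset", "old_sha") == file_oids("user/some_dataset", "new_sha")
```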

cc @Wauplin @mariosasko for visibility, though it might be too specific to be handled in hfh directly
cc @XciD who reported it

mariosasko (Contributor) commented

I'm surprised that the 4th step is relevant - fsspec filesystem instances are cached by default, so this step should be able to reuse the dircache populated by the glob operation from the previous step. Still, we can optimize it by using the /paths-info endpoint.

Optimizing the glob step is trickier. We use detail=True (requires expand in the /tree endpoint) to remove directories from the resolved files, so one solution is to patch the find call here to set withdirs=False and detail=False, and implement huggingface/huggingface_hub#1443.
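
For reference, a sketch of the /paths-info idea mentioned above, via its client method HfApi.get_paths_info (repo id and file paths are placeholders):

```python
# Sketch: ask /paths-info only for the files we care about instead of
# expanding the whole /tree; entries carry path, size, and blob oid.
from huggingface_hub import HfApi

api = HfApi()
entries = api.get_paths_info(
    "user/some_dataset",                    # placeholder repo id
    ["data/train-00000-of-00001.parquet"],  # placeholder file paths
    repo_type="dataset",
)
for entry in entries:
    print(entry.path, entry.size, entry.blob_id)
```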

severo added the blocked-by-upstream, improvement / optimization, and P1 labels on Sep 4, 2023
severo (Collaborator) commented Sep 4, 2023

Do you think this issue can be moved to datasets?

lhoestq (Member, Author) commented Sep 4, 2023

That would require patching stuff and might not be future-proof, whereas in datasets-server we pin all the versions and run tests every time we want to update dependencies.

The alternative could be to add a new parameter to fs.info() (and therefore to fs.ls) to only return the basic info, like the type, oid, size, and path, which would not need expand=True.
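
In use, the proposed parameter could look like this (hypothetical signature, mirroring the expand_info flag huggingface_hub later shipped):

```python
from huggingface_hub import HfFileSystem

fs = HfFileSystem()
# Only the cheap fields (type, oid, size, path) are requested here,
# so no expand=True call to /tree is needed.
info = fs.info("datasets/user/some_dataset/data/train.csv", expand_info=False)
print(info["name"], info["type"], info["size"])
```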

lhoestq (Member, Author) commented Sep 15, 2023

Some datasets can't be processed because of this, e.g. taesiri/arxiv_qa

```
Error code:   ConfigNamesError
Exception:    KeyError
Message:      'lastCommit'
Traceback:    Traceback (most recent call last):
                File "/src/services/worker/src/worker/job_runners/dataset/config_names.py", line 55, in compute_config_names_response
                  for config in sorted(get_dataset_config_names(path=dataset, token=hf_token))
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/inspect.py", line 351, in get_dataset_config_names
                  dataset_module = dataset_module_factory(
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/load.py", line 1512, in dataset_module_factory
                  raise e1 from None
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/load.py", line 1489, in dataset_module_factory
                  return HubDatasetModuleFactoryWithoutScript(
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/load.py", line 1047, in get_module
                  patterns = get_data_patterns(base_path, download_config=self.download_config)
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/data_files.py", line 458, in get_data_patterns
                  return _get_data_files_patterns(resolver)
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/data_files.py", line 249, in _get_data_files_patterns
                  data_files = pattern_resolver(pattern)
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/data_files.py", line 333, in resolve_pattern
                  fs, _, _ = get_fs_token_paths(pattern, storage_options=storage_options)
                File "/src/services/worker/.venv/lib/python3.9/site-packages/fsspec/core.py", line 622, in get_fs_token_paths
                  paths = [f for f in sorted(fs.glob(paths)) if not fs.isdir(f)]
                File "/src/services/worker/.venv/lib/python3.9/site-packages/fsspec/spec.py", line 565, in glob
                  allpaths = self.find(root, maxdepth=depth, withdirs=True, detail=True, **kwargs)
                File "/src/services/worker/.venv/lib/python3.9/site-packages/fsspec/spec.py", line 466, in find
                  for _, dirs, files in self.walk(path, maxdepth, detail=True, **kwargs):
                File "/src/services/worker/.venv/lib/python3.9/site-packages/fsspec/spec.py", line 440, in walk
                  yield from self.walk(
                File "/src/services/worker/.venv/lib/python3.9/site-packages/fsspec/spec.py", line 403, in walk
                  listing = self.ls(path, detail=True, **kwargs)
                File "/src/services/worker/.venv/lib/python3.9/site-packages/huggingface_hub/hf_file_system.py", line 282, in ls
                  "last_modified": parse_datetime(tree_item["lastCommit"]["date"]),
              KeyError: 'lastCommit'
```

severo (Collaborator) commented Feb 6, 2024

Is it still relevant, @lhoestq?

lhoestq (Member, Author) commented Feb 6, 2024

Nope, we no longer use expand=True in datasets :)

lhoestq closed this as completed on Feb 6, 2024