Don't call the Hub datasets /tree endpoint with expand=True #1748

Closed
lhoestq opened this issue Sep 1, 2023 · 6 comments
Labels: blocked-by-upstream (The issue must be fixed in a dependency), improvement / optimization, P1 (Not as needed as P0, but still important/wanted)

Comments

lhoestq (Member) commented Sep 1, 2023

This puts a lot of pressure on the Hub, and can even break it for big datasets.

This is because the Hub computes the lastCommit for each file, and the implementation is apparently roughly O(n²) in the number of files.

Calling /tree with expand=True can happen in this cascade of events involving hffs (aka huggingface_hub.hf_file_system.HfFileSystem); a minimal reproduction sketch follows the list:

  1. a dataset without script is updated
  2. the config-parquet-and-info job starts
  3. data files are resolved using hffs.glob
    3a. fsspec calls hffs.find and then hffs.ls with detail=True (even if glob is called with detail=False)
    3b. hffs calls /tree with pagination and expand=True
  4. caching information like ETag are obtained using hffs.info in datasets.data_files._get_single_origin_metadata
    4a. hffs calls hffs.ls with detail=True on the parent directory
    4b. hffs calls /tree with pagination and expand=True
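
A minimal reproduction sketch of steps 3 and 4 (the repo id is a placeholder):

```python
# Sketch: fs.glob goes through fsspec's find/ls with detail=True,
# which is what triggers the paginated /tree?expand=True calls.
from huggingface_hub import HfFileSystem

fs = HfFileSystem()

# Step 3: resolve the data files, as the config-parquet-and-info job does.
files = fs.glob("datasets/user/some_dataset/**/*.parquet")

# Step 4: fetch per-file metadata (the ETag), as
# datasets.data_files._get_single_origin_metadata does.
metadata = [fs.info(path) for path in files]
```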

One way to stop using expand=True in datasets-server altogether is to mock hffs.ls so that it never requests the expanded file information.
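
A minimal sketch of that mocking approach, assuming hffs.ls accepts an expand_info-style flag (huggingface_hub later shipped one; the name is an assumption here):

```python
# Sketch: monkeypatch HfFileSystem.ls so every call (including the ones
# fsspec makes internally during glob/find) skips the lastCommit expansion.
import functools
from huggingface_hub import HfFileSystem

_original_ls = HfFileSystem.ls

@functools.wraps(_original_ls)
def _ls_without_expand(self, path, *args, **kwargs):
    kwargs["expand_info"] = False  # assumed flag; avoids /tree?expand=True
    return _original_ls(self, path, *args, **kwargs)

HfFileSystem.ls = _ls_without_expand
```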

We also need hffs.info to stop relying on the file information currently obtained with expand=True. It is only used to get the ETag, which tells us whether a dataset has changed between two commits so that generated parquet files can possibly be reused, so this should be fine: we could use the oid instead.
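
A sketch of the oid idea, assuming HfApi.dataset_info with files_metadata=True (which lists each file's blob oid); the repo id and revisions are placeholders:

```python
# Sketch: compare blob oids between two commits instead of ETags.
from huggingface_hub import HfApi

api = HfApi()

def file_oids(repo_id: str, revision: str) -> dict:
    info = api.dataset_info(repo_id, revision=revision, files_metadata=True)
    return {sibling.rfilename: sibling.blob_id for sibling in info.siblings}

# A blob oid only changes when the file content changes, so equal oids
# mean the previously generated parquet files can be reused.
unchanged = file_oids("user/some_dataset", "old_sha") == file_oids("user/some_dataset", "new_sha")
```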

cc @Wauplin @mariosasko for visibility, though it might be too specific to be handled in hfh directly
cc @XciD who reported it

mariosasko (Contributor) commented

I'm surprised that the 4th step is relevant - fsspec filesystem instances are cached by default, so this step should be able to reuse the dircache populated by the glob operation from the previous step. Still, we can optimize it by using the /paths-info endpoint.

Optimizing the glob step is trickier. We use detail=True (requires expand in the /tree endpoint) to remove directories from the resolved files, so one solution is to patch the find call here to set withdirs=False and detail=False, and implement huggingface/huggingface_hub#1443.
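
For reference, a sketch of the /paths-info idea mentioned above, via its client method HfApi.get_paths_info (repo id and file paths are placeholders):

```python
# Sketch: ask /paths-info only for the files we care about instead of
# expanding the whole /tree; entries carry path, size, and blob oid.
from huggingface_hub import HfApi

api = HfApi()
entries = api.get_paths_info(
    "user/some_dataset",                    # placeholder repo id
    ["data/train-00000-of-00001.parquet"],  # placeholder file paths
    repo_type="dataset",
)
for entry in entries:
    print(entry.path, entry.size, entry.blob_id)
```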

severo added the blocked-by-upstream, improvement / optimization, and P1 labels on Sep 4, 2023
severo (Collaborator) commented Sep 4, 2023

Do you think this issue can be moved to datasets?

lhoestq (Member, Author) commented Sep 4, 2023

That would require patching stuff and might not be future-proof, whereas in datasets-server we pin all the versions and run tests every time we want to update dependencies.

The alternative could be to add a new parameter to fs.info() (and therefore to fs.ls) to only return the basic info, like the type, oid, size, and path, which would not need expand=True.
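
In use, the proposed parameter could look like this (hypothetical signature, mirroring the expand_info flag huggingface_hub later shipped):

```python
from huggingface_hub import HfFileSystem

fs = HfFileSystem()
# Only the cheap fields (type, oid, size, path) are requested here,
# so no expand=True call to /tree is needed.
info = fs.info("datasets/user/some_dataset/data/train.csv", expand_info=False)
print(info["name"], info["type"], info["size"])
```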

lhoestq (Member, Author) commented Sep 15, 2023

Some datasets can't be processed because of this, e.g. taesiri/arxiv_qa

```
Error code:   ConfigNamesError
Exception:    KeyError
Message:      'lastCommit'
Traceback:    Traceback (most recent call last):
                File "/src/services/worker/src/worker/job_runners/dataset/config_names.py", line 55, in compute_config_names_response
                  for config in sorted(get_dataset_config_names(path=dataset, token=hf_token))
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/inspect.py", line 351, in get_dataset_config_names
                  dataset_module = dataset_module_factory(
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/load.py", line 1512, in dataset_module_factory
                  raise e1 from None
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/load.py", line 1489, in dataset_module_factory
                  return HubDatasetModuleFactoryWithoutScript(
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/load.py", line 1047, in get_module
                  patterns = get_data_patterns(base_path, download_config=self.download_config)
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/data_files.py", line 458, in get_data_patterns
                  return _get_data_files_patterns(resolver)
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/data_files.py", line 249, in _get_data_files_patterns
                  data_files = pattern_resolver(pattern)
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/data_files.py", line 333, in resolve_pattern
                  fs, _, _ = get_fs_token_paths(pattern, storage_options=storage_options)
                File "/src/services/worker/.venv/lib/python3.9/site-packages/fsspec/core.py", line 622, in get_fs_token_paths
                  paths = [f for f in sorted(fs.glob(paths)) if not fs.isdir(f)]
                File "/src/services/worker/.venv/lib/python3.9/site-packages/fsspec/spec.py", line 565, in glob
                  allpaths = self.find(root, maxdepth=depth, withdirs=True, detail=True, **kwargs)
                File "/src/services/worker/.venv/lib/python3.9/site-packages/fsspec/spec.py", line 466, in find
                  for _, dirs, files in self.walk(path, maxdepth, detail=True, **kwargs):
                File "/src/services/worker/.venv/lib/python3.9/site-packages/fsspec/spec.py", line 440, in walk
                  yield from self.walk(
                File "/src/services/worker/.venv/lib/python3.9/site-packages/fsspec/spec.py", line 403, in walk
                  listing = self.ls(path, detail=True, **kwargs)
                File "/src/services/worker/.venv/lib/python3.9/site-packages/huggingface_hub/hf_file_system.py", line 282, in ls
                  "last_modified": parse_datetime(tree_item["lastCommit"]["date"]),
              KeyError: 'lastCommit'
```

severo (Collaborator) commented Feb 6, 2024

Is it still relevant, @lhoestq?

lhoestq (Member, Author) commented Feb 6, 2024

Nope, we no longer use expand=True in datasets :)

lhoestq closed this as completed on Feb 6, 2024