-
Notifications
You must be signed in to change notification settings - Fork 77
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Don't call the Hub datasets /tree endpoint with expand=True #1748
Comments
I'm surprised that the 4th step is relevant - Optimizing the |
Do you think this issue can be moved to datasets? |
That would require patching stuff and might not be future-proof, whereas in The alternative could be to add a new parameter to only return the basic infos in |
Some datasets can't be processed because of this, e.g. taesiri/arxiv_qa
|
is it still relevant @lhoestq ? |
Nope, we no longer use |
This puts a lot of pressure on the Hub, and can even break it for big datasets
This is because the Hub gets the lastCommit for each file, and somehow the implementation is sort of n^2 apparently.
Calling /tree with
expand=True
can happen in this cascade of events invovinghffs
(akahuggingface_hub.hf_file_system.HfFileSystem
):hffs.glob
3a. fsspec calls
hffs.find
and thenhffs.ls
withdetail=True
(even ifglob
hasdetail=False
)3b.
hffs
calls /tree in with pagination andexpand=True
hffs.info
indatasets.data_files._get_single_origin_metadata
4a.
hffs
callshffs.ls
withdetail=True
on the parent directory4b.
hffs
calls /tree with pagination andexpend=True
One way to stop using
expand=True
indatasets-server
altogether is to mockhffs.ls
to not useexpand=True
.We also need to replace the file information (currently obtained using
expand=True
) inhffs.info
to not useexpand=True
. This is only used to get theETag
to check if a dataset has changed between two commits and possibly reuse generated parquet files but it should be fine. We could use theoid
instead.cc @Wauplin @mariosasko for visibility, though it might be too specific to be handled in
hfh
directlycc @XciD who reported it
The text was updated successfully, but these errors were encountered: