Add TabRepo artifacts to HuggingFace #66
Comments
I took a look, and it seems best to call snapshot_download from HF directly, which downloads the files in parallel and should be quite efficient. One thing, though: right now the files on S3 are listed in the context, and this would have to be removed. To keep a similar behavior (being able to download only a subset of tasks), I was thinking of simply whitelisting the file names in that call. Would this option work for you?
@geoalgo Sounds reasonable. If you want, you can start with a toy subset of the data as a proof of concept (such as 3 of the smallest datasets). We can then iterate from there. We can probably host the full artifact via AutoGluon's Hugging Face account once we confirm it works on the toy example. Regarding partial downloads, whitelisting sounds good, but we will have to see how it works in practice.
Based on the wording "If provided, only files matching at least one pattern are downloaded," could we send a list of patterns that are the full file paths, so that it is identical to the current logic?
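A minimal sketch of the full-path idea, for illustration only. It assumes the patterns are matched fnmatch-style, so each path is escaped to make sure characters such as `[` or `*` in a file name are taken literally. The file names and both helper functions are hypothetical, and `would_download` is only a local approximation of the filtering, not the hub's own code.

```python
from fnmatch import fnmatchcase
from glob import escape


def exact_path_patterns(paths):
    """Turn full file paths into patterns that match only those paths."""
    # escape() neutralises *, ? and [ so each pattern matches one literal path.
    return [escape(p) for p in paths]


def would_download(repo_files, allow_patterns):
    """Local approximation: keep files matching at least one pattern."""
    return [
        f for f in repo_files
        if any(fnmatchcase(f, p) for p in allow_patterns)
    ]


# Hypothetical file names, not the real TabRepo layout.
repo_files = ["task_a/pred.pkl", "task_b/pred.pkl", "task_ab/pred.pkl"]
wanted = ["task_a/pred.pkl"]
selected = would_download(repo_files, exact_path_patterns(wanted))
```

The resulting list could then be passed as `allow_patterns`, which would keep the selection explicit at the cost of one pattern per file.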
Thanks for your answers. I would like to be sure that this solution works for you before it's implemented; otherwise it's wasted effort :-)
I do not think we need a proof of concept, as the HF hub is quite robust and it is as easy to host 3 datasets as all of them. The main thing I need is for you to add me to the AG organization so that I can write files there; alternatively, I can create a space just for this dataset.
I can try to reproduce exactly the same logic, but it seems to me that what we want is to download everything except the dataset predictions, which are heavy and have to be filtered. A download call could look like this for instance:
This would download only the predictions whose datasets are in the desired context. As far as I can see, the behavior would be identical to the current one.
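For readers following along, here is a hedged sketch of what such a filtered call might look like. It is not the snippet from the original comment. The `metadata/` and `predictions/<dataset>/` layout, the `repo_type`, and both helper names are assumptions made for illustration; only `snapshot_download` and its `allow_patterns` argument come from `huggingface_hub`.

```python
def build_allow_patterns(datasets):
    """Patterns for shared files plus predictions of the chosen datasets.

    Assumes a hypothetical repo layout:
      metadata/...               small files that are always needed
      predictions/<dataset>/...  heavy per-dataset prediction files
    """
    patterns = ["metadata/*"]
    patterns += [f"predictions/{name}/*" for name in datasets]
    return patterns


def download_context(repo_id, datasets, local_dir):
    """Download only the files needed for the given datasets."""
    # Imported lazily so the pattern helper works without the package;
    # requires `pip install huggingface_hub` to actually download.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=repo_id,          # hypothetical, e.g. an org/name pair
        repo_type="dataset",      # assumption: artifacts hosted as a dataset
        allow_patterns=build_allow_patterns(datasets),
        local_dir=local_dir,
    )
```

Because the trailing `/` is part of each pattern, a dataset named `d1` would not also pull in `d10`.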
Thanks for the response! All of this looks good. Do you foresee any downsides with us creating a new space such as |
Using AutoGluon would perhaps be cleaner, given that the repository is located in the AG GitHub space, but I do not mind. I would also have to check whether I can create a space for tabrepo (I already created one for synetune and am not sure if I can easily create many). I will try and let you know!
Add TabRepo artifacts to HuggingFace for faster downloads and improved visibility.