
Add TabRepo artifacts to HuggingFace #66

Open
Tracked by #63
Innixma opened this issue Jul 11, 2024 · 6 comments

@Innixma (Collaborator) commented Jul 11, 2024

Add TabRepo artifacts to HuggingFace for faster downloads and improved visibility.

Innixma added this to the TabRepo 2.0 milestone Jul 11, 2024
@geoalgo (Collaborator) commented Jul 22, 2024

I took a look, and it seems best to call snapshot_download from HF directly, which downloads the files in parallel and should be quite efficient.

One thing, though: right now the files listed in the context live on S3, and that would have to be removed. The way I was thinking to get similar behavior (being able to download only a subset of tasks) is to whitelist the file names, essentially calling snapshot_download with allow_patterns set to the list of desired datasets.

Would this option work for you?

@Innixma (Collaborator, Author) commented Jul 23, 2024

@geoalgo Sounds reasonable. If you want, you can start with a toy subset of the data as a proof of concept (such as 3 of the smallest datasets). We can then iterate from there, and we can probably host the full artifact via AutoGluon's Hugging Face account after we confirm it works on the toy example.

Regarding partial downloads, whitelisting sounds good, but we will have to see how it works in practice.

@Innixma (Collaborator, Author) commented Jul 23, 2024

Based on the wording: "If provided, only files matching at least one pattern are downloaded."

Could we pass a list of patterns that are full file paths, so that the behavior is identical to the current logic?
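
Exact file paths can indeed serve as patterns, since allow_patterns is matched with fnmatch-style globbing and a literal path matches only itself. A minimal sketch of what that call could look like (the file paths below are hypothetical placeholders, not TabRepo's actual layout; the repo id is taken from the example further down):

from huggingface_hub import snapshot_download

# Hypothetical exact paths used as allow_patterns; under fnmatch a
# literal path matches only itself, mirroring the current S3 logic.
snapshot_download(
    repo_id="autogluon/tabrepo",
    repo_type="dataset",
    allow_patterns=[
        "predictions/dataset_a.parquet",
        "predictions/dataset_b.parquet",
    ],
    local_dir="local_path",
)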

@geoalgo (Collaborator) commented Jul 23, 2024

Thanks for your answers. I would be keen to make sure this solution works for you before it's implemented; otherwise it's wasted effort :-)

> @geoalgo Sounds reasonable. If you want, you can start with a toy subset of the data as a proof of concept (such as 3 of the smallest datasets). We can then iterate from there, and we can probably host the full artifact via AutoGluon's Hugging Face account after we confirm it works on the toy example.

I do not think we need a proof of concept as the HF hub is quite robust and it is as easy to host 3 datasets or all of them. The main thing I need is for you to add me to AG organization so that I can write files there, alternatively I can create a space just for this dataset.

> Based on the wording: "If provided, only files matching at least one pattern are downloaded."
> Could we pass a list of patterns that are full file paths, so that the behavior is identical to the current logic?

I can try to reproduce exactly the same logic, but it seems to me that what we want is to download everything except the dataset predictions, which are heavy and have to be filtered. A download call could look like this, for instance:

from typing import List

from huggingface_hub import snapshot_download

def download_datasets(datasets: List[str]):
    # fnmatch-style patterns: keep the requested datasets plus the
    # two shared metadata tables.
    allow_patterns = list(datasets) + ["baselines.parquet", "configs.parquet"]
    snapshot_download(
        repo_id="autogluon/tabrepo",
        repo_type="dataset",
        allow_patterns=allow_patterns,
        local_dir="local_path",
    )
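
For instance, fetching two (hypothetical) datasets plus the shared parquet tables would be:

download_datasets(["dataset_a", "dataset_b"])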

This would download only the predictions whose datasets are in the desired context. As far as I can see, the behavior would be identical to the current one.

@Innixma (Collaborator, Author) commented Jul 24, 2024

Thanks for the response! All of this looks good.

Do you foresee any downsides to us creating a new space such as tabrepo vs. using the autogluon space? I'm unsure what the limitations are. If you want to go forward with the autogluon space, I can look into granting you write permissions.

@geoalgo (Collaborator) commented Jul 25, 2024

Using AutoGluon would perhaps be cleaner, given that the repository is located in the AG GitHub space, but I do not mind either way. I would also have to check whether I can create a space for tabrepo (I already created one for synetune; not sure if I can easily create many). I will try and let you know!
