
Add TabRepo artifacts to HuggingFace #66

Open
Tracked by #63
Innixma opened this issue Jul 11, 2024 · 6 comments

@Innixma (Collaborator) commented Jul 11, 2024

Add TabRepo artifacts to HuggingFace for faster downloads and improved visibility.

Innixma added this to the TabRepo 2.0 milestone Jul 11, 2024
@geoalgo (Collaborator) commented Jul 22, 2024

I took a look, and it seems best to call snapshot_download from HF directly, which downloads the files in parallel and should be quite efficient.

One thing, though: right now the files listed in the context live on S3, and that would have to be removed. The way I was thinking to get similar behavior (being able to download only a subset of tasks) is to whitelist the file names, essentially calling snapshot_download with allow_patterns set to the list of desired datasets.

Would this option work for you?

@Innixma (Collaborator, Author) commented Jul 23, 2024

@geoalgo Sounds reasonable. If you want, you can start with a toy subset of the data as a proof of concept (such as 3 of the smallest datasets). We can then iterate from there, and we can probably host the full artifact via AutoGluon's Hugging Face account after we confirm it works on the toy example.

Regarding partial downloads, whitelisting sounds good, but we will have to see how it works in practice.

@Innixma (Collaborator, Author) commented Jul 23, 2024

Based on the wording: "If provided, only files matching at least one pattern are downloaded."

Could we pass a list of patterns that are full file paths, so that the behavior is identical to the current logic?
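
Exact file paths can indeed serve as patterns, since allow_patterns is matched with fnmatch-style globbing and a literal path matches only itself. A minimal sketch of what that call could look like (the file paths below are hypothetical placeholders, not TabRepo's actual layout; the repo id is taken from the example further down):

from huggingface_hub import snapshot_download

# Hypothetical exact paths used as allow_patterns; under fnmatch a
# literal path matches only itself, mirroring the current S3 logic.
snapshot_download(
    repo_id="autogluon/tabrepo",
    repo_type="dataset",
    allow_patterns=[
        "predictions/dataset_a.parquet",
        "predictions/dataset_b.parquet",
    ],
    local_dir="local_path",
)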

@geoalgo (Collaborator) commented Jul 23, 2024

Thanks for your answers. I would be keen to make sure this solution works for you before it's implemented; otherwise it's wasted effort :-)

> @geoalgo Sounds reasonable. If you want, you can start with a toy subset of the data as a proof of concept (such as 3 of the smallest datasets). We can then iterate from there, and we can probably host the full artifact via AutoGluon's Hugging Face account after we confirm it works on the toy example.

I do not think we need a proof of concept as the HF hub is quite robust and it is as easy to host 3 datasets or all of them. The main thing I need is for you to add me to AG organization so that I can write files there, alternatively I can create a space just for this dataset.

> Based on the wording: "If provided, only files matching at least one pattern are downloaded."
> Could we pass a list of patterns that are full file paths, so that the behavior is identical to the current logic?

I can try to reproduce exactly the same logic, but it seems to me that what we want is to download everything except the dataset predictions, which are heavy and have to be filtered. A download call could look like this, for instance:

from typing import List

from huggingface_hub import snapshot_download

def download_datasets(datasets: List[str]):
    # fnmatch-style patterns: keep the requested datasets plus the
    # two shared metadata tables.
    allow_patterns = list(datasets) + ["baselines.parquet", "configs.parquet"]
    snapshot_download(
        repo_id="autogluon/tabrepo",
        repo_type="dataset",
        allow_patterns=allow_patterns,
        local_dir="local_path",
    )
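
For instance, fetching two (hypothetical) datasets plus the shared parquet tables would be:

download_datasets(["dataset_a", "dataset_b"])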

This would download only the predictions whose datasets are in the desired context. As far as I can see, the behavior would be identical to the current one.

@Innixma (Collaborator, Author) commented Jul 24, 2024

Thanks for the response! All of this looks good.

Do you foresee any downsides to us creating a new space such as tabrepo vs. using the autogluon space? I'm unsure what the limitations are. If you want to go forward with the autogluon space, I can look into granting you write permissions.

@geoalgo (Collaborator) commented Jul 25, 2024

Using AutoGluon would perhaps be cleaner, given that the repository is located in the AG GitHub space, but I do not mind either way. I would also have to check whether I can create a space for tabrepo (I already created one for synetune; not sure if I can easily create many). I will try and let you know!
