Wrong exception handling when loading dataset from local disk #173

Open
ganler opened this issue Jun 8, 2024 · 3 comments

@ganler

ganler commented Jun 8, 2024

try:
    # Try first if dataset on a Hub repo
    dataset = load_dataset(ds, ds_config, split=split)
except DatasetGenerationError:
    # If not, check local dataset
    dataset = load_from_disk(os.path.join(ds, split))

The exception actually raised is ValueError, so the fallback to load_from_disk is never reached:

[rank5]: Traceback (most recent call last):
[rank5]:   File "run_sft.py", line 251, in <module>
[rank5]:     main()
[rank5]:   File "run_sft.py", line 86, in main
[rank5]:     raw_datasets = get_datasets(
[rank5]:   File "miniconda3/envs/handbook/lib/python3.10/site-packages/alignment/data.py", line 169, in get_datasets
[rank5]:     raw_datasets = mix_datasets(
[rank5]:   File "miniconda3/envs/handbook/lib/python3.10/site-packages/alignment/data.py", line 218, in mix_datasets
[rank5]:     dataset = load_dataset(ds, ds_config, split=split)
[rank5]:   File "miniconda3/envs/handbook/lib/python3.10/site-packages/datasets/load.py", line 2570, in load_dataset
[rank5]:     raise ValueError(
[rank5]: ValueError: You are trying to load a dataset that was saved using `save_to_disk`. Please use `load_from_disk` instead.

datasets version:

❯ pip show datasets
Name: datasets
Version: 2.19.1
Summary: HuggingFace community-driven open-source library of datasets
Home-page: https://github.com/huggingface/datasets
Author: HuggingFace Inc.
Author-email: [email protected]
License: Apache 2.0
Location: /home/ec2-user/miniconda3/envs/handbook/lib/python3.10/site-packages
Requires: aiohttp, dill, filelock, fsspec, huggingface-hub, multiprocess, numpy, packaging, pandas, pyarrow, pyarrow-hotfix, pyyaml, requests, tqdm, xxhash
Required-by: alignment-handbook, evaluate, trl

I also tried the latest release, 2.19.2, and got the same error. The except clause needs to catch a broader set of exceptions.
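
For illustration, here is a minimal sketch of what a broader fallback could look like. It is not the handbook's code: the helper name `load_with_fallback` and its parameters are hypothetical, and the two loaders are passed in as arguments so the fallback logic can be tried without `datasets` installed.

```python
import os
from typing import Any, Callable, Optional, Tuple, Type


def load_with_fallback(
    ds: str,
    split: str,
    hub_loader: Callable[..., Any],
    disk_loader: Callable[[str], Any],
    ds_config: Optional[str] = None,
    fallback_errors: Tuple[Type[BaseException], ...] = (ValueError,),
) -> Any:
    """Try the Hub-style loader first, then fall back to the on-disk loader.

    Hypothetical helper: `hub_loader` stands in for `load_dataset` and
    `disk_loader` for `load_from_disk`. Only the exception types listed in
    `fallback_errors` trigger the fallback; anything else propagates.
    """
    try:
        # Same call shape as the snippet quoted in this issue.
        return hub_loader(ds, ds_config, split=split)
    except fallback_errors:
        # Assumes each split was saved in its own sub-directory of `ds`.
        return disk_loader(os.path.join(ds, split))
```

With `datasets` installed, this would presumably be called with `load_dataset`, `load_from_disk`, and `fallback_errors=(DatasetGenerationError, ValueError)`, assuming `DatasetGenerationError` is importable from `datasets.builder`. Keeping the tuple explicit avoids a bare `except Exception`, which would also hide unrelated failures such as network errors.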

@alvarobartt
Member

Hi @ganler, thanks for reporting! Do you want to open a PR to fix the data loading handling? Otherwise, feel free to ping us and we can have a look at it. As you point out, the following change should do the trick:

-     except DatasetGenerationError:
+     except ValueError:

Thanks in advance! 🤗

@ganler
Author

ganler commented Jul 8, 2024

Thanks! That's how I fixed it temporarily. Feel free to fix it on your side.

@xiyang-aads-lilly

xiyang-aads-lilly commented Aug 16, 2024

@alvarobartt it seems this issue has not been fixed in the repo yet. The code still reads:

except DatasetGenerationError:

More interesting: if we use "train" or "test" as the split, load_dataset does load data that was saved with save_to_disk, but it loads it the wrong way and raises no error. So changing the clause to ValueError is only a temporary solution. Any suggestions on a better way to handle this problem?
For example:

raw_datasets.save_to_disk("data")
datasets.load_dataset("data", split="train")
# output:
Dataset({
    features: ['_data_files', '_fingerprint', '_format_columns', '_format_kwargs', '_format_type', '_output_all_columns', '_split'],
    num_rows: 1
})
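
One possible direction, sketched below under stated assumptions, is to inspect the local directory before calling load_dataset at all, instead of relying on which exception it raises. The feature names in the output above look like the keys of a saved dataset's state file, which suggests load_dataset picked up that metadata file as if it were data. The sketch assumes that save_to_disk writes a `state.json` file into each saved split directory and a `dataset_dict.json` file at the top level of a saved DatasetDict; these file names are assumptions about current `datasets` releases, and the helper names are hypothetical.

```python
import os
from typing import Optional

# Assumption: marker files written by `save_to_disk` in recent `datasets` releases.
DATASET_STATE_FILE = "state.json"
DATASET_DICT_FILE = "dataset_dict.json"


def saved_to_disk_kind(path: str) -> Optional[str]:
    """Classify a directory as a saved "dataset_dict", a saved "dataset", or None."""
    if not os.path.isdir(path):
        return None
    if os.path.isfile(os.path.join(path, DATASET_DICT_FILE)):
        return "dataset_dict"
    if os.path.isfile(os.path.join(path, DATASET_STATE_FILE)):
        return "dataset"
    return None


def resolve_local_split(ds: str, split: str) -> Optional[str]:
    """Return the directory to hand to `load_from_disk`, or None if `ds`/`split`
    does not look like a split saved with `save_to_disk`."""
    split_dir = os.path.join(ds, split)
    if saved_to_disk_kind(split_dir) == "dataset":
        return split_dir
    return None
```

A caller could then use load_from_disk on the returned path when it is not None, and only otherwise fall through to load_dataset, so the "train"/"test" case above would never reach load_dataset.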
