Datasets try to load locally although streaming is set to True in Kaggle notebook. #6768

RitchieP · 2024-04-01T09:13:34Z

I have a dataset hosted on the Hugging Face Hub that uses a custom loading script.

I'm loading my dataset as below.

from datasets import load_dataset, IterableDatasetDict

dataset = IterableDatasetDict()

dataset["train"] = load_dataset("RitchieP/VerbaLex_voice", "ar", split="train", use_auth_token=True, streaming=True)
dataset["test"] = load_dataset("RitchieP/VerbaLex_voice", "ar", split="test", use_auth_token=True, streaming=True)

When I try to inspect the loaded data with

list(dataset["train"].take(1))

I get this stack trace:

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Cell In[2], line 1
----> 1 list(dataset["train"].take(1))

File /opt/conda/lib/python3.10/site-packages/datasets/iterable_dataset.py:1388, in IterableDataset.__iter__(self)
   1385         yield formatter.format_row(pa_table)
   1386     return
-> 1388 for key, example in ex_iterable:
   1389     if self.features:
   1390         # `IterableDataset` automatically fills missing columns with None.
   1391         # This is done with `_apply_feature_types_on_example`.
   1392         example = _apply_feature_types_on_example(
   1393             example, self.features, token_per_repo_id=self._token_per_repo_id
   1394         )

File /opt/conda/lib/python3.10/site-packages/datasets/iterable_dataset.py:1044, in TakeExamplesIterable.__iter__(self)
   1043 def __iter__(self):
-> 1044     yield from islice(self.ex_iterable, self.n)

File /opt/conda/lib/python3.10/site-packages/datasets/iterable_dataset.py:234, in ExamplesIterable.__iter__(self)
    233 def __iter__(self):
--> 234     yield from self.generate_examples_fn(**self.kwargs)

File ~/.cache/huggingface/modules/datasets_modules/datasets/RitchieP--VerbaLex_voice/9465eaee58383cf9d7c3e14111d7abaea56398185a641b646897d6df4e4732f7/VerbaLex_voice.py:127, in VerbaLexVoiceDataset._generate_examples(self, local_extracted_archive_paths, archives, meta_path)
    125 for i, audio_archive in enumerate(archives):
    126     print(audio_archive)
--> 127     for path, file in audio_archive:
    128         _, filename = os.path.split(path)
    129         if filename in metadata:

File /opt/conda/lib/python3.10/site-packages/datasets/download/streaming_download_manager.py:869, in _IterableFromGenerator.__iter__(self)
    868 def __iter__(self):
--> 869     yield from self.generator(*self.args, **self.kwargs)

File /opt/conda/lib/python3.10/site-packages/datasets/download/streaming_download_manager.py:919, in ArchiveIterable._iter_from_urlpath(cls, urlpath, download_config)
    915 @classmethod
    916 def _iter_from_urlpath(
    917     cls, urlpath: str, download_config: Optional[DownloadConfig] = None
    918 ) -> Generator[Tuple, None, None]:
--> 919     compression = _get_extraction_protocol(urlpath, download_config=download_config)
    920     # Set block_size=0 to get faster streaming
    921     # (e.g. for hf:// and https:// it uses streaming Requests file-like instances)
    922     with xopen(urlpath, "rb", download_config=download_config, block_size=0) as f:

File /opt/conda/lib/python3.10/site-packages/datasets/download/streaming_download_manager.py:400, in _get_extraction_protocol(urlpath, download_config)
    398 urlpath, storage_options = _prepare_path_and_storage_options(urlpath, download_config=download_config)
    399 try:
--> 400     with fsspec.open(urlpath, **(storage_options or {})) as f:
    401         return _get_extraction_protocol_with_magic_number(f)
    402 except FileNotFoundError:

File /opt/conda/lib/python3.10/site-packages/fsspec/core.py:100, in OpenFile.__enter__(self)
     97 def __enter__(self):
     98     mode = self.mode.replace("t", "").replace("b", "") + "b"
--> 100     f = self.fs.open(self.path, mode=mode)
    102     self.fobjects = [f]
    104     if self.compression is not None:

File /opt/conda/lib/python3.10/site-packages/fsspec/spec.py:1307, in AbstractFileSystem.open(self, path, mode, block_size, cache_options, compression, **kwargs)
   1305 else:
   1306     ac = kwargs.pop("autocommit", not self._intrans)
-> 1307     f = self._open(
   1308         path,
   1309         mode=mode,
   1310         block_size=block_size,
   1311         autocommit=ac,
   1312         cache_options=cache_options,
   1313         **kwargs,
   1314     )
   1315     if compression is not None:
   1316         from fsspec.compression import compr

File /opt/conda/lib/python3.10/site-packages/fsspec/implementations/local.py:180, in LocalFileSystem._open(self, path, mode, block_size, **kwargs)
    178 if self.auto_mkdir and "w" in mode:
    179     self.makedirs(self._parent(path), exist_ok=True)
--> 180 return LocalFileOpener(path, mode, fs=self, **kwargs)

File /opt/conda/lib/python3.10/site-packages/fsspec/implementations/local.py:302, in LocalFileOpener.__init__(self, path, mode, autocommit, fs, compression, **kwargs)
    300 self.compression = get_compression(path, compression)
    301 self.blocksize = io.DEFAULT_BUFFER_SIZE
--> 302 self._open()

File /opt/conda/lib/python3.10/site-packages/fsspec/implementations/local.py:307, in LocalFileOpener._open(self)
    305 if self.f is None or self.f.closed:
    306     if self.autocommit or "w" not in self.mode:
--> 307         self.f = open(self.path, mode=self.mode)
    308         if self.compression:
    309             compress = compr[self.compression]

FileNotFoundError: [Errno 2] No such file or directory: '/kaggle/working/h'

After reading the stack trace and the library source code, it looks like it's trying to open a path in the notebook's local working directory (/kaggle/working), and I don't understand why.
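
One possible reading of the path '/kaggle/working/h', offered only as a hypothesis that this thread does not confirm: if a loading script iterates over a single URL string where it expects a list of URLs, the first item it sees is the character "h" from "https://...". A name with no URL scheme is treated as a local relative path and resolved against the working directory. The sketch below reproduces that effect with the standard library only. The URL and the helper as_url_list are made up for illustration and are not taken from the dataset script.

```python
import posixpath
from urllib.parse import urlparse


def as_url_list(urls):
    """Hypothetical helper: wrap a single URL string in a list."""
    if isinstance(urls, str):
        return [urls]
    return list(urls)


# Made-up URL, for illustration only.
single_url = "https://example.com/audio/train_0.tar"

# Iterating a bare string yields its characters, so the first "path" is "h".
first_item_buggy = next(iter(single_url))

# "h" has no scheme, so a path-based opener would treat it as a local file.
scheme_of_buggy_item = urlparse(first_item_buggy).scheme

# A relative local path is resolved against the current working directory,
# which the error message in the trace shows as /kaggle/working.
resolved = posixpath.join("/kaggle/working", first_item_buggy)

# Normalizing to a list first keeps the whole URL intact.
first_item_fixed = as_url_list(single_url)[0]

print(first_item_buggy, repr(scheme_of_buggy_item), resolved, first_item_fixed)
```

If this is what happened, checking the type of each value the script passes to its archive iteration (string versus list) would be a reasonable first debugging step.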

I'm not sure whether this is a bug in the Datasets library, so I'm opening a discussion first. Feel free to ask for more information if needed. I appreciate any help in advance!

Answered by RitchieP

RitchieP · 2024-04-04T14:23:32Z

This issue has been solved. The fix is in a PR on the dataset's Hugging Face Hub repository, linked below.

https://huggingface.co/datasets/RitchieP/VerbaLex_voice/discussions/6
