Deprecate download_custom (huggingface#6093)
* Deprecate `download_custom`

* Better msg
mariosasko authored Jul 28, 2023
1 parent a888bc9 commit 50d9a70
Showing 2 changed files with 2 additions and 7 deletions.
6 changes: 0 additions & 6 deletions docs/source/about_dataset_load.mdx
@@ -86,12 +86,6 @@ There are three main methods in [`DatasetBuilder`]:

Once the files are downloaded, [`SplitGenerator`] organizes them into splits. The [`SplitGenerator`] contains the name of the split, and any keyword arguments that are provided to the [`DatasetBuilder._generate_examples`] method. The keyword arguments can be specific to each split, and typically comprise at least the local path to the data files for each split.

-<Tip>
-
-[`DownloadManager.download_and_extract`] can download files from a wide range of sources. If the data files are hosted on a special access server, you should use [`DownloadManger.download_custom`]. Refer to the reference of [`DownloadManager`] for more details.
-
-</Tip>
-
3. [`DatasetBuilder._generate_examples`] reads and parses the data files for a split. Then it yields dataset examples according to the format specified in the `features` from [`DatasetBuilder._info`]. The input of [`DatasetBuilder._generate_examples`] is actually the `filepath` provided in the keyword arguments of the last method.

The dataset is generated with a Python generator, which doesn't load all the data in memory. As a result, the generator can handle large datasets. However, before the generated samples are flushed to the dataset file on disk, they are stored in an `ArrowWriter` buffer. This means the generated samples are written by batch. If your dataset samples consumes a lot of memory (images or videos), then make sure to specify a low value for the `DEFAULT_WRITER_BATCH_SIZE` attribute in [`DatasetBuilder`]. We recommend not exceeding a size of 200 MB.
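
Note on the context paragraph above: the buffer-then-flush behaviour it describes can be illustrated with a minimal sketch. The names `generate_examples` and `write_in_batches` are hypothetical stand-ins for illustration only; this is not the `ArrowWriter` API and it writes nothing to disk.

```python
from typing import Any, Dict, Iterable, Iterator, List, Tuple


def generate_examples(n: int) -> Iterator[Tuple[int, Dict[str, Any]]]:
    """Hypothetical stand-in for `_generate_examples`: lazily yields (key, example) pairs."""
    for i in range(n):
        yield i, {"text": f"example {i}"}


def write_in_batches(
    examples: Iterable[Tuple[int, Dict[str, Any]]], batch_size: int
) -> List[int]:
    """Buffer examples and flush every `batch_size` items.

    Returns the size of each flushed batch so the behaviour can be inspected.
    """
    flushed: List[int] = []
    buffer: List[Dict[str, Any]] = []
    for _key, example in examples:
        buffer.append(example)
        if len(buffer) >= batch_size:
            # A real writer would serialize the batch to disk at this point.
            flushed.append(len(buffer))
            buffer.clear()
    if buffer:
        # Flush whatever remains once the generator is exhausted.
        flushed.append(len(buffer))
    return flushed


batch_sizes = write_in_batches(generate_examples(10), batch_size=4)
```

At most `batch_size` examples are held in memory at once, which is why a lower batch size helps when individual samples are large.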
3 changes: 2 additions & 1 deletion src/datasets/download/download_manager.py
@@ -28,7 +28,7 @@
from typing import Callable, Dict, Generator, Iterable, List, Optional, Tuple, Union

from .. import config
-from ..utils.deprecation_utils import DeprecatedEnum
+from ..utils.deprecation_utils import DeprecatedEnum, deprecated
from ..utils.file_utils import cached_path, get_from_cache, hash_url_to_filename, is_relative_path, url_or_path_join
from ..utils.info_utils import get_size_checksum_dict
from ..utils.logging import get_logger, is_progress_bar_enabled, tqdm
@@ -349,6 +349,7 @@ def _record_sizes_checksums(self, url_or_urls: NestedDataStructure, downloaded_p
path, record_checksum=self.record_checksums
)

+    @deprecated("Use `.download`/`.download_and_extract` with `fsspec` URLs instead.")
     def download_custom(self, url_or_urls, custom_download):
         """
         Download given urls(s) by calling `custom_download`.
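
Note on the change above: the effect of a decorator such as `deprecated` can be sketched as follows. This is an illustrative stand-in, not the implementation in `datasets.utils.deprecation_utils`; the warning category, the message wording, and the `DownloadManagerSketch` class with its simplified `download_custom` body are all assumptions made for the example.

```python
import functools
import warnings
from typing import Any, Callable, Optional


def deprecated(help_message: Optional[str] = None) -> Callable:
    """Sketch of a deprecation decorator: warns on every call, then delegates."""

    def decorator(func: Callable) -> Callable:
        message = f"{func.__name__} is deprecated."
        if help_message:
            message += " " + help_message

        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            # stacklevel=2 attributes the warning to the caller, not the wrapper.
            warnings.warn(message, FutureWarning, stacklevel=2)
            return func(*args, **kwargs)

        return wrapper

    return decorator


class DownloadManagerSketch:
    """Hypothetical class used only to show the decorator applied to a method."""

    @deprecated("Use `.download`/`.download_and_extract` with `fsspec` URLs instead.")
    def download_custom(self, url_or_urls, custom_download):
        # Simplified body (assumption): hand the URL straight to the callable.
        return custom_download(url_or_urls)
```

The method keeps working after the change; callers only see a warning that points them to the replacement named in the message.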