Document the HF_DATASETS_CACHE environment variable in the datasets cache documentation #7532

Open · wants to merge 3 commits into `main`
52 changes: 43 additions & 9 deletions docs/source/cache.mdx
@@ -20,30 +20,64 @@ This guide focuses on the 🤗 Datasets cache and will show you how to:
The default 🤗 Datasets cache directory is `~/.cache/huggingface/datasets`. Change the cache location by setting the shell environment variable `HF_HOME` to another directory:

```
$ export HF_HOME="/path/to/another/directory/datasets"
```

Alternatively, you can set the `HF_DATASETS_CACHE` environment variable to control only the datasets-specific cache directory:

```
$ export HF_DATASETS_CACHE="/path/to/datasets_cache"
```

⚠️ This only applies to files written by the `datasets` library (e.g., Arrow files and indices).
It does **not** affect files downloaded from the Hugging Face Hub (such as models, tokenizers, or raw dataset sources), which are controlled separately via the `HF_HUB_CACHE` variable:

```
$ export HF_HUB_CACHE="/path/to/hub_cache"
```
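
If you want to confirm where each cache will actually end up, you can inspect the resolved paths at runtime. This is a quick sketch, assuming a recent `datasets` and `huggingface_hub`; both libraries expose the resolved directories as configuration constants:

```py
>>> import datasets
>>> import huggingface_hub

>>> # Directory where the `datasets` library writes Arrow files and indices
>>> datasets.config.HF_DATASETS_CACHE
>>> # Directory where files downloaded from the Hub are stored
>>> huggingface_hub.constants.HF_HUB_CACHE
```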

💡 If you'd like to relocate all Hugging Face caches—including datasets and hub downloads—use the `HF_HOME` variable instead:

```
$ export HF_HOME="/path/to/cache_root"
```

This results in:
- datasets cache → `/path/to/cache_root/datasets`
- hub cache → `/path/to/cache_root/hub`

These distinctions are especially useful when working in shared environments or networked file systems (e.g., NFS).
See [issue #7480](https://github.com/huggingface/datasets/issues/7480) for discussion on how users encountered unexpected cache locations when `HF_HUB_CACHE` was not set alongside `HF_DATASETS_CACHE`.
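
As a rough way to verify this layout, note that `datasets` typically reads these variables when it is first imported, so set `HF_HOME` before the import. A minimal sketch (the path is illustrative):

```py
>>> import os
>>> os.environ["HF_HOME"] = "/path/to/cache_root"  # set before importing datasets

>>> import datasets
>>> datasets.config.HF_DATASETS_CACHE  # expected to resolve under /path/to/cache_root/datasets
```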

When you load a dataset, you also have the option to change where the data is cached. Change the `cache_dir` parameter to the path you want:

```py
>>> from datasets import load_dataset
>>> dataset = load_dataset('username/dataset', cache_dir="/path/to/another/directory/datasets")
```

## Download mode

After you download a dataset, control how it is loaded by [`load_dataset`] with the `download_mode` parameter. By default, 🤗 Datasets will reuse a dataset if it exists. But if you need the original dataset without any processing functions applied, re-download the files as shown below:

```py
>>> from datasets import load_dataset
>>> dataset = load_dataset('rajpurkar/squad', download_mode='force_redownload')
```

Refer to [`DownloadMode`] for a full list of download modes.
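
The string value above maps to a member of the `DownloadMode` enum; if you prefer the enum form, the same call can be written as follows (a small sketch, assuming `DownloadMode` is importable from the top-level package, as in recent releases):

```py
>>> from datasets import load_dataset, DownloadMode

>>> # Equivalent to download_mode='force_redownload'
>>> dataset = load_dataset('rajpurkar/squad', download_mode=DownloadMode.FORCE_REDOWNLOAD)
```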

## Cache files

Clean up the Arrow cache files in the directory with [`Dataset.cleanup_cache_files`]:

```py
# Returns the number of removed cache files
>>> dataset.cleanup_cache_files()
```

@@ -53,15 +87,15 @@ Clean up the Arrow cache files in the directory with [`Dataset.cleanup_cache_files`]:

## Enable or disable caching

If you're using a cached file locally, it will automatically reload the dataset with any previous transforms you applied. Disable this behavior by setting the argument `load_from_cache_file=False` in [`Dataset.map`]:

```py
>>> updated_dataset = small_dataset.map(add_prefix, load_from_cache_file=False)
```

In the example above, 🤗 Datasets will execute the function `add_prefix` over the entire dataset again instead of loading the dataset from its previous state.
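
For a self-contained illustration, here is a minimal sketch; the toy `small_dataset` and `add_prefix` below are stand-ins for whatever dataset and transform you are working with:

```py
>>> from datasets import Dataset

>>> # Toy data and transform, for illustration only
>>> small_dataset = Dataset.from_dict({"text": ["hello", "world"]})
>>> def add_prefix(example):
...     example["text"] = "My sentence: " + example["text"]
...     return example

>>> # Recomputes the transform instead of loading a previously cached result
>>> updated_dataset = small_dataset.map(add_prefix, load_from_cache_file=False)
```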

Disable caching on a global scale with [`disable_caching`]:

```py
>>> from datasets import disable_caching
>>> disable_caching()
```

@@ -72,7 +106,7 @@ When you disable caching, 🤗 Datasets will no longer reload cached files when

<Tip>

If you want to reuse a dataset from scratch, try setting the `download_mode` parameter in [`load_dataset`] instead.

</Tip>

@@ -82,6 +116,6 @@ If you want to reuse a dataset from scratch, try setting the `download_mode` parameter in [`load_dataset`] instead.

Disabling the cache and copying the dataset in-memory will speed up dataset operations. There are two options for copying the dataset in-memory:

1. Set `datasets.config.IN_MEMORY_MAX_SIZE` to a nonzero value (in bytes) that fits in your RAM memory.

2. Set the environment variable `HF_DATASETS_IN_MEMORY_MAX_SIZE` to a nonzero value. Note that the first method takes higher precedence.
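
For the first option, a minimal sketch (the 1 GB threshold is only an illustrative value; pick one that fits your RAM):

```py
>>> import datasets

>>> # Datasets smaller than this many bytes are copied into memory instead of memory-mapped
>>> datasets.config.IN_MEMORY_MAX_SIZE = 1_000_000_000  # ~1 GB, illustrative
>>> dataset = datasets.load_dataset('rajpurkar/squad')
```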