# How to Load Datasets from Hugging Face

In this tutorial, we will learn how to download datasets from Hugging Face using the [`datasets`](https://huggingface.co/docs/datasets/en/index) library. The `datasets` library is a powerful tool that lets you easily download and work with datasets from [Hugging Face](https://huggingface.co/datasets).

See the full list of supported datasets [here](https://github.com/RUCAIBox/LLMBox/tree/main/docs/utilization/supported-datasets.md).

## Case 1: Directly load from Hugging Face

By default, `LLMBox` handles everything for you. You just need to specify the dataset name on the command line:

```shell
python inference.py -m model -d mmlu
```

The dataset will be downloaded and cached in the `~/.cache/huggingface/datasets` directory.
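
For reference, this is roughly what happens under the hood with the plain `datasets` API. A minimal sketch, using the `cais/mmlu` Hub repository and the `abstract_algebra` subset as illustrative names (LLMBox may use a different repository id internally):

```python
from datasets import load_dataset

# The first call downloads the data and caches it under
# ~/.cache/huggingface/datasets; later calls reuse the cache.
dataset = load_dataset("cais/mmlu", "abstract_algebra", split="test")
print(dataset[0])
```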

## Case 2: Load from a Hugging Face mirror

To load a dataset from a Hugging Face mirror, use the `--hf_mirror` flag. The dataset will be downloaded from the mirror using `hfd.sh`.

This is an experimental feature and may not work in some environments. If you encounter any issues, please let us know.

```shell
python inference.py -m model -d mmlu --hf_mirror
```

`hfd.sh` is a slightly modified version of a [`huggingface-cli` download wrapper](https://gist.github.com/padeoe/697678ab8e528b85a2a7bddafea1fa4f/), which offers more stable and faster downloads than the original `huggingface-cli`.

`hfd.sh` downloads the dataset from the Hugging Face mirror and caches it in the `~/.cache/huggingface/datasets` directory. Then `datasets` loads the dataset from the cache.
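
If `--hf_mirror` does not work in your environment, a common manual fallback is the standard `HF_ENDPOINT` environment variable, which the Hugging Face libraries (and `hfd.sh`) honor. This is generic Hugging Face tooling rather than an LLMBox feature, and the mirror URL below is only an example:

```shell
# Point Hugging Face downloads at a mirror of your choice.
export HF_ENDPOINT=https://hf-mirror.com
python inference.py -m model -d mmlu
```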

The next time you run the command, `datasets` will load the dataset directly from the cache:

```shell
python inference.py -m another-model -d mmlu
```

## Case 3: Load local dataset in offline mode

If you have already downloaded the dataset and want to load it in offline mode, use `--dataset_path` to specify the dataset path.

```shell
python inference.py -m model -d mmlu --dataset_path path/to/mmlu
```

The dataset will be loaded from the specified path.
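
If you want to guarantee that no network access is attempted, the `datasets` library also honors the standard Hugging Face offline environment variables. These are generic `datasets`/`huggingface_hub` switches, not LLMBox flags:

```shell
# Force `datasets` and `huggingface_hub` to use only local caches/files.
export HF_DATASETS_OFFLINE=1
export HF_HUB_OFFLINE=1
python inference.py -m model -d mmlu --dataset_path path/to/mmlu
```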
`--dataset_path` accepts several local layouts:

```bash
# from a cloned directory of the huggingface dataset repository:
python inference.py -d copa --dataset_path /path/to/copa

# from a local (nested) directory saved by `dataset.save_to_disk`:
python inference.py -d race --dataset_path /path/to/race/middle
python inference.py -d race:middle --dataset_path /path/to/race
python inference.py -d race:middle --dataset_path /path/to/race/middle
python inference.py -d race:middle,high --dataset_path /path/to/race
```
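
As a reference for producing such a directory, here is a minimal sketch using the plain `datasets` API; the `ehovy/race` repository id is one public mirror of RACE and is used only for illustration:

```python
from datasets import load_dataset

# Download the "middle" subset and save it in the on-disk layout
# that `--dataset_path` can read back.
race_middle = load_dataset("ehovy/race", "middle")
race_middle.save_to_disk("/path/to/race/middle")
```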

`dataset_path` can also accept a dataset file or a directory containing such files (json, jsonl, csv, and txt are supported):

```bash
# load one split from one subset only
python inference.py -d gsm8k --dataset_path /path/to/gsm.jsonl
python inference.py -d race --dataset_path /path/to/race/middle/train.json

# load test and train splits from the middle subset (a directory containing `/path/to/race/middle/train.json` and `/path/to/race/middle/test.json`)
python inference.py -d race --dataset_path /path/to/race/middle --evaluation_set "test[:10]" --example_set "train"

# load test and train splits from the middle and high subsets (a nested directory)
python inference.py -d race:middle,high --dataset_path /path/to/race --evaluation_set "test[:10]" --example_set "train"

# load test and train splits from the middle and high subsets with a filename pattern
python inference.py -d race:middle,high --evaluation_set "test[:10]" --example_set "train" --dataset_path "/pattern/of/race_{subset}_{split}.json"
python inference.py -d mmlu --evaluation_set val --example_set dev --dataset_path "/pattern/of/mmlu/{split}/{subset}_{split}.csv"
```
# How to Load Datasets with Subsets

Some datasets have multiple subsets. For example, the Massive Multitask Language Understanding (`mmlu`) dataset contains 57 subsets, grouped into four categories: `stem`, `social_sciences`, `humanities`, and `other`.

Other datasets are themselves subsets of a larger dataset. For example, Choice Of Plausible Alternatives (`copa`) is a subset of `super_glue`.

See the full list of supported datasets [here](https://github.com/RUCAIBox/LLMBox/tree/main/docs/utilization/supported-datasets.md).

## Load from the Hugging Face server

We use the `datasets` library to load datasets from the Hugging Face server. If you have issues connecting to the Internet or the Hugging Face server, see [here](https://github.com/RUCAIBox/LLMBox/tree/main/docs/utilization/how-to-load-datasets-from-huggingface.md) for help.

Load a dataset that is a subset of another dataset (e.g. `copa`):

```shell
python inference.py -d copa
```

Load a dataset with multiple subsets (e.g. `mmlu`):

```shell
python inference.py -d mmlu:abstract_algebra,human_sexuality
```

In some cases, you may want to load a specific split of the dataset (e.g. `test`, `dev`, `validation`, ...). Both `evaluation_set` and `example_set` support the Hugging Face [string API](https://huggingface.co/docs/datasets/loading#slice-splits):

```shell
python inference.py -d race:middle,high --evaluation_set "test[:10]" --example_set "train"
```
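
For intuition, `"test[:10]"` is the same slicing syntax that `datasets.load_dataset` accepts for its `split` argument. A small sketch (the `ehovy/race` repository id is illustrative, not necessarily what LLMBox resolves internally):

```python
from datasets import load_dataset

# "test[:10]" selects the first 10 examples of the test split.
eval_data = load_dataset("ehovy/race", "middle", split="test[:10]")
print(len(eval_data))  # 10
```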

## Understand the behaviour of subsets

By default, we load all the subsets of a dataset:

```shell
python inference.py -m model -d mmlu
# expands to all 57 subsets
# equivalent: mmlu:abstract_algebra,human_sexuality,...
# equivalent: mmlu:[stem],[social_sciences],[humanities],[other]
```

```shell
python inference.py -m model -d arc
# equivalent: arc:ARC-Easy,ARC-Challenge
```
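
If you want to see the subset names yourself, the `datasets` library can list a dataset's configs directly. This is plain `datasets` usage, not an LLMBox call, and `cais/mmlu` is used as an example repository id:

```python
from datasets import get_dataset_config_names

# Lists every subset (config) name of a Hub dataset.
print(get_dataset_config_names("cais/mmlu"))
```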

Unless a default subset is defined (see [supported datasets](https://github.com/RUCAIBox/LLMBox/tree/main/docs/utilization/supported-datasets.md) for all the default subsets):

```bash
python inference.py -m model -d cnn_dailymail
# equivalent: cnn_dailymail:3.0.0
```

Some datasets, such as GPQA (Google-Proof Q&A), load their example set separately. You need to download the dataset to a local directory and provide its path:

```bash
# few-shot
python inference.py -m model -d gpqa --ranking_type generation -shots 5 --example_set "../gpqa/prompts"
```

## Overriding the `load_raw_dataset` function

You can also override this function if you want to load the dataset in a different way:

```python
from .utils import load_raw_dataset_from_file, get_raw_dataset_loader

class MyDataset(Dataset):
    def load_raw_dataset(self, dataset_path, subset_name, evaluation_set, example_set):
        # Build a loader for the raw dataset and take its "test" split
        # as the evaluation data ...
        self.evaluation_data = get_raw_dataset_loader(...)("test")
        # ... and read the few-shot examples from a local file.
        self.example_data = load_raw_dataset_from_file("examples.json")
```

For more details on how to customize the dataset, see this [guide](https://github.com/RUCAIBox/LLMBox/tree/main/docs/utilization/how-to-customize-dataset.md).