[doc] more guides on loading datasets
huyiwen committed Jun 6, 2024
1 parent 56910ea commit 0938fb8
Showing 7 changed files with 599 additions and 499 deletions.
7 changes: 3 additions & 4 deletions README.md
@@ -57,7 +57,7 @@ bash bash/run_7b_ds3.sh
To utilize your own model or evaluate an existing model, you can run the following command:

```python
-python inference.py -m gpt-3.5-turbo -d copa # --num_shot 0 --model_type instruction
+python inference.py -m gpt-3.5-turbo -d copa # --num_shot 0 --model_type chat
```

By default, this runs the OpenAI GPT-3.5 Turbo model on the CoPA dataset in a zero-shot manner.
@@ -118,12 +118,11 @@ We provide broad support for Hugging Face models (e.g. `LLaMA-3`, `Mistral`, or
Currently a total of 56+ commonly used datasets are supported, including: `HellaSwag`, `MMLU`, `GSM8K`, `GPQA`, `AGIEval`, `CEval`, and `CMMLU`. For a full list of supported models and datasets, view the [utilization](https://github.com/RUCAIBox/LLMBox/tree/main/utilization) documentation.

```bash
-python inference.py \
+CUDA_VISIBLE_DEVICES=0 python inference.py \
     -m llama-2-7b-hf \
     -d mmlu agieval:[English] \
-    --model_type instruction \
+    --model_type chat \
     --num_shot 5 \
-    --cuda 0 \
     --ranking_type ppl_no_option
```

Empty file.
@@ -2,6 +2,8 @@

If you find that a dataset is not supported in the current version, feel free to implement your own dataset and submit a PR.

+See the full list of supported datasets [here](https://github.com/RUCAIBox/LLMBox/tree/main/docs/utilization/supported-datasets.md).

## Choose the Right Dataset

We provide two types of datasets: [`GenerationDataset`](https://github.com/RUCAIBox/LLMBox/tree/main/utilization/dataset/generation_dataset.py) and [`MultipleChoiceDataset`](https://github.com/RUCAIBox/LLMBox/tree/main/utilization/dataset/multiple_choice_dataset.py).
@@ -35,7 +37,7 @@ These are the attributes you can define in a new dataset:

- `example_set` (`Optional[str]`): The example split of the dataset. Example data will be automatically loaded if this is not None.

-- `load_args` (`Union[Tuple[str], Tuple[str, str], Tuple[()]]`, **required\***): Arguments for loading the dataset with huggingface `load_dataset`. See [load from source data](https://github.com/RUCAIBox/LLMBox/tree/main/docs/utilization/customize-dataset.md#load-from-source-data) for details.
+- `load_args` (`Union[Tuple[str], Tuple[str, str], Tuple[()]]`, **required\***): Arguments for loading the dataset with huggingface `load_dataset`. See [load from source data](https://github.com/RUCAIBox/LLMBox/tree/main/docs/utilization/how-to-customize-dataset.md#load-from-source-data) for details.

- `extra_model_args` (`Dict[str, Any]`): Extra arguments for the model like `temperature`, `stop` etc. See `set_generation_args`, `set_prob_args`, and `set_ppl_args` for details.

@@ -45,7 +47,7 @@ Then implement the following methods or properties:
- `references` (**required**): Return the reference answers for evaluation.
- `init_arguments`: Initialize the arguments for the dataset. This is called before the raw dataset is loaded.

-See [here](https://github.com/RUCAIBox/LLMBox/tree/main/docs/utilization/customize-dataset.md#advanced-topics) for advanced topics.
+See [here](https://github.com/RUCAIBox/LLMBox/tree/main/docs/utilization/how-to-customize-dataset.md#advanced-topics) for advanced topics.


## Load from Source Data
76 changes: 76 additions & 0 deletions docs/utilization/how-to-load-datasets-from-huggingface.md
@@ -0,0 +1,76 @@
# How to Load Datasets from Hugging Face

In this tutorial, we will learn how to download datasets from Hugging Face using the [`datasets`](https://huggingface.co/docs/datasets/en/index) library, a powerful tool that makes it easy to download and work with datasets hosted on [Hugging Face](https://huggingface.co/datasets).

See the full list of supported datasets [here](https://github.com/RUCAIBox/LLMBox/tree/main/docs/utilization/supported-datasets.md).

## Case 1: Directly load from Hugging Face

By default, `LLMBox` will handle everything for you. You just need to specify the dataset name in the command line.

```bash
python inference.py -m model -d mmlu
```

The dataset will be downloaded and cached in the `~/.cache/huggingface/datasets` directory.
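Under the hood, this amounts to a standard `datasets` call. Here is a minimal sketch, assuming the dataset resolves to the `cais/mmlu` repository on the Hub (LLMBox may resolve dataset names differently):

```python
# Minimal sketch, not LLMBox internals: the "cais/mmlu" path and the
# "abstract_algebra" subset are assumptions for illustration.
from datasets import load_dataset

# Downloaded on first use, then reused from ~/.cache/huggingface/datasets
dataset = load_dataset("cais/mmlu", "abstract_algebra", split="test")
print(dataset[0])
```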

## Case 2: Load from a Hugging Face mirror

To load a dataset from a Hugging Face mirror, use the `--hf_mirror` flag. The dataset will be downloaded from the mirror using `hfd.sh`.

This is an experimental feature and may not work in some environments. If you encounter any issues, please let us know.

```shell
python inference.py -m model -d mmlu --hf_mirror
```

`hfd.sh` is a slightly modified version of the `huggingface-cli` download [wrapper](https://gist.github.com/padeoe/697678ab8e528b85a2a7bddafea1fa4f/), which offers more stable and faster downloads than the original `huggingface-cli`.

`hfd.sh` will download the dataset from the Hugging Face mirror and cache it in the `~/.cache/huggingface/datasets` directory. Then `datasets` will load the dataset from the cache.

The next time you run the command, `datasets` will load the dataset directly from the cache:

```shell
python inference.py -m another-model -d mmlu
```
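If `hfd.sh` does not work in your environment, a generic fallback is to point the Hugging Face libraries at a mirror endpoint yourself. This relies on the standard `HF_ENDPOINT` variable honored by `huggingface_hub`, not on an LLMBox feature, and `hf-mirror.com` is only an example endpoint:

```bash
# Standard Hugging Face mechanism, independent of LLMBox; the mirror
# endpoint below is an example and may differ in your region.
export HF_ENDPOINT=https://hf-mirror.com
python inference.py -m model -d mmlu
```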

## Case 3: Load local dataset in offline mode

If you have already downloaded the dataset and want to load it in offline mode, you can use `--dataset_path` to specify the dataset path.

```shell
python inference.py -m model -d mmlu --dataset_path path/to/mmlu
```

The dataset will be loaded from the specified path.


```bash
# from a cloned directory of the huggingface dataset repository:
python inference.py -d copa --dataset_path /path/to/copa

# from a local (nested) directory saved by `dataset.save_to_disk`:
python inference.py -d race --dataset_path /path/to/race/middle
python inference.py -d race:middle --dataset_path /path/to/race
python inference.py -d race:middle --dataset_path /path/to/race/middle
python inference.py -d race:middle,high --dataset_path /path/to/race
```
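When running fully offline, you can additionally tell the Hugging Face libraries to skip all network access. These are standard `datasets`/`huggingface_hub` environment variables, not LLMBox options:

```bash
# Standard Hugging Face offline switches; assumes the dataset files are
# already available locally (via --dataset_path or the cache).
export HF_DATASETS_OFFLINE=1
export HF_HUB_OFFLINE=1
python inference.py -m model -d mmlu --dataset_path path/to/mmlu
```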

`dataset_path` can also accept a dataset file or a directory containing these files (supports json, jsonl, csv, and txt):
```bash
# load one split from one subset only
python inference.py -d gsm8k --dataset_path /path/to/gsm.jsonl
python inference.py -d race --dataset_path /path/to/race/middle/train.json

# load test and train splits from the middle subset (a directory containing `/path/to/race/middle/train.json` and `/path/to/race/middle/test.json`)
python inference.py -d race --dataset_path /path/to/race/middle --evaluation_set "test[:10]" --example_set "train"

# load test and train splits from middle and high subsets (a nested directory)
python inference.py -d race:middle,high --dataset_path /path/to/race --evaluation_set "test[:10]" --example_set "train"

# load test and train splits from middle and high subsets with a filename pattern
python inference.py -d race:middle,high --evaluation_set "test[:10]" --example_set "train" --dataset_path "/pattern/of/race_{subset}_{split}.json"
python inference.py -d mmlu --evaluation_set val --example_set dev --dataset_path "/pattern/of/mmlu/{split}/{subset}_{split}.csv"
```
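As a rough illustration of how such a pattern expands (the loader's actual logic may differ), the `{subset}` and `{split}` placeholders are filled in for every requested subset and split:

```python
# Illustrative sketch only: expand a {subset}/{split} filename pattern.
pattern = "/pattern/of/race_{subset}_{split}.json"
for subset in ("middle", "high"):
    for split in ("test", "train"):
        print(pattern.format(subset=subset, split=split))
# -> /pattern/of/race_middle_test.json, /pattern/of/race_middle_train.json, ...
```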
74 changes: 74 additions & 0 deletions docs/utilization/how-to-load-datasets-with-subsets.md
@@ -0,0 +1,74 @@
# How to Load Datasets with Subsets

Some datasets have multiple subsets. For example, the Massive Multitask Language Understanding (`mmlu`) dataset contains 57 subsets categorized into four groups: `stem`, `social_sciences`, `humanities`, and `other`.

Other datasets are themselves subsets of a larger dataset. For example, Choice Of Plausible Alternatives (`copa`) is a subset of `super_glue`.

See the full list of supported datasets [here](https://github.com/RUCAIBox/LLMBox/tree/main/docs/utilization/supported-datasets.md).

## Load from the Hugging Face server

We use the `datasets` library to load datasets from the Hugging Face server. If you have issues connecting to the Internet or the Hugging Face server, see [here](https://github.com/RUCAIBox/LLMBox/tree/main/docs/utilization/how-to-load-datasets-from-huggingface.md) for help.

Load a dataset that is a subset of another dataset (e.g. `copa`):

```shell
python inference.py -d copa
```

Load a dataset with multiple subsets (e.g. `mmlu`):

```shell
python inference.py -d mmlu:abstract_algebra,human_sexuality
```

In some cases, you may want to load a specific split of the dataset (e.g. `test`, `dev`, `validation`, ...). Both `evaluation_set` and `example_set` support the Hugging Face [String API](https://huggingface.co/docs/datasets/loading#slice-splits):

```shell
python inference.py -d race:middle,high --evaluation_set "test[:10]" --example_set "train"
```
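The slicing itself is handled by the `datasets` library; a rough stand-alone equivalent of the evaluation-set slice above would be:

```python
from datasets import load_dataset

# "test[:10]" keeps only the first 10 examples of the test split.
evaluation_data = load_dataset("race", "middle", split="test[:10]")
print(len(evaluation_data))  # 10
```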

## Understand the behaviour of subsets

By default we load all the subsets of a dataset:

```shell
python inference.py -m model -d mmlu
# expands to all 57 subsets
# equivalent: mmlu:abstract_algebra,human_sexuality,...
# equivalent: mmlu:[stem],[social_sciences],[humanities],[other]
```

```shell
python inference.py -m model -d arc
# equivalent: arc:ARC-Easy,ARC-Challenge
```

Unless a default subset is defined (see [supported datasets](https://github.com/RUCAIBox/LLMBox/tree/main/docs/utilization/supported-datasets.md) for all default subsets):

```bash
python inference.py -m model -d cnn_dailymail
# equivalent: cnn_dailymail:3.0.0
```
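If you are unsure which subsets a dataset ships with, you can list them with the plain `datasets` API (independent of LLMBox; the `cais/mmlu` path is an assumption for illustration):

```python
from datasets import get_dataset_config_names

# Prints all subset (config) names, e.g. 57 entries for MMLU.
print(get_dataset_config_names("cais/mmlu"))
```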

Some datasets, like GPQA (Google-Proof Q&A), need their example set loaded separately. You need to download the dataset to a local directory and provide the path to it:

```bash
# few-shot
python inference.py -m model -d gpqa --ranking_type generation --num_shot 5 --example_set "../gpqa/prompts"
```

## Overriding the `load_raw_dataset` function

Feel free to override this function if you want to load the dataset in a different way:

```python
from .utils import load_raw_dataset_from_file, get_raw_dataset_loader

class MyDataset(Dataset):
    def load_raw_dataset(self, dataset_path, subset_name, evaluation_set, example_set):
        # Load the evaluation split with the standard loader (arguments elided).
        self.evaluation_data = get_raw_dataset_loader(...)("test")
        # Load the few-shot examples from a local file.
        self.example_data = load_raw_dataset_from_file("examples.json")
```

For more details on how to customize the dataset, see this [guide](https://github.com/RUCAIBox/LLMBox/tree/main/docs/utilization/how-to-customize-dataset.md).
