[doc] more guides on loading datasets
huyiwen committed Jun 6, 2024
1 parent 56910ea commit 0938fb8
Showing 7 changed files with 599 additions and 499 deletions.
7 changes: 3 additions & 4 deletions README.md
@@ -57,7 +57,7 @@ bash bash/run_7b_ds3.sh
To utilize your own model or evaluate an existing model, you can run the following command:

```python
-python inference.py -m gpt-3.5-turbo -d copa # --num_shot 0 --model_type instruction
+python inference.py -m gpt-3.5-turbo -d copa # --num_shot 0 --model_type chat
```

By default, this runs the OpenAI GPT-3.5 Turbo model on the CoPA dataset in a zero-shot manner.
@@ -118,12 +118,11 @@ We provide broad support for Hugging Face models (e.g. `LLaMA-3`, `Mistral`, or
Currently a total of 56+ commonly used datasets are supported, including: `HellaSwag`, `MMLU`, `GSM8K`, `GPQA`, `AGIEval`, `CEval`, and `CMMLU`. For a full list of supported models and datasets, view the [utilization](https://github.com/RUCAIBox/LLMBox/tree/main/utilization) documentation.

```bash
-python inference.py \
+CUDA_VISIBLE_DEVICES=0 python inference.py \
     -m llama-2-7b-hf \
     -d mmlu agieval:[English] \
-    --model_type instruction \
+    --model_type chat \
     --num_shot 5 \
-    --cuda 0 \
     --ranking_type ppl_no_option
```

Empty file.
@@ -2,6 +2,8 @@

If you find that a dataset is not supported in the current version, feel free to implement your own dataset and submit a PR.

+See the full list of supported datasets [here](https://github.com/RUCAIBox/LLMBox/tree/main/docs/utilization/supported-datasets.md).

## Choose the Right Dataset

We provide two types of datasets: [`GenerationDataset`](https://github.com/RUCAIBox/LLMBox/tree/main/utilization/dataset/generation_dataset.py) and [`MultipleChoiceDataset`](https://github.com/RUCAIBox/LLMBox/tree/main/utilization/dataset/multiple_choice_dataset.py).
@@ -35,7 +37,7 @@ These are the attributes you can define in a new dataset:

- `example_set` (`Optional[str]`): The example split of the dataset. Example data will be automatically loaded if this is not None.

-- `load_args` (`Union[Tuple[str], Tuple[str, str], Tuple[()]]`, **required\***): Arguments for loading the dataset with huggingface `load_dataset`. See [load from source data](https://github.com/RUCAIBox/LLMBox/tree/main/docs/utilization/customize-dataset.md#load-from-source-data) for details.
+- `load_args` (`Union[Tuple[str], Tuple[str, str], Tuple[()]]`, **required\***): Arguments for loading the dataset with huggingface `load_dataset`. See [load from source data](https://github.com/RUCAIBox/LLMBox/tree/main/docs/utilization/how-to-customize-dataset.md#load-from-source-data) for details.

- `extra_model_args` (`Dict[str, Any]`): Extra arguments for the model like `temperature`, `stop` etc. See `set_generation_args`, `set_prob_args`, and `set_ppl_args` for details.

@@ -45,7 +47,7 @@ Then implement the following methods or properties:
- `references` (**required**): Return the reference answers for evaluation.
- `init_arguments`: Initialize the arguments for the dataset. This is called before the raw dataset is loaded.

-See [here](https://github.com/RUCAIBox/LLMBox/tree/main/docs/utilization/customize-dataset.md#advanced-topics) for advanced topics.
+See [here](https://github.com/RUCAIBox/LLMBox/tree/main/docs/utilization/how-to-customize-dataset.md#advanced-topics) for advanced topics.


## Load from Source Data
76 changes: 76 additions & 0 deletions docs/utilization/how-to-load-datasets-from-huggingface.md
@@ -0,0 +1,76 @@
# How to Load Datasets from Hugging Face

In this tutorial, we will learn how to download datasets from Hugging Face using the [`datasets`](https://huggingface.co/docs/datasets/en/index) library, a powerful tool that makes it easy to download and work with datasets hosted on [Hugging Face](https://huggingface.co/datasets).

See the full list of supported datasets [here](https://github.com/RUCAIBox/LLMBox/tree/main/docs/utilization/supported-datasets.md).

## Case 1: Directly load from Hugging Face

By default, `LLMBox` will handle everything for you. You just need to specify the dataset name in the command line.

```bash
python inference.py -m model -d mmlu
```

The dataset will be downloaded and cached in the `~/.cache/huggingface/datasets` directory.
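Under the hood, this amounts to a standard `datasets` call. Here is a minimal sketch, assuming the dataset resolves to the `cais/mmlu` repository on the Hub (LLMBox may resolve dataset names differently):

```python
# Minimal sketch, not LLMBox internals: the "cais/mmlu" path and the
# "abstract_algebra" subset are assumptions for illustration.
from datasets import load_dataset

# Downloaded on first use, then reused from ~/.cache/huggingface/datasets
dataset = load_dataset("cais/mmlu", "abstract_algebra", split="test")
print(dataset[0])
```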

## Case 2: Load from a Hugging Face mirror

To load a dataset from a Hugging Face mirror, use the `--hf_mirror` flag. The dataset will be downloaded from the mirror using `hfd.sh`.

This is an experimental feature and may not work in some environments. If you encounter any issues, please let us know.

```shell
python inference.py -m model -d mmlu --hf_mirror
```

`hfd.sh` is a slightly modified version of the `huggingface-cli` download [wrapper](https://gist.github.com/padeoe/697678ab8e528b85a2a7bddafea1fa4f/), which offers more stable and faster downloads than the original `huggingface-cli`.

`hfd.sh` will download the dataset from the Hugging Face mirror and cache it in the `~/.cache/huggingface/datasets` directory. Then `datasets` will load the dataset from the cache.

The next time you run the command, `datasets` will load the dataset directly from the cache:

```shell
python inference.py -m another-model -d mmlu
```
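If `hfd.sh` does not work in your environment, a generic fallback is to point the Hugging Face libraries at a mirror endpoint yourself. This relies on the standard `HF_ENDPOINT` variable honored by `huggingface_hub`, not on an LLMBox feature, and `hf-mirror.com` is only an example endpoint:

```bash
# Standard Hugging Face mechanism, independent of LLMBox; the mirror
# endpoint below is an example and may differ in your region.
export HF_ENDPOINT=https://hf-mirror.com
python inference.py -m model -d mmlu
```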

## Case 3: Load local dataset in offline mode

If you have already downloaded the dataset and want to load it in offline mode, you can use `--dataset_path` to specify the dataset path.

```shell
python inference.py -m model -d mmlu --dataset_path path/to/mmlu
```

The dataset will be loaded from the specified path.


```bash
# from a cloned directory of the huggingface dataset repository:
python inference.py -d copa --dataset_path /path/to/copa

# from a local (nested) directory saved by `dataset.save_to_disk`:
python inference.py -d race --dataset_path /path/to/race/middle
python inference.py -d race:middle --dataset_path /path/to/race
python inference.py -d race:middle --dataset_path /path/to/race/middle
python inference.py -d race:middle,high --dataset_path /path/to/race
```
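When running fully offline, you can additionally tell the Hugging Face libraries to skip all network access. These are standard `datasets`/`huggingface_hub` environment variables, not LLMBox options:

```bash
# Standard Hugging Face offline switches; assumes the dataset files are
# already available locally (via --dataset_path or the cache).
export HF_DATASETS_OFFLINE=1
export HF_HUB_OFFLINE=1
python inference.py -m model -d mmlu --dataset_path path/to/mmlu
```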

`dataset_path` can also accept a dataset file or a directory containing these files (supports json, jsonl, csv, and txt):
```bash
# load one split from one subset only
python inference.py -d gsm8k --dataset_path /path/to/gsm.jsonl
python inference.py -d race --dataset_path /path/to/race/middle/train.json

# load test and train splits from the middle subset (a directory containing `/path/to/race/middle/train.json` and `/path/to/race/middle/test.json`)
python inference.py -d race --dataset_path /path/to/race/middle --evaluation_set "test[:10]" --example_set "train"

# load test and train splits from middle and high subsets (a nested directory)
python inference.py -d race:middle,high --dataset_path /path/to/race --evaluation_set "test[:10]" --example_set "train"

# load test and train splits from middle and high subsets with a filename pattern
python inference.py -d race:middle,high --evaluation_set "test[:10]" --example_set "train" --dataset_path "/pattern/of/race_{subset}_{split}.json"
python inference.py -d mmlu --evaluation_set val --example_set dev --dataset_path "/pattern/of/mmlu/{split}/{subset}_{split}.csv"
```
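As a rough illustration of how such a pattern expands (the loader's actual logic may differ), the `{subset}` and `{split}` placeholders are filled in for every requested subset and split:

```python
# Illustrative sketch only: expand a {subset}/{split} filename pattern.
pattern = "/pattern/of/race_{subset}_{split}.json"
for subset in ("middle", "high"):
    for split in ("test", "train"):
        print(pattern.format(subset=subset, split=split))
# -> /pattern/of/race_middle_test.json, /pattern/of/race_middle_train.json, ...
```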
74 changes: 74 additions & 0 deletions docs/utilization/how-to-load-datasets-with-subsets.md
@@ -0,0 +1,74 @@
# How to Load Datasets with Subsets

Some datasets have multiple subsets. For example, the Massive Multitask Language Understanding (`mmlu`) dataset contains 57 subsets categorized into four groups: `stem`, `social_sciences`, `humanities`, and `other`.

Other datasets are themselves subsets of a larger dataset. For example, Choice Of Plausible Alternatives (`copa`) is a subset of `super_glue`.

See the full list of supported datasets [here](https://github.com/RUCAIBox/LLMBox/tree/main/docs/utilization/supported-datasets.md).

## Load from the Hugging Face server

We use the `datasets` library to load datasets from the Hugging Face server. If you have issues connecting to the Internet or the Hugging Face server, see [here](https://github.com/RUCAIBox/LLMBox/tree/main/docs/utilization/how-to-load-datasets-from-huggingface.md) for help.

Load a dataset that is a subset of another dataset (e.g. `copa`):

```shell
python inference.py -d copa
```

Load a dataset with multiple subsets (e.g. `mmlu`):

```shell
python inference.py -d mmlu:abstract_algebra,human_sexuality
```

In some cases, you may want to load a specific split of the dataset (e.g. `test`, `dev`, `validation`, ...). Both `evaluation_set` and `example_set` support the Hugging Face [String API](https://huggingface.co/docs/datasets/loading#slice-splits):

```shell
python inference.py -d race:middle,high --evaluation_set "test[:10]" --example_set "train"
```
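The slicing itself is handled by the `datasets` library; a rough stand-alone equivalent of the evaluation-set slice above would be:

```python
from datasets import load_dataset

# "test[:10]" keeps only the first 10 examples of the test split.
evaluation_data = load_dataset("race", "middle", split="test[:10]")
print(len(evaluation_data))  # 10
```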

## Understand the behaviour of subsets

By default we load all the subsets of a dataset:

```shell
python inference.py -m model -d mmlu
# expands to all 57 subsets
# equivalent: mmlu:abstract_algebra,human_sexuality,...
# equivalent: mmlu:[stem],[social_sciences],[humanities],[other]
```

```shell
python inference.py -m model -d arc
# equivalent: arc:ARC-Easy,ARC-Challenge
```

Unless a default subset is defined (see [supported datasets](https://github.com/RUCAIBox/LLMBox/tree/main/docs/utilization/supported-datasets.md) for all default subsets):

```bash
python inference.py -m model -d cnn_dailymail
# equivalent: cnn_dailymail:3.0.0
```
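If you are unsure which subsets a dataset ships with, you can list them with the plain `datasets` API (independent of LLMBox; the `cais/mmlu` path is an assumption for illustration):

```python
from datasets import get_dataset_config_names

# Prints all subset (config) names, e.g. 57 entries for MMLU.
print(get_dataset_config_names("cais/mmlu"))
```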

Some datasets, like GPQA (Google-Proof Q&A), need their example set loaded separately. You need to download the dataset to a local directory and provide the path to it:

```bash
# few-shot
python inference.py -m model -d gpqa --ranking_type generation --num_shot 5 --example_set "../gpqa/prompts"
```

## Overriding the `load_raw_dataset` function

Feel free to override this function if you want to load the dataset in a different way:

```python
from .utils import load_raw_dataset_from_file, get_raw_dataset_loader

class MyDataset(Dataset):
    def load_raw_dataset(self, dataset_path, subset_name, evaluation_set, example_set):
        # Load the evaluation split with the standard loader (arguments elided).
        self.evaluation_data = get_raw_dataset_loader(...)("test")
        # Load the few-shot examples from a local file.
        self.example_data = load_raw_dataset_from_file("examples.json")
```

For more details on how to customize the dataset, see this [guide](https://github.com/RUCAIBox/LLMBox/tree/main/docs/utilization/how-to-customize-dataset.md).
