Fix broken links in Dataset.md (#49)
chauhang authored Jul 30, 2023
2 parents 021ed8e + b61c45d commit 3cc2b37
Showing 1 changed file with 2 additions and 2 deletions.
docs/Dataset.md (4 changes: 2 additions & 2 deletions)
@@ -10,15 +10,15 @@ The provided fine tuning script allows you to select between three datasets by p

The list of available datasets can easily be extended with custom datasets by following these instructions.

- Each dataset has a corresponding configuration (dataclass) in [configs/dataset.py](../configs/dataset.py) which contains the dataset name, training/validation split names, as well as optional parameters like datafiles etc.
+ Each dataset has a corresponding configuration (dataclass) in [configs/datasets.py](../configs/datasets.py) which contains the dataset name, training/validation split names, as well as optional parameters like datafiles etc.
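
For illustration only, a custom entry in configs/datasets.py could follow a sketch like the one below; the field names other than the dataset name and the split names are hypothetical and not taken from the repository:

```python
# Hypothetical custom dataset configuration. Only the dataset name and the
# train/validation split names are implied by the description above; the
# data_path field stands in for an optional parameter such as a data file.
from dataclasses import dataclass


@dataclass
class my_custom_dataset:
    dataset: str = "my_custom_dataset"
    train_split: str = "train"
    test_split: str = "validation"
    data_path: str = "data/my_custom_data.json"
```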

Additionally, there is a preprocessing function for each dataset in the [ft_datasets](../ft_datasets) folder.
The returned data of the dataset needs to be consumable by the forward method of the fine-tuned model by calling ```model(**data)```.
For CausalLM models this usually means that the data needs to be in the form of a dictionary with "input_ids", "attention_mask" and "labels" fields.
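
As a sketch of that requirement (assuming a Hugging Face style tokenizer and the common convention of masking padding positions with -100 in the labels, neither of which is mandated by the text above), a dataset whose items can be passed as ```model(**data)``` might look like this:

```python
# Sketch of a PyTorch-style dataset whose items contain the "input_ids",
# "attention_mask" and "labels" fields expected by a CausalLM forward pass.
# The tokenizer call and the -100 padding mask are assumptions, not repo code.
from torch.utils.data import Dataset


class MyCustomDataset(Dataset):
    def __init__(self, texts, tokenizer, max_length=512):
        self.examples = []
        for text in texts:
            enc = tokenizer(
                text,
                max_length=max_length,
                padding="max_length",
                truncation=True,
                return_tensors="pt",
            )
            input_ids = enc["input_ids"].squeeze(0)
            attention_mask = enc["attention_mask"].squeeze(0)
            labels = input_ids.clone()
            labels[attention_mask == 0] = -100  # ignore padding in the loss
            self.examples.append(
                {
                    "input_ids": input_ids,
                    "attention_mask": attention_mask,
                    "labels": labels,
                }
            )

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        return self.examples[idx]
```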

To add a custom dataset the following steps need to be performed.

- 1. Create a dataset configuration after the schema described above. Examples can be found in [configs/dataset.py](../configs/dataset.py).
+ 1. Create a dataset configuration after the schema described above. Examples can be found in [configs/datasets.py](../configs/datasets.py).
2. Create a preprocessing routine which loads the data and returns a PyTorch style dataset. The signature for the preprocessing function needs to be (dataset_config, tokenizer, split_name) where split_name will be the string for train/validation split as defined in the dataclass.
3. Register the dataset name and preprocessing function by inserting it as key and value into the DATASET_PREPROC dictionary in [utils/dataset_utils.py](../utils/dataset_utils.py)
4. Set dataset field in training config to dataset name or use --dataset option of the llama_finetuning.py training script.
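
A minimal sketch of steps 2 and 3 for the hypothetical ```my_custom_dataset``` is shown below; the JSON layout, the ```data_path``` field, and the ```MyCustomDataset``` class come from the illustrations above, not from the repository:

```python
# Step 2: preprocessing routine with the (dataset_config, tokenizer, split_name)
# signature; it loads the raw data and returns a PyTorch-style dataset.
import json


def get_my_custom_dataset(dataset_config, tokenizer, split_name):
    with open(dataset_config.data_path) as f:  # data_path is the hypothetical config field
        records = json.load(f)
    texts = [r["text"] for r in records if r.get("split", "train") == split_name]
    return MyCustomDataset(texts, tokenizer)  # class sketched above


# Step 3: register the dataset name (key) and preprocessing function (value)
# in the DATASET_PREPROC dictionary in utils/dataset_utils.py.
DATASET_PREPROC = {
    # ... existing entries ...
    "my_custom_dataset": get_my_custom_dataset,
}
```

For step 4, training would then be started either by setting the dataset field in the training config to "my_custom_dataset" or by passing --dataset my_custom_dataset to llama_finetuning.py.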
