Data Preparation
The pipeline expects a custom dataset format. This ensures that you feed it standardized datasets, and it saves me the headache of trying to account for every dataset format implicitly within the code.
The dataset format that it expects looks like this:
{"init": "Some system message", "conversations": [{"from": "human", "value": "user said hi"}, {"from": "gpt", "value": "AI said hi in return"}, {"from": "human", "value": "user says some verbal war crimes"}], "source": "aaa_dataset", "tags": []}
{"init": "", "conversations": [{"from": "human", "value": "Some completion slop"}], "source": "completion_dataset", "tags": ["completion"]}
For completion samples, it doesn't matter who the turn is from.
Here's the breakdown:
init: The system message to use for this conversation. Can be blank.
conversations: The turns of the conversation as a list of JSON objects. Each object must contain a "from" field with either "human" or "gpt" as the value, and a "value" field with the text of the turn itself. For the sample to be treated as completion, you must include a "completion" tag in the tags list for that sample.
source: The name of the dataset this conversation came from. It is currently unused within the pipeline, but it can be pretty useful for filtering out datasets if you've included multiple within one.
tags: Tags for this conversation. They are IMPORTANT, as they are used to determine whether the conversation should be treated as completion or instruct.
Currently only one tag is in use: "completion".
By default, without it, each conversation is treated as instruct.
"completion" makes it so that the conversation is treated as completion: no prompt formatting is added to it, and the system message is not included even if there is one.
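The field rules above can be checked before handing a dataset to the pipeline. The following is a minimal sketch, not part of the pipeline itself: validate_sample and is_completion are hypothetical helper names, and the strictness of the checks (for example, requiring "source" to be a string) is an assumption based on the breakdown above.

```python
import json

# Allowed values for the "from" field, as described in the breakdown.
VALID_ROLES = ("human", "gpt")


def validate_sample(sample: dict) -> list:
    """Return a list of problems found in one sample (an empty list means none were found)."""
    problems = []
    if not isinstance(sample.get("init"), str):
        problems.append("'init' must be a string (it may be empty)")
    if not isinstance(sample.get("source"), str):
        problems.append("'source' must be a string")
    tags = sample.get("tags")
    if not isinstance(tags, list) or not all(isinstance(t, str) for t in tags):
        problems.append("'tags' must be a list of strings")
    turns = sample.get("conversations")
    if not isinstance(turns, list):
        problems.append("'conversations' must be a list")
        return problems
    for i, turn in enumerate(turns):
        if not isinstance(turn, dict):
            problems.append(f"turn {i} must be a JSON object")
            continue
        if turn.get("from") not in VALID_ROLES:
            problems.append(f"turn {i}: 'from' must be 'human' or 'gpt'")
        if not isinstance(turn.get("value"), str):
            problems.append(f"turn {i}: 'value' must be a string")
    return problems


def is_completion(sample: dict) -> bool:
    """A sample counts as completion only if it carries the "completion" tag."""
    return "completion" in sample.get("tags", [])


if __name__ == "__main__":
    line = (
        '{"init": "", "conversations": [{"from": "human", "value": "Some completion slop"}], '
        '"source": "completion_dataset", "tags": ["completion"]}'
    )
    sample = json.loads(line)
    print(validate_sample(sample), is_completion(sample))
```

Running a check like this over every line of a dataset file catches a mistyped role or a missing field early, before it turns into a confusing failure further down the pipeline.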
Check out utils/dataset_converter.py, which might already contain a function for converting your dataset to the appropriate format.
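If no existing converter fits, writing one by hand is usually short. The sketch below is not one of the functions in utils/dataset_converter.py. It assumes a hypothetical input dataset whose records have "instruction", "input", and "output" keys, and it assumes the target file holds one JSON object per line, as in the examples above. convert_record and to_jsonl are illustrative names.

```python
import json


def convert_record(record: dict, source_name: str) -> dict:
    """Convert one hypothetical instruction/output record into the expected format."""
    prompt = record["instruction"]
    # Fold the optional extra input into the human turn when it is present.
    if record.get("input"):
        prompt = f"{prompt}\n\n{record['input']}"
    return {
        "init": "",  # no system message in this hypothetical source dataset
        "conversations": [
            {"from": "human", "value": prompt},
            {"from": "gpt", "value": record["output"]},
        ],
        "source": source_name,
        "tags": [],  # no "completion" tag, so the sample is treated as instruct
    }


def to_jsonl(records, source_name: str) -> str:
    """Serialize converted records as one JSON object per line."""
    return "\n".join(json.dumps(convert_record(r, source_name)) for r in records)


if __name__ == "__main__":
    rows = [{"instruction": "Say hi", "input": "", "output": "hi"}]
    print(to_jsonl(rows, "my_dataset"))
```

For a completion dataset, the same idea applies with a single turn per sample and "completion" added to the tags list.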