Data Preparation

golololologol edited this page Jul 23, 2024 · 1 revision

The pipeline expects a custom dataset format. This ensures that you actually feed it standardized datasets, and it saves me the headache of trying to account for every dataset format implicitly within the code.

The dataset format it expects looks like this:
{"init": "Some system message", "conversations": [{"from": "human", "value": "user said hi"}, {"from": "gpt", "value": "AI said hi in return"}, {"from": "human", "value": "user says some verbal war crimes"}], "source": "aaa_dataset", "tags": []}

{"init": "", "conversations": [{"from": "human", "value": "Some completion slop"}], "source": "completion_dataset", "tags": ["completion"]}

For completion samples, it doesn't matter who the turn is from.

Here's the breakdown:
init: The system message to use for this conversation. Can be blank.

conversations: The turns of the conversation as a list of JSON objects. Each object must contain a from field with either human or gpt as the value, and a value field with the text of the turn itself. For the sample to be treated as completion, you must include a completion tag in the tags list for that sample.

source: The name of the dataset this conversation came from. It is currently unused within the pipeline, but can be pretty useful for filtering out datasets if you've combined multiple into one.

tags: Tags for this conversation. They are IMPORTANT, as they determine whether the conversation is treated as completion or instruct.

Currently only one tag is in use: "completion".
By default, without it, each conversation is treated as instruct.
With "completion", the conversation is treated as completion: no prompt formatting is added to it, and the system message is not included even if there is one.

Check out utils/dataset_converter.py, which might already contain a function to convert your dataset to the appropriate format.
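If nothing there fits your data, a converter is usually only a few lines. The sketch below is not taken from utils/dataset_converter.py; it assumes a hypothetical input record with instruction and output keys (plus an optional system key) and maps it to the format described on this page.

```python
def convert_pair(record, source_name):
    """Convert one hypothetical {"instruction", "output"} record to the pipeline format."""
    return {
        # Optional system message; blank if the record has none.
        "init": record.get("system", ""),
        "conversations": [
            {"from": "human", "value": record["instruction"]},
            {"from": "gpt", "value": record["output"]},
        ],
        "source": source_name,
        # No "completion" tag, so the sample is treated as instruct.
        "tags": [],
    }


# Example input in the assumed source format.
raw = {"instruction": "Say hi", "output": "Hi!"}
converted = convert_pair(raw, "my_dataset")
```

For multi-turn data, the same idea applies: walk the turns in order and map each speaker label in your source format to human or gpt.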
