Skip to content

Draft: Instruction tuning support #78

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 11 commits into
base: main
Choose a base branch
from
172 changes: 172 additions & 0 deletions MMAP_DATASET_README.md
Original file line number Diff line number Diff line change
Expand Up @@ -112,3 +112,175 @@ def _build_doc_idx(documents, num_epochs, np_rng, separate_last_epoch):
doc_idx_last = _build_doc_idx(documents, 1, np_rng, False)
return np.concatenate((doc_idx_first, doc_idx_last))
```


# Fine-tuning Datasets

## Instruction Tuning
Instruction tuning datasets, such as Bactrian or LIMA, generally come in diverse formats. Therefore, before instruction-tuning a model with one of these datasets the user has to transform the dataset into the following format JSONL, inspired by Fast Chat. The listing below showcases an exemplary sample from the JSONL file.
The `id` represents the incremental sample id. `Conversations` contains the multi-turn messages between different parties. Here, we depicted messages between a human and a gpt model. Finally, the format allows for the specification of further, arbitrary key-value pairs such as instructions and roles.

```JSON
{
"id": 0,
"conversations": [
{
"from": "human_1",
"value": "What is up?"
},
{
"from": "gpt",
"value": "Hello! How can I help you today?"
},
{
"from": "human_1",
"value": "Who are you?"
},
{
"from": "gpt",
"value": "You can call me Mody, and I was trained by the modalities team as a language model."
},
{
"from": "human_2",
"value": "Goodbye"
},
{
"from": "gpt",
"value": "Goodbye! If you have any more questions in the future, don't hesitate to ask."
}
]

# optional / arbitrary key value pairs e.g.:
"instruction": "You are Mody, a helpful LLM trained by the modalities team"
"role": "Mody, a helpful LLM trained by the modalities team"
}
```
All JSONL files for instruction tuning have to follow this format.

Given a prepared JSONL file, the training / processing flow can be described as follows:
During the instantiation of the MemMap file, we specify the JQ patterns that determine which fields in the JSON are supposed to be tokenized and additionally pass a list of special tokens e.g., `<s>`, `</s>`, `<eod>` etc. to the constructor.
Each one of the special tokens is mapped to a single, individual token id once during the instantation of the MemMap file.

When the dataloader iterates over the MemMap file, the `__get_item__()` method tokenizes the sample as specified in the JQ patterns list and enriches the resulting dictionary with the token ids of the special tokens that we pre-computed during the MemMap file instantiation.
In other words, we extract the desired keys from the raw text dictionary, tokenize the content, build a new dictionary with the tokenized data and add the representation of special tokens to it.

Given the MemMap parameterization

```
tokenization_jq_patterns = [".conversations .value", ".instruction", ".role"]
pass_through_jq_patterns = [".id"]
special_tokens_map = {"b_instruction_token": "place_holder_token_100", ... }
```

```JSON
{
"id": 0,
"conversations": [
{
"from": "human_1",
"from_tokenized": "<human_1>",
"value": "<What is up?>"
},
{
"from": "gpt",
"from_tokenized": "<gpt>",
"value": "<Hello! How can I help you today?>"
},
{
"from": "human_1",
"from_tokenized": "<human_1>",
"value": "<Who are you?>"
},
{
"from": "gpt",
"from_tokenized": "<gpt>",
"value": "<You can call me Mody, and I was trained by the modalities team as a language model.>"
},
{
"from": "human_2",
"from_tokenized": "<human_2>",
"value": "<Goodbye>"
},
{
"from": "gpt",
"from_tokenized": "<gpt>",
"value": "<Goodbye! If you have any more questions in the future, don't hesitate to ask.>"
}
]

# optional / arbitrary key value pairs e.g.:
"instruction": "<You are Mody, a helpful LLM trained by the modalities team>"
"role": "<Mody, a helpful LLM trained by the modalities team>"
"special_tokens": {"bos_token": "<bos_token_id>", "eos_token": "eos_token_id>", ... "unk_token", "mask_token",
"b_role_token", "e_role_token", "b_instruction_token",
"e_instruction_token"}
}
```


The dataloader packs multiple samples to a `DatasetBatch` and calls the `Collator` for bringing the batch of samples into the correct format training.

The collator is instantiated with information on how to assemble the entire prompt from the `conversations` and the optional key-value pairs.
In practice, the YAML configuration has the following structure

```YAML
special_tokens:
bos_token: <s>
eos_token: </s>
b_role_token: <r>
e_role_token: </r>
b_instruction_token: <i>
e_instruction_token: </i>

loss_masking_jq_patterns:
- .conversations | select(.from == "human")
- .instruction
- .role

message_construction:
- b_role_token
- role
- e_role_token
- b_instruction_token
- instruction
- e_instruction_token
- conversations

assistant_role: gpt
```

To reduce the complexity of this example, we assume that each word is resembled by exactly one token and disregard punctuation. Similarly, we also did not replace each word by its token id.

Given the simplification, the batch is represented by the following data structure:


```JSON
[
"samples" : torch.Tensor([

<
(b_instruction_token)
You are Mody, a helpful LLM trained by the modalities team
(e_instruction_token)

(b_role_token)
Mody, a helpful LLM trained by the modalities team
(b_role_token)

human_1: What is up?
gpt: Hello! How can I help you today?

human_1: Who are you?
gpt:
(b_assistant_token)
You can call me Mody, and I was trained by the modalities team as a language model.
(e_assistant_token)

human_2: Goodbye
gpt: Goodbye! If you have any more questions in the future, don't hesitate to ask.>
...
]
"targets": <equals samples just shifted by one token>
"loss_mask":
]
```
23 changes: 21 additions & 2 deletions config_files/training/config_lorem_ipsum.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@ settings:
experiment_id: ${modalities_env:experiment_id}
config_file_path: ${modalities_env:config_file_path}
referencing_keys:
sample_key: input_ids
sample_key: tokenized_input
target_key: target_ids
training:
training_log_interval_in_steps: 2
Expand All @@ -20,6 +20,12 @@ settings:
paths:
checkpointing_path: data/checkpoints

tokenizer:
component_key: tokenizer
variant_key: gpt2_tokenizer_fast
config:
tokenizer_file: data/tokenizer/tokenizer_gpt2.json

collate_fn:
component_key: collate_fn
variant_key: gpt_2_llm_collator
Expand All @@ -31,8 +37,21 @@ train_dataset:
component_key: dataset
variant_key: packed_mem_map_dataset_continuous
config:
raw_data_path: ./data/lorem_ipsum.pbin
raw_data_path: data/lorem_ipsum_instruct_multi_turn.jsonl
index_path: data/lorem_ipsum_instruct_multi_turn.idx
sequence_length: ${settings.training.sequence_length}
block_size: ${settings.training.sequence_length}
tokenization_jq_patterns:
${settings.referencing_keys.sample_key}: .conversations
pass_through_jq_patterns:
raw_text: .conversations

# tokenization_jq_patterns:
# - new_key: input_ids # ${settings.referencing_keys.sample_key}
# jq_pattern: .text
# pass_through_jq_patterns:
# - new_key: raw_text
# jq_pattern: .text
sample_key: ${settings.referencing_keys.sample_key}

train_dataloader:
Expand Down
Binary file added data/lorem_ipsum_instruct_multi_turn.idx
Binary file not shown.
5 changes: 5 additions & 0 deletions data/lorem_ipsum_instruct_multi_turn.jsonl
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
{"Conversations": ["0 Who is the president of the United States", "Joe Biden"]}
{"Conversations": ["1 Who is the chancellor of Germany", "Olaf Scholz", "Thank you."]}
{"conversations": ["2 What is the most effective weapon in CS? ", "Are you referring to Counter Strike 2?", "Yes.", "The most effective weapon from a damage point of view is the AWP"]}
{"conversations": ["3 What is the capital of France", "Paris"]}
{"conversations": ["4 What is the capital of Germany", "Berlin"]}
10 changes: 10 additions & 0 deletions data/lorem_ipsum_sft.jsonl
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
{"id": 0, "conversations": [{"from": "human_1", "value": "What is up?`"}, {"from": "gpt", "value": "Hello! How can I help you today?"}, {"from": "human_1", "value": "Who are you?"}, {"from": "gpt", "value": "You can call me Mody, and I was trained by the modalities team as a language model."}, {"from": "human_2", "value": "Goodbye"}, {"from": "gpt", "value": "Goodbye! If you have any more questions in the future, don't hesitate to ask."}], "instruction": "You are Mody, a helpful LLM trained by the modalities team", "role": "Mody, a helpful LLM trained by the modalities team", "special_tokens": {"bos_token": "bos", "eos_token": "eos"}}
{"id": 1, "conversations": [{"from": "human_1", "value": "How does AI work?"}, {"from": "gpt", "value": "AI, or artificial intelligence, operates by analyzing data and patterns to make decisions or predictions. Would you like a more detailed explanation?"}, {"from": "human_1", "value": "Yes, please."}, {"from": "gpt", "value": "Sure, AI systems, such as neural networks, are trained using large amounts of data. They learn to recognize patterns and make predictions. It's like how a child learns from experience."}], "instruction": "You are Mody, a helpful LLM trained by the modalities team", "role": "Mody, a helpful LLM trained by the modalities team", "special_tokens": {"bos_token": "bos", "eos_token": "eos"}}
{"id": 2, "conversations": [{"from": "human_1", "value": "Can you write a poem?"}, {"from": "gpt", "value": "Of course! Give me a topic, and I'll write something for you."}, {"from": "human_1", "value": "Write about the sea."}, {"from": "gpt", "value": "The sea, a vast expanse, so deep and wide, Whispers secrets at the tide's low ebb, Its waves, the rhythm of time's endless web, A world of mystery beneath its hide."}, {"from": "human_2", "value": "That's beautiful, thanks!"}, {"from": "gpt", "value": "You're welcome! I'm glad you liked it. If you have any more requests, feel free to ask."}], "instruction": "You are Mody, a helpful LLM trained by the modalities team", "role": "Mody, a helpful LLM trained by the modalities team", "special_tokens": {"bos_token": "bos", "eos_token": "eos"}}
{"id": 3, "conversations": [{"from": "human_1", "value": "What's the weather like today?"}, {"from": "gpt", "value": "I'm not able to check real-time data, but I can tell you how to find out using a weather app or website. Would that help?"}, {"from": "human_1", "value": "Yes, please."}, {"from": "gpt", "value": "You can use weather websites like Weather.com or apps like AccuWeather to get current weather conditions and forecasts."}], "instruction": "You are Mody, a helpful LLM trained by the modalities team", "role": "Mody, a helpful LLM trained by the modalities team", "special_tokens": {"bos_token": "bos", "eos_token": "eos"}}
{"id": 4, "conversations": [{"from": "human_1", "value": "What is the capital of France?"}, {"from": "gpt", "value": "The capital of France is Paris. It's known for its art, culture, and history. Have you ever been there?"}, {"from": "human_1", "value": "No, but I'd love to visit someday."}, {"from": "gpt", "value": "It's a beautiful city with so much to see and do. I hope you get the chance to visit soon!"}], "instruction": "You are Mody, a helpful LLM trained by the modalities team", "role": "Mody, a helpful LLM trained by the modalities team", "special_tokens": {"bos_token": "bos", "eos_token": "eos"}}
{"id": 5, "conversations": [{"from": "human_1", "value": "What is the meaning of life?"}, {"from": "gpt", "value": "The meaning of life is a philosophical question that has been debated for centuries. Some believe it's about finding happiness, others think it's about fulfilling a purpose. What do you think?"}, {"from": "human_1", "value": "I think it's about making the most of the time we have and finding joy in the little things."}, {"from": "gpt", "value": "That's a beautiful perspective. It's important to appreciate the present and find joy in everyday moments."}], "instruction": "You are Mody, a helpful LLM trained by the modalities team", "role": "Mody, a helpful LLM trained by the modalities team", "special_tokens": {"bos_token": "bos", "eos_token": "eos"}}
{"id": 6, "conversations": [{"from": "human_1", "value": "What is the best way to learn a new language?"}, {"from": "gpt", "value": "The best way to learn a new language is to practice regularly, immerse yourself in the language, and use a variety of resources like books, apps, and language classes. It's also helpful to practice with native speakers."}, {"from": "human_1", "value": "Thank you for the advice."}, {"from": "gpt", "value": "You're welcome! Learning a new language can be a rewarding experience. If you have any more questions, feel free to ask."}], "instruction": "You are Mody, a helpful LLM trained by the modalities team", "role": "Mody, a helpful LLM trained by the modalities team", "special_tokens": {"bos_token": "bos", "eos_token": "eos"}}
{"id": 7, "conversations": [{"from": "human_1", "value": "What is the best way to stay healthy?"}, {"from": "gpt", "value": "The best way to stay healthy is to eat a balanced diet, exercise regularly, get enough sleep, and manage stress. It's also important to see your doctor for regular check-ups and screenings."}, {"from": "human_1", "value": "Thank you for the advice."}, {"from": "gpt", "value": "You're welcome! Taking care of your health is important for your overall well-being. If you have any more questions, feel free to ask."}], "instruction": "You are Mody, a helpful LLM trained by the modalities team", "role": "Mody, a helpful LLM trained by the modalities team", "special_tokens": {"bos_token": "bos", "eos_token": "eos"}}
{"id": 8, "conversations": [{"from": "human_1", "value": "What is the best way to study for exams?"}, {"from": "gpt", "value": "The best way to study for exams is to create a study schedule, break down the material into manageable chunks, and use active learning techniques like flashcards or practice tests. It's also important to get plenty of rest and stay hydrated."}, {"from": "human_1", "value": "Thank you for the advice."}, {"from": "gpt", "value": "You're welcome! Studying can be challenging, but with the right strategies, you can succeed. If you have any more questions, feel free to ask."}], "instruction": "You are Mody, a helpful LLM trained by the modalities team", "role": "Mody, a helpful LLM trained by the modalities team", "special_tokens": {"bos_token": "bos", "eos_token": "eos"}}
{"id": 9, "conversations": [{"from": "human_1", "value": "What is the best way to save money?"}, {"from": "gpt", "value": "The best way to save money is to create a budget, track your expenses, and look for ways to cut costs. You can also set financial goals and automate your savings to make it easier to save."}, {"from": "human_1", "value": "Thank you for the advice."}, {"from": "gpt", "value": "You're welcome! Saving money is an important skill that can help you achieve your financial goals. If you have any more questions, feel free to ask."}], "instruction": "You are Mody, a helpful LLM trained by the modalities team", "role": "Mody, a helpful LLM trained by the modalities team", "special_tokens": {"bos_token": "bos", "eos_token": "eos"}}
3 changes: 2 additions & 1 deletion src/modalities/config/config.py
Original file line number Diff line number Diff line change
Expand Up @@ -261,7 +261,8 @@ class MemMapDatasetConfig(BaseModel):
index_path: Optional[FilePath] = None
sequence_length: Annotated[int, Field(strict=True, gt=1)]
tokenizer: PydanticTokenizerIFType
jq_pattern: str
tokenization_jq_patterns: Dict[str, str]
pass_through_jq_patterns: Optional[Dict[str, str]] = None
sample_key: str


Expand Down
Loading