Modalities · le1nux · Mar 17, 2024 · Mar 17, 2024 · Mar 21, 2024 · Mar 21, 2024
diff --git a/MMAP_DATASET_README.md b/MMAP_DATASET_README.md
@@ -112,3 +112,175 @@ def _build_doc_idx(documents, num_epochs, np_rng, separate_last_epoch):
     doc_idx_last = _build_doc_idx(documents, 1, np_rng, False)
     return np.concatenate((doc_idx_first, doc_idx_last))
 ```
+
+
+# Fine-tuning Datasets
+
+## Instruction Tuning
+Instruction tuning datasets, such as Bactrian or LIMA, come in diverse formats. Before instruction-tuning a model on one of these datasets, the user therefore has to convert it into the following JSONL format, which is inspired by FastChat. The listing below shows an example sample from the JSONL file. 
+The `id` is the incremental sample id. `conversations` contains the multi-turn messages between the different parties, here human users and a GPT model. Finally, the format allows further, arbitrary key-value pairs such as instructions and roles.
+
+```JSON
+{
+    "id": 0,
+    "conversations": [
+      {
+        "from": "human_1",
+        "value": "What is up?"
+      },
+      {
+        "from": "gpt",
+        "value": "Hello! How can I help you today?"
+      },
+      {
+        "from": "human_1",
+        "value": "Who are you?"
+      },
+      {
+        "from": "gpt",
+        "value": "You can call me Mody, and I was trained by the modalities team as a language model."
+      },
+      {
+        "from": "human_2",
+        "value": "Goodbye"
+      },
+      {
+        "from": "gpt",
+        "value": "Goodbye! If you have any more questions in the future, don't hesitate to ask."
+      }
+    ],
+
+    # optional / arbitrary key value pairs e.g.:
+    "instruction": "You are Mody, a helpful LLM trained by the modalities team",
+    "role": "Mody, a helpful LLM trained by the modalities team"
+}
+```
+All JSONL files for instruction tuning have to follow this format.
+
+Given a prepared JSONL file, the training / processing flow can be described as follows:
+During the instantiation of the MemMap file, we specify the JQ patterns that determine which fields of the JSON are tokenized, and additionally pass a list of special tokens, e.g., `<s>`, `</s>`, `<eod>`, to the constructor. 
+Each special token is mapped to a single token id, once, during the instantiation of the MemMap file. 
+
+When the dataloader iterates over the MemMap file, the `__getitem__()` method tokenizes the sample as specified in the JQ patterns list and enriches the resulting dictionary with the token ids of the special tokens that were precomputed during the MemMap file instantiation. 
+In other words, we extract the desired keys from the raw text dictionary, tokenize their content, build a new dictionary from the tokenized data, and add the representation of the special tokens to it.
+
+Given the MemMap parameterization
+
+```
+  tokenization_jq_patterns = [".conversations .value", ".instruction", ".role"]
+  pass_through_jq_patterns = [".id"]
+  special_tokens_map = {"b_instruction_token": "place_holder_token_100", ... }
+```
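+
+the tokenized sample has the following structure, where `<...>` stands for the tokenized form of the enclosed text:
+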
+```JSON
+{
+    "id": 0,
+    "conversations": [
+      {
+        "from": "human_1",
+        "from_tokenized": "<human_1>",
+        "value": "<What is up?>"
+      },
+      {
+        "from": "gpt",
+        "from_tokenized": "<gpt>",
+        "value": "<Hello! How can I help you today?>"
+      },
+      {
+        "from": "human_1",
+        "from_tokenized": "<human_1>",
+        "value": "<Who are you?>"
+      },
+      {
+        "from": "gpt",
+        "from_tokenized": "<gpt>",
+        "value": "<You can call me Mody, and I was trained by the modalities team as a language model.>"
+      },
+      {
+        "from": "human_2",
+        "from_tokenized": "<human_2>",
+        "value": "<Goodbye>"
+      },
+      {
+        "from": "gpt",
+        "from_tokenized": "<gpt>",
+        "value": "<Goodbye! If you have any more questions in the future, don't hesitate to ask.>"
+      }
+    ],
+
+    # optional / arbitrary key value pairs e.g.:
+    "instruction": "<You are Mody, a helpful LLM trained by the modalities team>",
+    "role": "<Mody, a helpful LLM trained by the modalities team>",
+    "special_tokens": {"bos_token": "<bos_token_id>", "eos_token": "<eos_token_id>", ... "unk_token", "mask_token", 
+                    "b_role_token", "e_role_token", "b_instruction_token", 
+                    "e_instruction_token"}
+}
+```
+
+
+The dataloader packs multiple samples into a `DatasetBatch` and calls the `Collator` to bring the batch of samples into the correct format for training. 
+
+The collator is instantiated with information on how to assemble the entire prompt from the `conversations` and the optional key-value pairs. 
+In practice, the YAML configuration has the following structure:
+
+```YAML
+special_tokens:
+    bos_token: <s>
+    eos_token: </s>
+    b_role_token: <r>
+    e_role_token: </r>
+    b_instruction_token: <i>
+    e_instruction_token: </i>
+
+loss_masking_jq_patterns:
+    - .conversations | select(.from == "human")
+    - .instruction
+    - .role
+
+message_construction: 
+  - b_role_token
+  - role
+  - e_role_token
+  - b_instruction_token
+  - instruction
+  - e_instruction_token
+  - conversations
+
+assistant_role: gpt
+```
+
+To reduce the complexity of this example, we assume that each word is represented by exactly one token and disregard punctuation. Similarly, we do not replace each word with its token id. 
+
+Given these simplifications, the batch is represented by the following data structure: 
+
+
+```JSON
+[
+  "samples" : torch.Tensor([
+
+    <
+    (b_instruction_token)
+    You are Mody, a helpful LLM trained by the modalities team
+    (e_instruction_token)
+
+    (b_role_token)
+    Mody, a helpful LLM trained by the modalities team
+    (e_role_token)
+
+    human_1: What is up?
+    gpt: Hello! How can I help you today?
+
+    human_1: Who are you?
+    gpt:
+    (b_assistant_token) 
+    You can call me Mody, and I was trained by the modalities team as a language model.
+    (e_assistant_token) 
+
+    human_2: Goodbye
+    gpt: Goodbye! If you have any more questions in the future, don't hesitate to ask.>
+    ...
+  ]
+  "targets": <equals samples just shifted by one token>
+  "loss_mask": 
+]
+```
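A minimal sketch of the collation step described in the README section above, under the same simplification that one word equals one token. All names (`SPECIAL`, `MESSAGE_CONSTRUCTION`, `build_sample`) are hypothetical and not part of the modalities API. The sketch also assumes that the loss mask is 1 only for target tokens taken from turns of the `assistant_role` and 0 for everything else (role, instruction, special tokens, speaker prefixes, human turns).

```python
from typing import Dict, List

# Hypothetical stand-ins for the YAML configuration shown in the README.
SPECIAL = {
    "b_role_token": "<r>",
    "e_role_token": "</r>",
    "b_instruction_token": "<i>",
    "e_instruction_token": "</i>",
}
MESSAGE_CONSTRUCTION = [
    "b_role_token", "role", "e_role_token",
    "b_instruction_token", "instruction", "e_instruction_token",
    "conversations",
]
ASSISTANT_ROLE = "gpt"


def build_sample(sample: dict) -> Dict[str, List]:
    """Assemble one prompt and derive samples, targets and loss mask from it."""
    tokens: List[str] = []
    trainable: List[int] = []  # 1 where a token may contribute to the loss

    def extend(words: List[str], flag: int) -> None:
        tokens.extend(words)
        trainable.extend([flag] * len(words))

    for part in MESSAGE_CONSTRUCTION:
        if part in SPECIAL:
            extend([SPECIAL[part]], 0)
        elif part == "conversations":
            for turn in sample["conversations"]:
                # The speaker prefix is never trained on (assumption).
                extend([turn["from"] + ":"], 0)
                is_assistant = int(turn["from"] == ASSISTANT_ROLE)
                extend(turn["value"].split(), is_assistant)
        else:
            # Optional key-value pairs such as "role" or "instruction".
            extend(sample[part].split(), 0)

    return {
        "samples": tokens[:-1],      # model input
        "targets": tokens[1:],       # input shifted by one token
        "loss_mask": trainable[1:],  # aligned with the targets
    }
```

Aligning the mask with `targets` rather than `samples` means that a position counts towards the loss exactly when the token to be predicted belongs to an assistant turn.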
diff --git a/config_files/training/config_lorem_ipsum.yaml b/config_files/training/config_lorem_ipsum.yaml
@@ -2,7 +2,7 @@ settings:
   experiment_id: ${modalities_env:experiment_id}
   config_file_path: ${modalities_env:config_file_path}
   referencing_keys:
-    sample_key: input_ids
+    sample_key: tokenized_input
     target_key: target_ids
   training:
     training_log_interval_in_steps: 2
@@ -20,6 +20,12 @@ settings:
   paths:
     checkpointing_path: data/checkpoints
 
+tokenizer:
+  component_key: tokenizer
+  variant_key: gpt2_tokenizer_fast
+  config:
+    tokenizer_file: data/tokenizer/tokenizer_gpt2.json
+
 collate_fn:  
   component_key: collate_fn
   variant_key: gpt_2_llm_collator
@@ -31,8 +37,21 @@ train_dataset:
   component_key: dataset
   variant_key: packed_mem_map_dataset_continuous
   config:
-    raw_data_path: ./data/lorem_ipsum.pbin
+    raw_data_path: data/lorem_ipsum_instruct_multi_turn.jsonl
+    index_path: data/lorem_ipsum_instruct_multi_turn.idx
     sequence_length: ${settings.training.sequence_length}
+    block_size: ${settings.training.sequence_length}
+    tokenization_jq_patterns: 
+      ${settings.referencing_keys.sample_key}: .conversations
+    pass_through_jq_patterns:
+      raw_text:  .conversations
+
+    # tokenization_jq_patterns: 
+    #   - new_key: input_ids # ${settings.referencing_keys.sample_key}
+    #     jq_pattern: .text
+    # pass_through_jq_patterns:
+    #   - new_key: raw_text
+    #     jq_pattern: .text
     sample_key:  ${settings.referencing_keys.sample_key}
 
 train_dataloader:
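The new `tokenization_jq_patterns` and `pass_through_jq_patterns` entries each map an output key to a JQ pattern. A minimal sketch of how such a mapping might be applied to one JSONL line is given below. `resolve` and `process_line` are hypothetical names; `resolve` only handles plain dotted paths and is a stand-in for a real JQ query, and whitespace splitting stands in for the tokenizer. Tokenizing list values element by element is likewise an assumption.

```python
import json
from typing import Any, Callable, Dict, List, Optional


def resolve(pattern: str, record: Any) -> Any:
    """Resolve a dotted pattern such as '.a.b' (stand-in for a JQ query)."""
    value = record
    for key in filter(None, pattern.split(".")):
        value = value[key]
    return value


def process_line(
    line: str,
    tokenization_jq_patterns: Dict[str, str],
    pass_through_jq_patterns: Optional[Dict[str, str]] = None,
    tokenize: Callable[[str], List[str]] = str.split,
) -> Dict[str, Any]:
    """Build the per-sample dictionary from one raw JSONL line."""
    record = json.loads(line)
    item: Dict[str, Any] = {}
    # Fields selected for tokenization are stored under their new key.
    for new_key, pattern in tokenization_jq_patterns.items():
        value = resolve(pattern, record)
        if isinstance(value, str):
            item[new_key] = tokenize(value)
        else:
            item[new_key] = [tokenize(element) for element in value]
    # Pass-through fields are copied without tokenization.
    for new_key, pattern in (pass_through_jq_patterns or {}).items():
        item[new_key] = resolve(pattern, record)
    return item
```

With the configuration above, a call would look like `process_line(line, {"tokenized_input": ".conversations"}, {"raw_text": ".conversations"})`.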

diff --git a/data/lorem_ipsum_instruct_multi_turn.idx b/data/lorem_ipsum_instruct_multi_turn.idx
diff --git a/data/lorem_ipsum_instruct_multi_turn.jsonl b/data/lorem_ipsum_instruct_multi_turn.jsonl
@@ -0,0 +1,5 @@
+{"conversations": ["0 Who is the president of the United States", "Joe Biden"]}
+{"conversations": ["1 Who is the chancellor of Germany", "Olaf Scholz", "Thank you."]}
+{"conversations": ["2 What is the most effective weapon in CS? ", "Are you referring to Counter Strike 2?", "Yes.", "The most effective weapon from a damage point of view is the AWP"]}
+{"conversations": ["3 What is the capital of France", "Paris"]}
+{"conversations": ["4 What is the capital of Germany", "Berlin"]}
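The samples in this file store each conversation as a flat list of strings, whereas the README describes `from`/`value` dictionaries. A minimal sketch of a conversion between the two is shown below, assuming that turns strictly alternate between user and assistant and that the user speaks first. The function name and the default speaker labels are hypothetical.

```python
import json
from typing import Dict, List


def to_conversation_format(
    sample_id: int,
    turns: List[str],
    user: str = "human_1",
    assistant: str = "gpt",
) -> Dict:
    """Wrap a flat list of alternating turns into the README's JSONL format."""
    conversations = [
        # Even positions are user turns, odd positions assistant turns (assumption).
        {"from": user if position % 2 == 0 else assistant, "value": text}
        for position, text in enumerate(turns)
    ]
    return {"id": sample_id, "conversations": conversations}


# Usage: convert one flat sample and serialize it as a single JSONL line.
converted = to_conversation_format(3, ["What is the capital of France", "Paris"])
jsonl_line = json.dumps(converted)
```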
diff --git a/data/lorem_ipsum_sft.jsonl b/data/lorem_ipsum_sft.jsonl
@@ -0,0 +1,10 @@
+{"id": 0, "conversations": [{"from": "human_1", "value": "What is up?"}, {"from": "gpt", "value": "Hello! How can I help you today?"}, {"from": "human_1", "value": "Who are you?"}, {"from": "gpt", "value": "You can call me Mody, and I was trained by the modalities team as a language model."}, {"from": "human_2", "value": "Goodbye"}, {"from": "gpt", "value": "Goodbye! If you have any more questions in the future, don't hesitate to ask."}], "instruction": "You are Mody, a helpful LLM trained by the modalities team", "role": "Mody, a helpful LLM trained by the modalities team", "special_tokens": {"bos_token": "bos", "eos_token": "eos"}}
+{"id": 1, "conversations": [{"from": "human_1", "value": "How does AI work?"}, {"from": "gpt", "value": "AI, or artificial intelligence, operates by analyzing data and patterns to make decisions or predictions. Would you like a more detailed explanation?"}, {"from": "human_1", "value": "Yes, please."}, {"from": "gpt", "value": "Sure, AI systems, such as neural networks, are trained using large amounts of data. They learn to recognize patterns and make predictions. It's like how a child learns from experience."}], "instruction": "You are Mody, a helpful LLM trained by the modalities team", "role": "Mody, a helpful LLM trained by the modalities team", "special_tokens": {"bos_token": "bos", "eos_token": "eos"}}
+{"id": 2, "conversations": [{"from": "human_1", "value": "Can you write a poem?"}, {"from": "gpt", "value": "Of course! Give me a topic, and I'll write something for you."}, {"from": "human_1", "value": "Write about the sea."}, {"from": "gpt", "value": "The sea, a vast expanse, so deep and wide, Whispers secrets at the tide's low ebb, Its waves, the rhythm of time's endless web, A world of mystery beneath its hide."}, {"from": "human_2", "value": "That's beautiful, thanks!"}, {"from": "gpt", "value": "You're welcome! I'm glad you liked it. If you have any more requests, feel free to ask."}], "instruction": "You are Mody, a helpful LLM trained by the modalities team", "role": "Mody, a helpful LLM trained by the modalities team", "special_tokens": {"bos_token": "bos", "eos_token": "eos"}}
+{"id": 3, "conversations": [{"from": "human_1", "value": "What's the weather like today?"}, {"from": "gpt", "value": "I'm not able to check real-time data, but I can tell you how to find out using a weather app or website. Would that help?"}, {"from": "human_1", "value": "Yes, please."}, {"from": "gpt", "value": "You can use weather websites like Weather.com or apps like AccuWeather to get current weather conditions and forecasts."}], "instruction": "You are Mody, a helpful LLM trained by the modalities team", "role": "Mody, a helpful LLM trained by the modalities team", "special_tokens": {"bos_token": "bos", "eos_token": "eos"}}
+{"id": 4, "conversations": [{"from": "human_1", "value": "What is the capital of France?"}, {"from": "gpt", "value": "The capital of France is Paris. It's known for its art, culture, and history. Have you ever been there?"}, {"from": "human_1", "value": "No, but I'd love to visit someday."}, {"from": "gpt", "value": "It's a beautiful city with so much to see and do. I hope you get the chance to visit soon!"}], "instruction": "You are Mody, a helpful LLM trained by the modalities team", "role": "Mody, a helpful LLM trained by the modalities team", "special_tokens": {"bos_token": "bos", "eos_token": "eos"}}
+{"id": 5, "conversations": [{"from": "human_1", "value": "What is the meaning of life?"}, {"from": "gpt", "value": "The meaning of life is a philosophical question that has been debated for centuries. Some believe it's about finding happiness, others think it's about fulfilling a purpose. What do you think?"}, {"from": "human_1", "value": "I think it's about making the most of the time we have and finding joy in the little things."}, {"from": "gpt", "value": "That's a beautiful perspective. It's important to appreciate the present and find joy in everyday moments."}], "instruction": "You are Mody, a helpful LLM trained by the modalities team", "role": "Mody, a helpful LLM trained by the modalities team", "special_tokens": {"bos_token": "bos", "eos_token": "eos"}}
+{"id": 6, "conversations": [{"from": "human_1", "value": "What is the best way to learn a new language?"}, {"from": "gpt", "value": "The best way to learn a new language is to practice regularly, immerse yourself in the language, and use a variety of resources like books, apps, and language classes. It's also helpful to practice with native speakers."}, {"from": "human_1", "value": "Thank you for the advice."}, {"from": "gpt", "value": "You're welcome! Learning a new language can be a rewarding experience. If you have any more questions, feel free to ask."}], "instruction": "You are Mody, a helpful LLM trained by the modalities team", "role": "Mody, a helpful LLM trained by the modalities team", "special_tokens": {"bos_token": "bos", "eos_token": "eos"}}
+{"id": 7, "conversations": [{"from": "human_1", "value": "What is the best way to stay healthy?"}, {"from": "gpt", "value": "The best way to stay healthy is to eat a balanced diet, exercise regularly, get enough sleep, and manage stress. It's also important to see your doctor for regular check-ups and screenings."}, {"from": "human_1", "value": "Thank you for the advice."}, {"from": "gpt", "value": "You're welcome! Taking care of your health is important for your overall well-being. If you have any more questions, feel free to ask."}], "instruction": "You are Mody, a helpful LLM trained by the modalities team", "role": "Mody, a helpful LLM trained by the modalities team", "special_tokens": {"bos_token": "bos", "eos_token": "eos"}}
+{"id": 8, "conversations": [{"from": "human_1", "value": "What is the best way to study for exams?"}, {"from": "gpt", "value": "The best way to study for exams is to create a study schedule, break down the material into manageable chunks, and use active learning techniques like flashcards or practice tests. It's also important to get plenty of rest and stay hydrated."}, {"from": "human_1", "value": "Thank you for the advice."}, {"from": "gpt", "value": "You're welcome! Studying can be challenging, but with the right strategies, you can succeed. If you have any more questions, feel free to ask."}], "instruction": "You are Mody, a helpful LLM trained by the modalities team", "role": "Mody, a helpful LLM trained by the modalities team", "special_tokens": {"bos_token": "bos", "eos_token": "eos"}}
+{"id": 9, "conversations": [{"from": "human_1", "value": "What is the best way to save money?"}, {"from": "gpt", "value": "The best way to save money is to create a budget, track your expenses, and look for ways to cut costs. You can also set financial goals and automate your savings to make it easier to save."}, {"from": "human_1", "value": "Thank you for the advice."}, {"from": "gpt", "value": "You're welcome! Saving money is an important skill that can help you achieve your financial goals. If you have any more questions, feel free to ask."}], "instruction": "You are Mody, a helpful LLM trained by the modalities team", "role": "Mody, a helpful LLM trained by the modalities team", "special_tokens": {"bos_token": "bos", "eos_token": "eos"}}
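Since the README requires every instruction-tuning JSONL file to follow one format, a small check before building the index may help catch malformed lines early. The following is a minimal sketch that covers only the mandatory fields named in the README (`id`, `conversations`, and `from`/`value` per turn); `validate_sample` is a hypothetical helper, not part of modalities.

```python
import json
from typing import List


def validate_sample(line: str) -> List[str]:
    """Return the problems found in one JSONL line (empty list if none)."""
    try:
        sample = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(sample, dict):
        return ["sample is not a JSON object"]

    problems: List[str] = []
    if not isinstance(sample.get("id"), int):
        problems.append("'id' must be an integer")

    conversations = sample.get("conversations")
    if not isinstance(conversations, list) or not conversations:
        problems.append("'conversations' must be a non-empty list")
        return problems

    for i, turn in enumerate(conversations):
        if not isinstance(turn, dict):
            problems.append(f"conversations[{i}] must be an object")
            continue
        for key in ("from", "value"):
            if not isinstance(turn.get(key), str):
                problems.append(f"conversations[{i}].{key} must be a string")
    return problems
```

Optional key-value pairs such as `instruction` or `role` are deliberately left unchecked, because the format allows arbitrary additional keys.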
diff --git a/src/modalities/config/config.py b/src/modalities/config/config.py
@@ -261,7 +261,8 @@ class MemMapDatasetConfig(BaseModel):
     index_path: Optional[FilePath] = None
     sequence_length: Annotated[int, Field(strict=True, gt=1)]
     tokenizer: PydanticTokenizerIFType
-    jq_pattern: str
+    tokenization_jq_patterns: Dict[str, str]
+    pass_through_jq_patterns: Optional[Dict[str, str]] = None
     sample_key: str
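This change replaces the single `jq_pattern: str` field with a `tokenization_jq_patterns` dictionary, so existing configurations need to be adapted. A minimal sketch of such an adaptation on a plain dictionary follows. `migrate_dataset_config` is a hypothetical helper, and storing the old pattern under the `sample_key` is an assumption that mirrors the updated YAML config in this PR.

```python
from typing import Any, Dict


def migrate_dataset_config(old_config: Dict[str, Any]) -> Dict[str, Any]:
    """Translate a config using `jq_pattern` into the dictionary-based fields."""
    new_config = dict(old_config)  # leave the caller's dictionary untouched
    pattern = new_config.pop("jq_pattern", None)
    if pattern is not None and "tokenization_jq_patterns" not in new_config:
        # Assumption: the tokenized field is stored under the sample key.
        new_config["tokenization_jq_patterns"] = {new_config["sample_key"]: pattern}
    # The pass-through patterns are optional and default to None.
    new_config.setdefault("pass_through_jq_patterns", None)
    return new_config
```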