chore: update readme, lower kto groq examples
vTuanpham committed Sep 9, 2024
1 parent 06d95b8 commit 7d11ddd
Showing 2 changed files with 14 additions and 7 deletions.
13 changes: 11 additions & 2 deletions README.md
@@ -10,8 +10,12 @@
<a href="https://colab.research.google.com/drive/1OEni8c9N9C_9Kf3ySt87goN7HDvRN3nw?usp=sharing">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab">
</a>
<a href="https://www.kaggle.com/code/tuanphamm/groq-translation-public">
<img src="https://kaggle.com/static/images/open-in-kaggle.svg" alt="Open In Kaggle">
</a>
</p>


The Large Dataset Translator is a powerful solution designed to efficiently translate large datasets into various languages. It offers a streamlined and parallelized translation process, ensuring fast results without the need for an API key. The tool supports multithreaded processing, enabling users to translate extensive datasets in less time. It also includes an automatic fail-restart mechanism, ensuring uninterrupted translation in case of any issues.

### Key Features
@@ -90,8 +94,10 @@ or locally with:
```sh
python examples/argilla-magpie-ultra-v0.1-groq/MagpieUltraV01.py
```

##### [trl-lib/kto-mix-14k](https://huggingface.co/datasets/trl-lib/kto-mix-14k)
<a href="https://www.kaggle.com/code/tuanphamm/groq-translation-public">
<img src="https://kaggle.com/static/images/open-in-kaggle.svg" alt="Open In Kaggle">

```sh
%run examples/kto-mix-14k-groq/KTOmix14k_groq_Parser.py
```
@@ -102,7 +108,10 @@ python examples/kto-mix-14k-groq/KTOmix14k_groq_Parser.py

This script is capable of translating approximately 100 examples every 6-7 minutes using Groq. To use it, you will need to obtain a free [API key](https://console.groq.com/keys) and set the environment variable by executing `export GROQ_API_KEY=<your_api_key>`.
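
A minimal sketch of checking for the key before starting a long run, so a missing variable fails immediately rather than partway through translation. The helper name `get_groq_api_key` is hypothetical and is not part of this repository; only the `GROQ_API_KEY` variable name comes from the text above.

```python
import os
from typing import Mapping


def get_groq_api_key(env: Mapping[str, str] = os.environ) -> str:
    """Return the Groq API key from the environment, or raise if it is missing.

    Hypothetical helper for illustration; the repository may read the key differently.
    """
    key = env.get("GROQ_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "GROQ_API_KEY is not set. Run `export GROQ_API_KEY=<your_api_key>` first."
        )
    return key
```

Calling `get_groq_api_key()` at the top of a script would surface the problem before any dataset is loaded.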

An example dataset that has utilized the above script: [1TuanPham/Vietnamese-magpie-ultra-v0.1](https://huggingface.co/datasets/1TuanPham/Vietnamese-magpie-ultra-v0.1)
Datasets that have used the above script:
* [1TuanPham/Vietnamese-magpie-ultra-v0.1](https://huggingface.co/datasets/1TuanPham/Vietnamese-magpie-ultra-v0.1)
* [1TuanPham/KTO-mix-14k-vietnamese-groq](https://huggingface.co/datasets/1TuanPham/KTO-mix-14k-vietnamese-groq)



## Usage
8 changes: 3 additions & 5 deletions examples/kto-mix-14k-groq/KTOmix14k_groq_Parser.py
@@ -1,5 +1,4 @@
import sys

sys.path.insert(0, r"./")
from tqdm.auto import tqdm
from datasets import load_dataset
@@ -13,7 +12,7 @@
)


PARSER_NAME = "KTOmix14kGroq"
PARSER_NAME = "KTOmix14kGroq_first2k"


# The parser callback processes the converted and translated data before saving; this is useful for post-processing the data so that it is compatible with trl's KTOTrainer
@@ -64,7 +63,7 @@ def __init__(
file_path: str,
output_path: str,
target_lang: str = "vi",
max_example_per_thread=10,
max_example_per_thread=3,
max_example_length=15000,
large_chunks_threshold=200,
max_list_length_per_thread=1, # GroqProvider JSON mode is still in beta and is unreliable, so we limit the list length to 1 so that a list of strings is translated as a single string
@@ -78,7 +77,6 @@ def __init__(
verbose=False, # Set verbose to True to see extra info of the parser process
target_config=KTOConfig,
target_fields=[
"system_prompt",
"conversation_history",
"agent_prompt_completion",
],
@@ -134,7 +132,7 @@ def convert(self) -> None:
data_converted.append(data_dict)

# Be sure to assign the final data list to self.converted_data
self.converted_data = data_converted
self.converted_data = data_converted[:2000] # Keep this low so that you don't exceed max request per day

return None
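
A rough sketch of what the 2000-example cap implies for run time, using the README's figure of approximately 100 examples every 6-7 minutes. Both function names are hypothetical, and the estimate assumes throughput stays constant, which the source does not guarantee.

```python
from typing import List, Tuple


def cap_examples(examples: List[dict], limit: int = 2000) -> List[dict]:
    """Keep only the first `limit` examples, mirroring `data_converted[:2000]`."""
    return examples[:limit]


def estimate_minutes(
    n_examples: int,
    batch_size: int = 100,
    minutes_per_batch: Tuple[float, float] = (6.0, 7.0),
) -> Tuple[float, float]:
    """Return a (low, high) run-time estimate in minutes.

    Assumes the README's rate of ~100 examples per 6-7 minutes holds throughout.
    """
    batches = n_examples / batch_size
    low, high = minutes_per_batch
    return batches * low, batches * high


if __name__ == "__main__":
    data = [{"id": i} for i in range(14000)]  # placeholder records, not real data
    capped = cap_examples(data)
    print(len(capped), estimate_minutes(len(capped)))
```

Under these assumptions the capped run of 2000 examples takes roughly 120 to 140 minutes.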
