Collecting pre-training dataset for translation #2

gksoriginals · 2024-01-02T13:20:51Z

The OpenHathi approach to fine-tuning Llama 2 for Hindi first pre-trains the model on translation and then on next-word prediction. To follow the same approach, we need to collect English-to-Malayalam translation dataset(s).

Find existing English-to-Malayalam translation datasets and post them in this issue thread.
Translate a Wikipedia dataset to Malayalam using open-source models or Google Translate.
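
For the second task, a minimal sketch of how translated paragraphs could be collected into English/Malayalam pairs. Everything here is an assumption for illustration: the `build_pairs` and `to_jsonl` helpers, the `en`/`ml` field names, and the JSONL layout are hypothetical, and `placeholder_translate` is a stand-in that would be swapped for an open-source model or Google Translate. It is not the format OpenHathi uses.

```python
import json
from typing import Callable, Iterable, Iterator


def build_pairs(
    paragraphs: Iterable[str],
    translate: Callable[[str], str],
) -> Iterator[dict]:
    """Yield {"en": ..., "ml": ...} records for unique, non-empty paragraphs."""
    seen = set()
    for paragraph in paragraphs:
        # Collapse runs of whitespace so near-identical paragraphs deduplicate.
        text = " ".join(paragraph.split())
        if not text or text in seen:
            continue
        seen.add(text)
        translated = translate(text).strip()
        # Skip paragraphs for which the backend returned nothing.
        if translated:
            yield {"en": text, "ml": translated}


def to_jsonl(records: Iterable[dict]) -> str:
    """Serialize records as one JSON object per line."""
    # ensure_ascii=False keeps Malayalam script readable instead of \u escapes.
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


def placeholder_translate(text: str) -> str:
    """Hypothetical stand-in; replace with a real translation backend."""
    return f"<ml>{text}</ml>"


if __name__ == "__main__":
    sample = ["Kerala is a state in India.", "", "Kerala is a state in India."]
    print(to_jsonl(build_pairs(sample, placeholder_translate)))
```

Keeping the translator as a plain callable means the same loop works whichever backend we end up choosing, and the deduplication step avoids paying for the same paragraph twice.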

gksoriginals added the "good first issue" label Jan 2, 2024
