for-ai/instruct-multilingual
Instruct Multilingual

This repository contains code to translate datasets into multiple languages using the NLLB (No Language Left Behind) model.

Note: This repository has been tested on a Linux machine with an NVIDIA GPU; the code assumes access to a GPU. Depending on your hardware, you may need to modify the code to adjust the number of GPUs and the batch size.

Setup

It is recommended to use a virtual environment to install the dependencies.

conda create -n instructmultilingual python=3.8.10 -y
conda activate instructmultilingual
pip install -r requirements.txt

Inference Server

The inference server is a FastAPI application that can be used to translate a single text or entire datasets.

Convert the models first

For efficient inference, the model is converted using CTranslate2.

mkdir models
ct2-transformers-converter --model facebook/nllb-200-3.3B --output_dir models/nllb-200-3.3B-converted
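The converted directory can then be loaded with CTranslate2's `Translator` class. A minimal sketch; the `load_translator` and `as_target_prefix` helpers below are illustrative, not part of this repo. Note that NLLB models expect the target language code as a decoding prefix:

```python
def load_translator(model_dir, device="cuda"):
    """Load a CTranslate2-converted NLLB model.

    The import lives inside the function so the pure helper below
    can be used even where ctranslate2 is not installed.
    """
    import ctranslate2
    return ctranslate2.Translator(model_dir, device=device)

def as_target_prefix(lang_code, batch_size):
    """CTranslate2 takes the target language token as a per-example
    prefix; build one single-token prefix per source sentence."""
    return [[lang_code] for _ in range(batch_size)]

if __name__ == "__main__":
    translator = load_translator("models/nllb-200-3.3B-converted")
    # Tokenize inputs with the original Hugging Face tokenizer, then:
    # results = translator.translate_batch(
    #     token_batches,
    #     target_prefix=as_target_prefix("arz_Arab", len(token_batches)),
    # )
```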

Run the server locally

To start the server locally, run:

uvicorn instructmultilingual.server:app --host 0.0.0.0 --port 8000
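Once the server is up, it can be called over HTTP. A hedged sketch using only the standard library; the `/translate` endpoint path and the `text`/`source_language`/`target_language` field names are assumptions — check the request model in `instructmultilingual/server.py` for the exact schema:

```python
import json
import urllib.request

def build_payload(text, source_language, target_language):
    # Assumed field names; verify against the FastAPI request model.
    return {
        "text": text,
        "source_language": source_language,
        "target_language": target_language,
    }

def translate(text, source_language, target_language,
              url="http://localhost:8000/translate"):
    """POST one text to the inference server and return the parsed JSON."""
    body = json.dumps(
        build_payload(text, source_language, target_language)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```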

Using docker to run the server

To run the server with Docker, build the image and then start a container:

# Build
docker build -t instruct-multilingual .

# Run
docker run -it --rm --gpus 1 -p 8000:8000 -v $(pwd):/instruct-multilingual instruct-multilingual

Client Side

This script translates the samsum dataset using the inference server, showcasing how to translate both a single text and an entire dataset.

python main.py
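Translating a whole dataset one example at a time is slow; a client typically groups examples into batches before posting them to the server. A small illustrative helper (not taken from `main.py`):

```python
def batched(items, batch_size):
    """Yield successive batches of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```

Each batch can then be sent as one request, amortizing HTTP and tokenization overhead across many examples.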

Translate

We also provide a script to translate a single text from the CLI. It downloads the model from Hugging Face and translates the provided text into the target language.

python -m instructmultilingual.translate \
          --text="Cohere For AI will make the best instruct multilingual model in the world" \
          --source_language="English" \
          --target_language="Egyptian Arabic"
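Internally, NLLB identifies languages by FLORES-200 codes rather than English names, so the script has to map a name like "Egyptian Arabic" to a code like `arz_Arab`. A few real FLORES-200 codes for illustration; the mapping shipped in this repo is the authoritative source:

```python
# A small excerpt of the FLORES-200 language codes that NLLB uses.
LANGUAGE_CODES = {
    "English": "eng_Latn",
    "Egyptian Arabic": "arz_Arab",
    "French": "fra_Latn",
    "German": "deu_Latn",
}
```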

Translate an instructional dataset from xP3 (or any dataset repo from HuggingFace Hub)

An example of using `translate_dataset_from_huggingface_hub` to translate PIQA with the finish_sentence_with_correct_choice template into the languages used by the Multilingual T5 (mT5) model:

from instructmultilingual.translate_datasets import translate_dataset_from_huggingface_hub

translate_dataset_from_huggingface_hub(
    repo_id="bigscience/xP3",
    train_set=["en/xp3_piqa_None_train_finish_sentence_with_correct_choice.jsonl"],
    validation_set=["en/xp3_piqa_None_validation_finish_sentence_with_correct_choice.jsonl"],
    test_set=[],
    dataset_name="PIQA",
    template_name="finish_sentence_with_correct_choice",
    splits=["train", "validation"],
    translate_keys=["inputs", "targets"],
    url="http://localhost:8000/translate",
    output_dir="/home/weiyi/instruct-multilingual/datasets",
    source_language="English",
    checkpoint="facebook/nllb-200-3.3B",
)
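The translated splits are written as JSON-lines files under `output_dir`, with the `inputs` and `targets` fields named in `translate_keys` above. A minimal sketch for reading one back:

```python
import json

def read_jsonl(lines):
    """Parse JSON-lines records; each non-empty line is one example."""
    return [json.loads(line) for line in lines if line.strip()]
```

For example, `examples = read_jsonl(open(path))` followed by `examples[0]["inputs"]` retrieves the translated prompt of the first example.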