We are releasing all translations of the NLLB-200 model (MoE, 54.5B) on the following benchmarks:
- Translations of all 40602 directions in the FLORES-200 benchmark
- Translations of the non-FLORES benchmarks we evaluated our model on in the NLLB-200 paper. These include: Flores (v1), WAT, WMT, TICO, Mafand-MT, and Autshumato. See the details of each evaluation benchmark below.
- FLORES-200 benchmark translations can be downloaded here
- The translations of the other benchmarks can be downloaded here
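The FLORES-200 figure corresponds to every ordered pair of the 202 languages NLLB-200 translates (202 × 201 = 40602). A quick sketch with placeholder codes (the real codes look like `eng_Latn`, `fra_Latn`, ...):

```python
from itertools import permutations

# Placeholder stand-ins for the 202 FLORES-200 language codes covered by NLLB-200
codes = [f"lang{i:03d}_Scrp" for i in range(202)]

# Every ordered (source, target) pair is one translation direction
directions = [f"{src}-{tgt}" for src, tgt in permutations(codes, 2)]
print(len(directions))  # 202 * 201 = 40602
```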
- For FLORES-200, see here.
- For the other benchmarks: to prepare the references in a `$references_directory`, you will first need to manually download some of the corpora and agree to their terms of use. These include:
- Autshumato_MT_Evaluation_Set.zip from here
- MADAR.Parallel-Corpora-Public-Version1.1-25MAR2021.zip from here
- 2017-01-mted-test.tgz from here
- 2017-01-ted-test.tgz from here
- 2015-01-test.tgz from here
- 2014-01-test.tgz from here
Note that due to copyright issues, we cannot release translations of IWSLT and MADAR, but the `download_other_corpora.py` script allows you to reproduce our exact evaluation benchmark.
The listed files should be placed in a `$pre_downloaded_resources` directory. You will also need some of the MOSES scripts:

```shell
git clone https://github.com/moses-smt/mosesdecoder.git
MOSES=PATH_to_mosesdecoder python examples/nllb/evaluation/download_other_corpora.py -d $references_directory -p $pre_downloaded_resources
```
```shell
CORPUS=
METRIC=
python examples/nllb/evaluation/calculate_metrics.py \
    --corpus $CORPUS \
    --translate-dir ${generations} \
    --reference-dir ${references} \
    --metric $METRIC \
    --output-dir ${output}
```
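For intuition on what a chrF-style metric measures, here is a minimal pure-Python sentence-level sketch (character n-grams only, no word n-grams, no smoothing). It is an illustration, not the implementation behind `calculate_metrics.py`; use the script for reported scores.

```python
from collections import Counter


def char_ngrams(text, n):
    # Character n-grams with whitespace removed, as chrF does
    s = text.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def chrf(hypothesis, reference, max_order=6, beta=2.0):
    """Toy sentence-level chrF: average character n-gram precision/recall,
    combined into an F-score that weights recall by beta."""
    precisions, recalls = [], []
    for n in range(1, max_order + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # no n-grams of this order in one of the sides
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100.0 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

A perfect hypothesis scores 100, a fully disjoint one scores 0; real scores fall in between.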
Besides FLORES-200, these are the evaluation benchmarks we evaluate NLLB-200 on:
- Flores (v1): with a total of 8 directions, the original Flores dataset (Guzmán et al., 2019) pairs four low-resource languages with English in the Wikimedia domain:
- Nepali (ne, npi_Deva)
- Sinhala (si, sin_Sinh)
- Khmer (km, khm_Khmr)
- Pashto (ps, pbt_Arab)
- WAT: we select 3 languages paired with English (6 directions) from the WAT competition:
- Hindi (hin_Deva)
- Khmer (khm_Khmr)
- Burmese (mya_Mymr)
- WMT: we evaluate on the 15 WMT languages selected in Siddhant et al. (2020). The 15 languages paired with English in this set are:
- Czech (WMT 18, cs, ces_Latn)
- German (WMT 14, de, deu_Latn)
- Estonian (WMT 18, et, est_Latn)
- Finnish (WMT 19, fi, fin_Latn)
- French (WMT 14, fr, fra_Latn)
- Gujarati (WMT 19, gu, guj_Gujr)
- Hindi (WMT 14, hi, hin_Deva)
- Kazakh (WMT 19, kk, kaz_Cyrl)
- Lithuanian (WMT 19, lt, lit_Latn)
- Standard Latvian (WMT 17, lv, lvs_Latn)
- Romanian (WMT 16, ro, ron_Latn)
- Russian (WMT 19, ru, rus_Cyrl)
- Spanish (WMT 13, es, spa_Latn)
- Turkish (WMT 18, tr, tur_Latn)
- Chinese (simplified) (WMT 19, zh, zho_Hans)
- TICO: sampled from a variety of public sources containing COVID-19-related content, this dataset spans several domains (medical, news, conversational, etc.) and covers 36 languages. We pair 28 languages with English for a total of 56 directions.
- Mafand-MT: an African news corpus that covers 16 languages. We evaluate 7 languages paired with English and 5 others paired with French, for a total of 24 directions.
- Paired with English (en, eng_Latn):
- Hausa (hau, hau_Latn)
- Igbo (ibo, ibo_Latn)
- Luganda (lug, lug_Latn)
- Swahili (swa, swh_Latn)
- Setswana (tsn, tsn_Latn)
- Yoruba (yor, yor_Latn)
- Zulu (zul, zul_Latn)
- Paired with French (fr, fra_Latn):
- Bambara (bam, bam_Latn)
- Ewe (ewe, ewe_Latn)
- Fon (fon, fon_Latn)
- Mossi (mos, mos_Latn)
- Wolof (wol, wol_Latn)
- Autshumato: an evaluation set for machine translation of South African languages. It consists of 500 sentences from South African governmental data, translated separately by four different professional human translators for each of the 11 official South African languages. 9 of these languages are covered by NLLB-200:
- Afrikaans (afr_Latn)
- English (eng_Latn)
- Sepedi / Northern Sotho (nso_Latn)
- Sesotho / Southern Sotho (sot_Latn)
- Siswati/Swati (ssw_Latn)
- Setswana/Tswana (tsn_Latn)
- Xitsonga/Tsonga (tso_Latn)
- IsiXhosa/Xhosa (xho_Latn)
- IsiZulu/Zulu (zul_Latn)
There is no standard valid/test split, so we use the first half (250 sentences, yielding 1000 pairs) for validation and the second half for testing (see script).
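The split can be sketched as follows (a hypothetical helper, not the released script). With four translators, the first 250 sentences yield 4 × 250 = 1000 validation pairs:

```python
def autshumato_split(source, references, n_dev=250):
    """Split the 500-sentence Autshumato set into valid/test halves.

    source: list of 500 source sentences
    references: four lists of 500 translations (one per professional translator)
    Returns (dev_pairs, test_pairs) of (source, reference) tuples.
    """
    dev, test = [], []
    for ref in references:
        # First half of each translator's file goes to validation,
        # second half to test.
        dev.extend(zip(source[:n_dev], ref[:n_dev]))
        test.extend(zip(source[n_dev:2 * n_dev], ref[n_dev:2 * n_dev]))
    return dev, test
```

With 500 source sentences and 4 reference lists, both halves contain 1000 pairs each.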