Using stopes with an unseen language #16

sete-nay · 2022-11-02T12:49:24Z

Hi,
I'm trying to clean and preprocess bitext for finetuning NLLB on a new, unseen language. The source language is covered by laser3, but the target language is not. Will it work if I replace laser3 with a BPE encoder pre-trained on my target language?
Thank you!

python -m stopes.pipelines.bitext.global_mining_pipeline src_lang=fuv tgt_lang=zul demo_dir=.../stopes-repo/demo +preset=demo output_dir=. embed_text=laser3

Mortimerp9 · 2022-11-02T12:56:39Z

The laser3 encoder projects your text into a language-independent embedding space. Mining works by aligning the projections of the src_lang sentences in that space with the projections of the tgt_lang sentences in the same space. This works because both sides are projected into the same language-independent space, so we can compute a distance between the embeddings of any two sentences.

If you use a different encoder, it will probably not project into a compatible space.
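To make the alignment idea above concrete, here is a minimal sketch with toy vectors. It is not the stopes implementation: the function names, the plain cosine score, and the 0.8 threshold are all assumptions chosen for illustration, and the real pipeline works at a much larger scale. It only shows why both sides need to live in the same space: the score is meaningful only when a source vector and a target vector are comparable.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mine_pairs(src_embs, tgt_embs, threshold=0.8):
    """For each source embedding, keep its best-scoring target embedding
    if the score reaches the (hypothetical) threshold."""
    if not tgt_embs:
        return []
    pairs = []
    for i, u in enumerate(src_embs):
        scores = [cosine(u, v) for v in tgt_embs]
        j = max(range(len(scores)), key=scores.__getitem__)
        if scores[j] >= threshold:
            pairs.append((i, j, scores[j]))
    return pairs


# Toy embeddings standing in for encoder output (made-up numbers).
src = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
tgt = [[0.0, 0.9, 0.1], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]

pairs = mine_pairs(src, tgt)
for i, j, score in pairs:
    print(f"src sentence {i} <-> tgt sentence {j} (cosine {score:.3f})")
```

If the target vectors came from an unrelated encoder, the same code would still run, but the scores would not reflect whether two sentences are translations of each other.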

sete-nay · 2022-11-02T13:52:27Z

Thanks, will try it with laser3. What should I indicate in tgt_lang for the unseen language?

avidale · 2022-11-02T14:16:15Z

> What should I indicate in tgt_lang for the unseen language?

You can assign any name you want to the new language. If this name is abc, then you will need to indicate tgt_lang=abc in the entry command.

Also, make sure that the mining config correctly specifies where to find the source files for that language. If you use the demo config (+preset=demo in your command, which corresponds to this configuration), you will need the following two files:

$demo_dir/abc.gz with the source text in your language.
$demo_dir/abc.nl with the number of lines of the file above.
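A minimal sketch of how these two files could be produced from a list of sentences. The helper name write_demo_files is hypothetical, abc is just the placeholder language name used above, and the sketch assumes that the .nl file holds the line count as a single plain integer; check an existing pair of files in the demo directory to confirm the exact format.

```python
import gzip
import tempfile
from pathlib import Path


def write_demo_files(sentences, demo_dir, lang):
    """Write <lang>.gz (one sentence per line) and <lang>.nl (line count).

    Assumption: the .nl file contains only the number of lines as text.
    """
    demo_dir = Path(demo_dir)
    demo_dir.mkdir(parents=True, exist_ok=True)
    text_path = demo_dir / f"{lang}.gz"
    count_path = demo_dir / f"{lang}.nl"

    n_lines = 0
    with gzip.open(text_path, "wt", encoding="utf-8") as out:
        for sentence in sentences:
            # Collapse internal whitespace so each sentence stays on one line.
            line = " ".join(sentence.split())
            if not line:
                continue  # skip empty entries so the count matches the file
            out.write(line + "\n")
            n_lines += 1

    count_path.write_text(f"{n_lines}\n", encoding="utf-8")
    return text_path, count_path, n_lines


# Usage with a temporary directory standing in for $demo_dir.
with tempfile.TemporaryDirectory() as tmp:
    text_path, count_path, n = write_demo_files(
        ["first sentence", "   ", "second\nsentence"], tmp, "abc"
    )
    with gzip.open(text_path, "rt", encoding="utf-8") as f:
        lines = f.read().splitlines()
    count_text = count_path.read_text(encoding="utf-8").strip()
```

Counting lines while writing keeps the two files consistent even when empty entries are dropped.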

Finally, you will need to add the path to your custom encoder (and its vocabulary, if it is also custom) to the lang_configs part of the demo config.

heffernankevin · 2022-11-02T14:45:08Z

Hi @sete-nay, out of curiosity, what is your tgt_lang? LASER3 and LASER2 together cover over 200 languages. If the target language isn't covered by LASER3, it may be included in LASER2. You can find the list of supported languages for LASER2 here. If it's not in either of them, you could even try to train your own LASER3 encoder and mine with it. The training code to do so is here.

sete-nay · 2022-11-02T14:56:17Z

Hi @heffernankevin, my tgt_lang is Circassian (Kabardian), which unfortunately is not part of laser2 or laser3. Thanks for the hint, I will look into laser encoder training or otherwise just use a simpler tool. My goal is to create a parallel corpus that can be used for finetuning NLLB or another multilingual model on the Circassian language.
