Model after fine-tuning doesn't produce the zero-width non-joiner (U+200C) and zero-width joiner (U+200D) characters #112

mkorunoski · 2025-01-21T10:14:12Z

I am trying to fine-tune the model on my own dataset, but I ran into a problem. The zero-width non-joiner (U+200C) and zero-width joiner (U+200D) characters are missing from the model's output after fine-tuning with the LoRA script provided here:

https://github.com/AI4Bharat/IndicTrans2/blob/8de6eca588cfcd7648464084199c4881c41f58ab/huggingface_interface/train_lora.py

In particular, the lines below in load_and_process_translation_dataset remove these characters:

complete_dataset["sentence_SRC"] += processor.preprocess_batch(
    src_lines, src_lang=src_lang, tgt_lang=tgt_lang, is_target=False
)

complete_dataset["sentence_TGT"] += processor.preprocess_batch(
    tgt_lines, src_lang=tgt_lang, tgt_lang=src_lang, is_target=True
)

After these calls, the characters are no longer present in the strings. Also, I've noticed that for sentence_TGT the src_lang and tgt_lang arguments are swapped. Is this intentional?
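
For anyone who wants to reproduce this, here is a minimal sketch of a check that counts both characters before and after a preprocessing step. The helper names (count_joiners, joiners_lost, strip_joiners) are hypothetical, and strip_joiners is only a stand-in that imitates the observed behaviour; it is not the library's actual implementation.

```python
ZWNJ = "\u200c"  # zero-width non-joiner
ZWJ = "\u200d"   # zero-width joiner


def count_joiners(lines):
    """Return (zwnj_count, zwj_count) summed over all lines."""
    return (
        sum(line.count(ZWNJ) for line in lines),
        sum(line.count(ZWJ) for line in lines),
    )


def joiners_lost(lines, preprocess):
    """Run `preprocess` on `lines` and report how many joiners disappeared."""
    before = count_joiners(lines)
    after = count_joiners(preprocess(lines))
    return (before[0] - after[0], before[1] - after[1])


def strip_joiners(lines):
    # Hypothetical stand-in for the real preprocessing call: it only
    # imitates the symptom described above by deleting both characters.
    return [line.replace(ZWNJ, "").replace(ZWJ, "") for line in lines]


# One line containing a ZWNJ and one containing a ZWJ.
sample = ["क्\u200cष", "क्\u200dष"]
print(joiners_lost(sample, strip_joiners))  # (1, 1)
```

To run the check against the real pipeline, replace strip_joiners with a callable that wraps the call from the script, for example lambda lines: processor.preprocess_batch(lines, src_lang=tgt_lang, tgt_lang=src_lang, is_target=True). A non-zero result would confirm that the characters are lost in preprocessing rather than during training.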
