No Error raised for a false/wrong tag and same results are obtained even the tag is changed #34

VinuraD · 2021-06-14T11:54:04Z

Hi,
I was testing different outputs with the 'laserembeddings' PyPI package. One thing I noticed is that it doesn't raise an error for an invalid language tag (such as 'xx', 'yy', or even single letters like 'x' and 'y'). Also, when I tried Sinhala (tag='si'), I still got an output embedding after changing the tag (to either a valid or an invalid one), and the outputs were always identical. How can this behavior be explained? How can we verify the results? Or could there be something wrong with my setup?
Python==3.7.10
torch==1.8.1+cu101

import numpy as np
from laserembeddings import Laser

laser = Laser()

embeddings = laser.embed_sentences(
    ["අත්සන් කළේ චරිත හේරත්"],
    lang='si')

embeddings2=laser.embed_sentences(
    ["අත්සන් කළේ චරිත හේරත්"],
    lang='y')

embeddings3=laser.embed_sentences(
    ["අත්සන් කළේ චරිත හේරත්"],
    lang='en')

embeddings4=laser.embed_sentences(
    ["A test sentence"],
    lang='si')  # a result is returned even though the tag does not match the sentence's language

comp = (embeddings2 == embeddings)    # element-wise comparison: 'y' vs 'si'
comp2 = (embeddings2 == embeddings3)  # element-wise comparison: 'y' vs 'en'
print(np.sum(comp))
print(np.sum(comp2))
print(comp.all())
print(embeddings4)

Result:

1024
1024
True
[[2.5131851e-03 4.6637398e-04 3.9160903e-05 ... 1.0697229e-02 1.6339000e-02 1.8368352e-02]]


yannvgn · 2021-06-20T07:03:06Z

Hi @VinuraD,

The sentences are first tokenized before being embedded.
In Facebook's original LASER implementation, the tokenization step relies on Moses. For portability reasons, I decided to use its Python port, Sacremoses, in laserembeddings (Moses is implemented in Perl; Sacremoses is pure Python).

To make the tokenization accurate, Moses uses language-specific lists of non-breaking prefixes (see: https://github.com/moses-smt/mosesdecoder/tree/master/scripts/share/nonbreaking_prefixes). If no list is defined for a language (which is often the case), it defaults to English. Moses displays a warning in that case; Sacremoses doesn't.

The language identifier is only used during this tokenization step; every further step is fully language-independent.

Sinhala (like 'xx', 'yy', 'x', 'y', etc.) has no list of non-breaking prefixes, so the tokenization rules for English are used. This explains why you get the same results in your example. As LASER was trained on Sinhala (https://github.com/facebookresearch/LASER/#supported-languages), there shouldn't be an issue (and for this reason I think a warning would be misleading in that case).
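
As a rough illustration of that fallback, here is a toy sketch. It is not Sacremoses' actual code: the `NONBREAKING_PREFIXES` table, its tiny sample lists, and the `prefixes_for` and `tokenize` helpers are all made up for this example. It only models the one behavior described above, namely that an unknown language code silently gets the English list.

```python
# Toy model of the fallback; NOT the real Sacremoses implementation.
# The prefix lists are small hypothetical samples, not the real Moses files.
NONBREAKING_PREFIXES = {
    "en": {"Mr", "Mrs", "Dr", "etc"},
    "de": {"Dr", "Nr", "bzw"},
}


def prefixes_for(lang):
    """Return the prefix list for `lang`, silently falling back to English."""
    return NONBREAKING_PREFIXES.get(lang, NONBREAKING_PREFIXES["en"])


def tokenize(sentence, lang):
    """Split a trailing period off each word unless the word is a known prefix."""
    prefixes = prefixes_for(lang)
    tokens = []
    for word in sentence.split():
        if len(word) > 1 and word.endswith(".") and word[:-1] not in prefixes:
            tokens.extend([word[:-1], "."])
        else:
            tokens.append(word)
    return tokens


sentence = "Dr. Herath signed."
# 'si', 'y' and 'xx' have no list of their own here, so they behave like 'en'.
print(tokenize(sentence, "si") == tokenize(sentence, "en"))
print(tokenize(sentence, "y") == tokenize(sentence, "xx"))
```

Because the tokens are identical for every tag that falls back to English, everything computed from them afterwards is identical too, whatever the tag was.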

Now, Sacremoses sometimes gives a slightly different output compared to Moses, which can lead to embedding differences between LASER and laserembeddings (see https://github.com/yannvgn/laserembeddings#will-i-get-the-exact-same-embeddings). Unfortunately, Sinhala is not included in the test set I use to check the consistency between LASER and laserembeddings. If you want to be sure, you could compare the embeddings produced by Facebook's original implementation with those produced by laserembeddings.
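
One possible way to run such a comparison is sketched below. It assumes you have already obtained, for the same sentence, one embedding from each implementation. The helper names and the short placeholder vectors are hypothetical (real LASER embeddings have 1024 dimensions), and the vectors are assumed to be non-zero. A tolerance-based measure is used rather than `==`, since small floating-point differences are expected.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def max_abs_difference(u, v):
    """Largest element-wise absolute difference between two vectors."""
    return max(abs(a - b) for a, b in zip(u, v))


# Hypothetical placeholder values standing in for one row of each output.
reference = [0.0025, 0.0005, 0.0107, 0.0163]  # e.g. from the original LASER
candidate = [0.0025, 0.0005, 0.0107, 0.0164]  # e.g. from laserembeddings

print("cosine similarity:", cosine_similarity(reference, candidate))
print("max abs difference:", max_abs_difference(reference, candidate))
```

Both helpers only iterate over their arguments, so a row of the array returned by `embed_sentences` can be passed in directly. A cosine similarity very close to 1 would suggest the two implementations agree for that sentence.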

I hope this helps!

VinuraD · 2021-06-23T16:33:51Z

@yannvgn Thanks for the explanation, which clarifies a lot. One more question: how should the dissimilarities for tags such as 'km' or 'fy' in your comparison (https://github.com/yannvgn/laserembeddings/blob/master/tests/report/comparison-with-LASER.md) be interpreted? Does that mean these languages' embeddings aren't valid when using this particular version of 'laserembeddings'? Is there a specific way you plan to adjust or correct these differences in future versions?

VinuraD changed the title ~~No Error raise for false tags and same results are obtained even the tag is changed~~ No Error raised for a false/wrong tag and same results are obtained even the tag is changed Jun 14, 2021
