No Error raised for a false/wrong tag and same results are obtained even the tag is changed #34

Open
VinuraD opened this issue Jun 14, 2021 · 2 comments

Comments

@VinuraD

VinuraD commented Jun 14, 2021

Hi,
I was testing the 'laserembeddings' PyPI package and noticed that it doesn't raise an error for an invalid language tag (such as 'xx', 'yy', or even single letters like 'x' and 'y'). Also, when I embed a Sinhala sentence (tag='si'), I still get an output embedding if I change the tag to another valid tag or to an invalid one, and the outputs are always identical. How can this behavior be explained? How can we verify the results? Or could there be something wrong with my setup?
Python==3.7.10
torch==1.8.1+cu101

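import numpy as np
from laserembeddings import Laser

laser = Laser()

# The same Sinhala sentence, embedded with three different language tags
embeddings = laser.embed_sentences(
    ["අත්සන් කළේ චරිත හේරත්"],
    lang='si')

embeddings2 = laser.embed_sentences(
    ["අත්සන් කළේ චරිත හේරත්"],
    lang='y')  # invalid tag, yet no error is raised

embeddings3 = laser.embed_sentences(
    ["අත්සන් කළේ චරිත හේරත්"],
    lang='en')

embeddings4 = laser.embed_sentences(
    ["A test sentence"],
    lang='si')  # the tag doesn't match the sentence, but a result is still returned

# Element-wise comparison of the 1024-dimensional embeddings
comp = embeddings2 == embeddings
comp2 = embeddings2 == embeddings3
print(np.sum(comp))   # number of identical components ('y' vs 'si')
print(np.sum(comp2))  # number of identical components ('y' vs 'en')
print(comp.all())
print(embeddings4)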

Result:

1024
1024
True
[[2.5131851e-03 4.6637398e-04 3.9160903e-05 ... 1.0697229e-02 1.6339000e-02 1.8368352e-02]]

@VinuraD VinuraD changed the title No Error raise for false tags and same results are obtained even the tag is changed No Error raised for a false/wrong tag and same results are obtained even the tag is changed Jun 14, 2021
@yannvgn
Owner

yannvgn commented Jun 20, 2021

Hi @VinuraD,

The sentences are first tokenized before being embedded.
In Facebook's original LASER implementation, the tokenization step relies on Moses. For portability, I decided to use its Python port, Sacremoses, in laserembeddings (Moses is implemented in Perl; Sacremoses is pure Python).

To make the tokenization accurate, Moses uses language-specific lists of non-breaking prefixes (see: https://github.com/moses-smt/mosesdecoder/tree/master/scripts/share/nonbreaking_prefixes). If no list is defined for a language (which is often the case), it defaults to English. Moses displays a warning in that case; Sacremoses doesn't.
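
The fallback described above can be sketched in a few lines. This is only an illustration, not the actual code of Moses or Sacremoses: `PREFIX_LISTS`, `load_prefixes`, and the sample entries are made up for the example, and the `warn` flag merely mimics the difference in behaviour between the two tools.

```python
import warnings

# Hypothetical, heavily truncated prefix lists; the real ones live in the
# Moses repository linked above.
PREFIX_LISTS = {
    "en": ["Mr", "Mrs", "Dr", "Prof"],
    "de": ["Hr", "Fr", "Dr", "Nr"],
}


def load_prefixes(lang, warn=False):
    """Return the non-breaking prefixes for `lang`, falling back to English."""
    if lang in PREFIX_LISTS:
        return PREFIX_LISTS[lang]
    if warn:
        # Moses-like behaviour: tell the user about the fallback.
        warnings.warn(f"No prefix list for {lang!r}, using English instead")
    # Sacremoses-like behaviour when warn=False: fall back silently.
    return PREFIX_LISTS["en"]


# 'si', 'xx' and 'y' have no list of their own, so each one is compared
# against the English list it falls back to.
for tag in ("si", "xx", "y"):
    print(tag, load_prefixes(tag) == load_prefixes("en"))
```

With a silent fallback, an unknown tag is indistinguishable from 'en' as far as the prefix list is concerned, which is why no error surfaces.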

The language identifier is only used during this tokenization step; every subsequent step is language-independent.

Sinhala (like 'xx', 'yy', 'x', 'y', etc.) has no list of non-breaking prefixes, so the English tokenization rules are used. This explains why you get the same results in your example. As LASER was trained on Sinhala (https://github.com/facebookresearch/LASER/#supported-languages), there shouldn't be an issue (and for this reason I think a warning would be misleading in that case).
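
To see why the tag only matters when a language has its own prefix list, here is a toy tokenizer. It is a deliberately simplified stand-in for Moses/Sacremoses (the real rules are far more involved), and the prefix sets and the `toy_tokenize` name are invented for the example.

```python
# Hypothetical, truncated prefix sets; unknown tags fall back to English.
PREFIXES = {"en": {"Mr", "Dr"}, "de": {"Nr", "Dr"}}


def toy_tokenize(sentence, lang):
    """Split a trailing period off a word unless the word is a known prefix."""
    prefixes = PREFIXES.get(lang, PREFIXES["en"])
    tokens = []
    for word in sentence.split():
        if word.endswith(".") and word[:-1] not in prefixes:
            # Not a non-breaking prefix: the period becomes its own token.
            tokens.extend([word[:-1], "."])
        else:
            tokens.append(word)
    return tokens


text = "Nr. 5 is here."
# 'si' and 'y' share the English fallback, so their tokens are compared here.
print(toy_tokenize(text, "si") == toy_tokenize(text, "y"))
# 'de' has its own list containing "Nr", so "Nr." stays in one piece.
print(toy_tokenize(text, "si") == toy_tokenize(text, "de"))
```

Since the encoder only ever sees the tokens, two tags that yield identical tokens necessarily yield identical embeddings.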

That said, Sacremoses sometimes produces slightly different output compared to Moses, which can lead to embedding differences between LASER and laserembeddings (see https://github.com/yannvgn/laserembeddings#will-i-get-the-exact-same-embeddings). Unfortunately, Sinhala is not included in the test set I'm using to check the consistency between LASER and laserembeddings. If you want to be sure, you could compare the embeddings produced by Facebook's original implementation with those produced by laserembeddings.
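
One possible way to run such a comparison is to embed the same sentences with both implementations and measure how far apart the vectors are. The sketch below assumes the two embeddings are already available as plain sequences of floats (for example via `.tolist()` on a NumPy row). `compare_embeddings` and the tolerance value are illustrative choices, not part of either library.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def compare_embeddings(a, b, atol=1e-5):
    """Summarise how closely two embeddings of the same sentence agree."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    max_abs_diff = max(abs(x - y) for x, y in zip(a, b))
    return {
        "max_abs_diff": max_abs_diff,
        "close": max_abs_diff <= atol,  # atol is an arbitrary example threshold
        "cosine": cosine_similarity(a, b),
    }


# Made-up 4-dimensional vectors standing in for real 1024-dimensional ones.
reference = [0.10, 0.20, 0.30, 0.40]
candidate = [0.10, 0.20, 0.30, 0.41]
print(compare_embeddings(reference, candidate))
```

A tolerance-based check is more informative here than exact equality, because small tokenization differences can shift individual components slightly without changing the overall direction of the vector.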

I hope this helps!

@VinuraD
Author

VinuraD commented Jun 23, 2021

@yannvgn Thanks for the explanation, which clarifies a lot. One follow-up: how should the dissimilarities for tags such as 'km' or 'fy' in your comparison (https://github.com/yannvgn/laserembeddings/blob/master/tests/report/comparison-with-LASER.md) be interpreted? Does that mean the embeddings for these languages aren't valid when using this version of 'laserembeddings'? Is there a way you plan to reduce or correct these differences in future versions?
