No error raised for a false/wrong tag, and the same results are obtained even when the tag is changed #34
Hi @VinuraD,

Sentences are tokenized before being embedded. To make the tokenization accurate, Moses uses language-specific lists of non-breaking prefixes (see: https://github.com/moses-smt/mosesdecoder/tree/master/scripts/share/nonbreaking_prefixes). If no list is defined for a language (which is often the case), it defaults to English. Moses displays a warning in that case; Sacremoses doesn't.

The language identifier is only used during this tokenization step; every later step is language-independent. Sinhala (and also 'xx', 'yy', 'x', 'y', etc.) has no non-breaking prefix list, so the tokenization rules for English are used. This explains why you get the same results in your example. As LASER was trained on Sinhala (https://github.com/facebookresearch/LASER/#supported-languages), there shouldn't be an issue (and for this reason I think a warning would be misleading in that case).

Now, Sacremoses sometimes gives slightly different output compared to Moses, which can lead to embedding differences between LASER and laserembeddings (see https://github.com/yannvgn/laserembeddings#will-i-get-the-exact-same-embeddings). Unfortunately, Sinhala is not included in the test set I'm using to check the consistency between LASER and laserembeddings. If you want to be sure, you could compare the embeddings produced by Facebook's original implementation with those produced by laserembeddings.

I hope this helps!
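The fallback described above can be illustrated with a toy sketch. This is not the Sacremoses or Moses code: the prefix tables, `prefixes_for`, and `split_final_periods` are hypothetical names made up for illustration, and the tokenizer is deliberately simplistic. It only shows why a tag with no prefix list of its own ends up tokenized exactly like English.

```python
# Hypothetical prefix tables for illustration only; the real lists live in
# the Moses repository linked above and are much longer.
NONBREAKING_PREFIXES = {
    "en": {"Mr", "Mrs", "Dr", "etc"},
    "de": {"Hr", "Fr", "bzw"},
}


def prefixes_for(lang):
    """Return (prefixes, used_fallback) for a language tag.

    Any tag without its own list silently falls back to English,
    which mirrors the behavior described in the comment above.
    """
    if lang in NONBREAKING_PREFIXES:
        return NONBREAKING_PREFIXES[lang], False
    return NONBREAKING_PREFIXES["en"], True


def split_final_periods(text, lang):
    """Toy tokenizer: split a trailing period off each word,
    unless the word is a known non-breaking prefix."""
    prefixes, _ = prefixes_for(lang)
    tokens = []
    for word in text.split():
        if len(word) > 1 and word.endswith(".") and word[:-1] not in prefixes:
            tokens.extend([word[:-1], "."])
        else:
            tokens.append(word)
    return tokens


sentence = "Dr. Silva arrived."
tokens_en = split_final_periods(sentence, "en")
tokens_si = split_final_periods(sentence, "si")  # no list: falls back to English
tokens_xx = split_final_periods(sentence, "xx")  # invalid tag: same fallback
tokens_de = split_final_periods(sentence, "de")  # has its own (toy) list
```

In this sketch `tokens_si`, `tokens_xx`, and `tokens_en` are identical, while `tokens_de` differs because "Dr" is not in the toy German list. Identical tokens going into a language-independent encoder give identical embeddings.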
@yannvgn Thanks for the explanation, which clarifies a lot. Just to know, can the dissimilarities be justified, such as those for the tags 'km' or 'fy' in your comparison: https://github.com/yannvgn/laserembeddings/blob/master/tests/report/comparison-with-LASER.md? Does that mean these languages' embeddings aren't valid when using this particular version of 'laserembeddings'? Is there a specific way you adjust or correct the differences, maybe in future versions?
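One way to quantify such dissimilarities is to compare the two embeddings of the same sentence directly. The following is a minimal sketch in plain Python, assuming both embeddings are already available as equal-length lists of floats. The function names, the `tolerance` value, and the sample vectors are assumptions made for illustration; they are not taken from LASER, laserembeddings, or the comparison report.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def max_abs_difference(a, b):
    """Largest element-wise absolute difference between two vectors."""
    return max(abs(x - y) for x, y in zip(a, b))


def embeddings_match(a, b, tolerance=1e-5):
    """True if the vectors have the same length and every component
    agrees within `tolerance` (an arbitrary threshold chosen here)."""
    if len(a) != len(b):
        return False
    return max_abs_difference(a, b) <= tolerance


# Made-up 4-dimensional placeholders; real LASER embeddings have 1024 dimensions.
reference = [0.0025, 0.0005, 0.0107, 0.0163]  # e.g. from the original implementation
candidate = [0.0025, 0.0005, 0.0107, 0.0163]  # e.g. from laserembeddings
shifted = [0.0030, 0.0005, 0.0107, 0.0150]    # a deliberately different vector

same = embeddings_match(reference, candidate)
different = embeddings_match(reference, shifted)
similarity = cosine_similarity(reference, shifted)
```

A cosine similarity close to 1 combined with a small maximum difference suggests the two embeddings are interchangeable for most downstream uses; what counts as "small" depends on the application.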
Hi,
I was just testing different outputs with the 'laserembeddings' PyPI package. One thing I observed is that it doesn't raise an error for a false tag (such as 'xx', 'yy', or even single letters like 'x', 'y'). Also, when I tried the Sinhala language (tag='si'), I observed that I still get an output embedding if I change the tag (to a valid or a false one), and these outputs are always the same. How can this behavior be explained? How can we verify the results? Or could there be something wrong with my setup?
Python==3.7.10
torch==1.8.1+cu101
Result:
1024 1024 True [[2.5131851e-03 4.6637398e-04 3.9160903e-05 ... 1.0697229e-02 1.6339000e-02 1.8368352e-02]]