
MuSR: A One-for-all Multilingual Sentence Representation Model for 223 Languages

musr_embedding is a Python package for MuSR, a Multilingual Sentence Representation model that supports 223 languages. It computes multilingual sentence embeddings.

Dependencies

Python >= 3.7.0
PyTorch >= 1.11.0
NumPy, tested with 1.19.2
Fairseq, tested with 0.12.2
SentencePiece, tested with 0.1.96
Gdown, tested with 4.7.1

Usage

import musr_embedding


model = musr_embedding.MuSR()
# The SentencePiece model, vocab file, and model checkpoint are downloaded automatically.

embeddings = model.embed_sentences(['The weather is good today.', '今天天气很好。'])
# embeddings is an N x Dim NumPy array, where N is the number of sentences and Dim is the sentence embedding dimension.
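
Because the result is a plain N x Dim NumPy array, a common next step is comparing sentences by cosine similarity. The following is a minimal sketch, not part of musr_embedding: `cosine_similarity_matrix` is a hypothetical helper, and `demo_embeddings` is a small hand-written stand-in for the output of `model.embed_sentences(...)`, so the block runs without downloading the model. Real MuSR embeddings will have a different dimension.

```python
import numpy as np


def cosine_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Return the N x N matrix of pairwise cosine similarities for an N x Dim array."""
    embeddings = np.asarray(embeddings, dtype=np.float64)
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    # Clip the norms so that an all-zero row does not cause a division by zero.
    normalized = embeddings / np.clip(norms, 1e-12, None)
    return normalized @ normalized.T


# Stand-in for model.embed_sentences(...): 2 sentences, 4 dimensions (illustrative only).
demo_embeddings = np.array([[1.0, 0.0, 2.0, 0.0],
                            [2.0, 0.0, 4.0, 0.0]])
similarities = cosine_similarity_matrix(demo_embeddings)
print(similarities)
```

With real embeddings, entry `[i, j]` of the returned matrix scores how close sentence `i` is to sentence `j`, which is one way to check whether a sentence and its translation land near each other.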

If you have already downloaded the model files to a specific directory, pass their paths explicitly:

import musr_embedding


path_to_spm_model = ...
path_to_vocab_file = ...
path_to_checkpoint = ...

model = musr_embedding.MuSR(
    spm_model=path_to_spm_model,
    vocab_file=path_to_vocab_file,
    model_path=path_to_checkpoint,
)
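
When passing local paths, it can help to confirm that each file exists before constructing the model. The sketch below is an assumption about a convenient workflow, not part of musr_embedding: `find_missing_files` is a hypothetical helper that takes the same keyword names as above and reports which ones do not point to an existing file.

```python
from pathlib import Path


def find_missing_files(**named_paths):
    """Return the argument names whose path is not an existing file."""
    # Keyword order is preserved, so the result lists names in the order given.
    return [name for name, path in named_paths.items() if not Path(path).is_file()]
```

Calling `find_missing_files(spm_model=path_to_spm_model, vocab_file=path_to_vocab_file, model_path=path_to_checkpoint)` returns an empty list when all three files are present, and otherwise names the arguments to fix.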

Supported languages

The MuSR model is trained on the following languages:

Acehnese (Arabic script), Acehnese (Latin script), Afrikaans, Akan, Algerian Arabic, Amharic, Armenian, Assamese, Asturian, Awadhi, Ayacucho Quechua, Balinese, Bambara, Banjar (Arabic script), Banjar (Latin script), Bashkir, Basque, Belarusian, Bemba, Bengali, Berber languages, Bhojpuri, Bosnian, Breton, Buginese, Bulgarian, Burmese, Catalan, Cebuano, Central Atlas Tamazight, Central Aymara, Central Kanuri (Arabic script), Central Kanuri (Latin script), Central Kurdish, Chamorro, Chhattisgarhi, Chinese (Simplified), Chinese (Traditional), Chokwe, Chuvash, Cornish, Crimean Tatar, Croatian, Czech, Danish, Dari, Divehi, Dutch, Dyula, Dzongkha, Eastern Panjabi, Eastern Yiddish, Egyptian Arabic, English, Esperanto, Estonian, Ewe, Faroese, Fijian, Filipino, Finnish, Fon, French, Friulian, Galician, Ganda, Georgian, German, Greek, Guarani, Gujarati, Haitian Creole, Halh Mongolian, Hausa, Hebrew, Hindi, Hungarian, Icelandic, Ido, Igbo, Ilocano / Iloko, Indonesian, Interlingua, Interlingue, Irish, Italian, Japanese, Javanese, Jingpho, Kabiyè, Kabuverdianu, Kabyle, Kamba, Kannada, Kashmiri (Arabic script), Kashmiri (Devanagari script), Kashubian, Kazakh, Khmer, Kikongo, Kikuyu, Kimbundu, Kinyarwanda, Korean, Kyrgyz, Lao, Latgalian, Latin, Ligurian, Limburgish, Lingala, Lingua Franca Nova, Lithuanian, Lojban, Lombard, Low German, Luba-Kasai, Luo, Luxembourgish, Macedonian, Magahi, Maithili, Malayalam, Maltese, Maori, Marathi, Meitei (Bengali script), Mesopotamian Arabic, Minangkabau (Latin script), Mizo, Modern Standard Arabic, Moroccan Arabic, Mossi, Najdi Arabic, Nepali, Nigerian Fulfulde, North Azerbaijani, North Levantine Arabic, Northern Kurdish, Northern Sotho, Northern Uzbek, Norwegian Bokmål, Norwegian Nynorsk, Nuer, Nyanja, Occitan, Odia, Pangasinan, Papiamento, Plateau Malagasy, Polish, Portuguese, Romanian, Rundi, Russian, Samoan, Sango, Sanskrit, Santali, Sardinian, Scottish Gaelic, Serbian, Serbo-Croatian, Shan, Shanghainese, Shona, Sicilian, Silesian, Sindhi, 
Sinhala, Slovak, Slovenian, Somali, South Azerbaijani, South Levantine Arabic, Southern Pashto, Southern Sotho, Southwestern Dinka, Spanish, Standard Latvian, Standard Malay, Standard Tibetan, Sundanese, Swahili, Swati, Swedish, Tagalog, Tajik, Tamasheq (Latin script), Tamasheq (Tifinagh script), Tamil, Tatar, Ta’izzi-Adeni Arabic, Telugu, Thai, Tigrinya, Tok Pisin, Tosk Albanian, Tsonga, Tswana, Tumbuka, Tunisian Arabic, Turkish, Turkmen, Twi, Ukrainian, Umbundu, Upper Sorbian, Urdu, Uyghur, Venetian, Vietnamese, Walloon, Waray, Welsh, West Central Oromo, Western Frisian, Western Persian, Wolof, Xhosa, Yoruba, Yue Chinese, and Zulu.

Acknowledgements

The code structure of this project is mainly adapted from LASER and laserembeddings.
