Tianyu Gao edited this page May 19, 2021 · 4 revisions

Welcome to the SimCSE Wiki!

The Python package SimCSE is a sentence embedding tool that lets you encode sentences into dense representations, build an index for large corpora, and search the index for semantically similar sentences. It is built upon our state-of-the-art sentence embedding model SimCSE: Simple Contrastive Learning of Sentence Embeddings. This Wiki shows you how to use the package; navigate it using the sidebar. This page covers the basic usage of the package.

First, install the simcse package from PyPI:

pip install simcse

Or install it directly from our source code:

python setup.py install

Note that if you want to enable GPU encoding, you should install a version of PyTorch that supports CUDA. See the official PyTorch website for instructions.
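To confirm that the installed PyTorch build can actually see a GPU, a quick check along these lines may help. This is a minimal sketch: the helper name detect_device is made up for illustration and is not part of simcse; only torch.cuda.is_available() is a real PyTorch call.

```python
import importlib.util


def detect_device():
    """Return "cuda" if PyTorch sees a GPU, "cpu" if it does not,
    or None if PyTorch is not installed at all.

    Hypothetical helper for this page, not part of the simcse package.
    """
    if importlib.util.find_spec("torch") is None:
        return None
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


print(detect_device())
```

If this prints cpu on a machine with a GPU, the installed PyTorch build most likely lacks CUDA support.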

After installing the package, you can load our model with just two lines of code:

from simcse import SimCSE
model = SimCSE("princeton-nlp/sup-simcse-bert-base-uncased")

See the model list for all available models.

Then you can use our model to encode sentences into embeddings:

embeddings = model.encode("A woman is reading.")

Compute the cosine similarities between two groups of sentences:

sentences_a = ['A woman is reading.', 'A man is playing a guitar.']
sentences_b = ['He plays guitar.', 'A woman is making a photo.']
similarities = model.similarity(sentences_a, sentences_b)
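For intuition about what a pairwise cosine similarity between two groups looks like, here is a minimal pure-Python sketch. It is not the package's implementation: the function name cosine_similarity_matrix and the toy 3-dimensional vectors are assumptions standing in for real sentence embeddings.

```python
import math


def cosine_similarity_matrix(group_a, group_b):
    """Return a len(group_a) x len(group_b) table of cosine similarities.

    Illustrative helper only; the vectors below are made-up stand-ins
    for sentence embeddings.
    """

    def norm(v):
        return math.sqrt(sum(x * x for x in v))

    return [
        [sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b)) for b in group_b]
        for a in group_a
    ]


# Toy vectors, one per "sentence" in each group.
emb_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
emb_b = [[0.0, 2.0, 0.0], [1.0, 1.0, 0.0]]

sims = cosine_similarity_matrix(emb_a, emb_b)
for row in sims:
    print(row)
```

Entry sims[i][j] compares the i-th vector of the first group with the j-th vector of the second, so two groups of two items give a 2 x 2 table.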

Or build an index for a group of sentences and search among them:

sentences = ['A woman is reading.', 'A man is playing a guitar.']
model.build_index(sentences)
results = model.search("He plays guitar.")
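Conceptually, building an index and searching it amounts to storing one vector per sentence and ranking the stored vectors by similarity to the query. The sketch below shows that idea with a brute-force search in pure Python. It is not how simcse is implemented: the ToyIndex class, its threshold and top_k parameters, and the 2-dimensional vectors are all assumptions made for illustration.

```python
import math


def _unit(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


class ToyIndex:
    """Hypothetical brute-force index over precomputed sentence vectors."""

    def __init__(self, sentences, embeddings):
        self.sentences = list(sentences)
        self.vectors = [_unit(v) for v in embeddings]

    def search(self, query_vector, threshold=0.6, top_k=5):
        """Return up to top_k (sentence, score) pairs with score >= threshold,
        best match first."""
        q = _unit(query_vector)
        scored = [
            (s, sum(a * b for a, b in zip(q, v)))
            for s, v in zip(self.sentences, self.vectors)
        ]
        hits = [(s, score) for s, score in scored if score >= threshold]
        hits.sort(key=lambda pair: pair[1], reverse=True)
        return hits[:top_k]


# Made-up 2-dimensional vectors standing in for real embeddings.
sentences = ['A woman is reading.', 'A man is playing a guitar.']
index = ToyIndex(sentences, [[1.0, 0.0], [0.0, 1.0]])

# A query vector that points mostly along the second sentence's direction.
results = index.search([0.1, 0.9])
print(results)
```

With these toy vectors only the guitar sentence clears the threshold, so the result holds a single (sentence, score) pair.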

We also support faiss, an efficient similarity search library. Just install the package following the instructions here, and simcse will automatically use faiss for efficient search.

WARNING: We have found that faiss does not support Nvidia Ampere GPUs (3090 and A100) well. If you use one of these, switch to a different GPU or install the CPU version of the faiss package.
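To see whether a faiss module is visible in the current Python environment, a check along these lines may help. The helper name faiss_available is made up for this sketch; it only tests whether a module named faiss can be found and says nothing about how simcse itself decides which search backend to use.

```python
import importlib.util


def faiss_available():
    """Return True if a module named "faiss" can be found, without importing it.

    Hypothetical helper for this page, not part of the simcse package.
    """
    return importlib.util.find_spec("faiss") is not None


print("faiss found:", faiss_available())
```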