SentenceTransformers is a Python framework for state-of-the-art sentence and text embeddings. The initial work is described in our paper Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.
You can use this framework to compute sentence / text embeddings for more than 100 languages. These embeddings can then be compared, e.g., with cosine similarity, to find sentences with a similar meaning. This is useful for semantic textual similarity, semantic search, or paraphrase mining.
The framework is based on PyTorch and Transformers and offers a large collection of pre-trained models tuned for various tasks. Further, it is easy to fine-tune your own models.
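As a rough illustration of what fine-tuning looks like, the sketch below uses the training API covered later in the Training section (InputExample, SentencesDataset, losses.CosineSimilarityLoss and model.fit); the two training pairs and their similarity labels are made up purely for illustration:

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, SentencesDataset, InputExample, losses

model = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')

# Toy training data: sentence pairs with a similarity score in [0, 1]
train_examples = [InputExample(texts=['My first sentence', 'My second sentence'], label=0.8),
                  InputExample(texts=['Another pair', 'Unrelated sentence'], label=0.3)]

train_dataset = SentencesDataset(train_examples, model)
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

# Tune the model; see docs/training/overview for the full set of options
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)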
You can install it using pip:
pip install -U sentence-transformers
We recommend Python 3.6 or higher and at least PyTorch 1.2.0; PyTorch 1.6.0 or higher is recommended and required for some features. See installation for further installation options, especially if you want to use a GPU.
The usage is as simple as:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')
# The sentences we want to encode
sentences = ['This framework generates embeddings for each input sentence',
             'Sentences are passed as a list of strings.',
             'The quick brown fox jumps over the lazy dog.']

# Sentences are encoded by calling model.encode()
embeddings = model.encode(sentences)

# Print the embeddings
for sentence, embedding in zip(sentences, embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")
Our models are evaluated extensively and achieve state-of-the-art performance on various tasks. Further, the code is optimized for high encoding speed.
Model | STS benchmark (Spearman correlation) | SentEval (avg. accuracy) |
---|---|---|
Avg. GloVe embeddings | 58.02 | 81.52 |
BERT-as-a-service avg. embeddings | 46.35 | 84.04 |
BERT-as-a-service CLS-vector | 16.50 | 84.66 |
InferSent - GloVe | 68.03 | 85.59 |
Universal Sentence Encoder | 74.92 | 85.10 |
Sentence Transformer Models | | |
bert-base-nli-mean-tokens | 77.12 | 86.37 |
bert-large-nli-mean-tokens | 79.19 | 87.78 |
bert-base-nli-stsb-mean-tokens | 85.14 | 86.07 |
bert-large-nli-stsb-mean-tokens | 85.29 | 86.66 |
roberta-base-nli-stsb-mean-tokens | 85.44 | - |
roberta-large-nli-stsb-mean-tokens | 86.39 | - |
distilbert-base-nli-stsb-mean-tokens | 85.16 | - |
.. toctree::
   :maxdepth: 2
   :caption: Overview

   docs/installation
   docs/quickstart
   docs/pretrained_models
   docs/publications

.. toctree::
   :maxdepth: 2
   :caption: Usage

   docs/usage/computing_sentence_embeddings
   docs/usage/semantic_textual_similarity
   docs/usage/paraphrase_mining
   docs/usage/semantic_search

.. toctree::
   :maxdepth: 2
   :caption: Training

   docs/training/overview
   examples/training/multilingual/README
   examples/training/distillation/README

.. toctree::
   :maxdepth: 2
   :caption: Training Examples

   examples/training/sts/README
   examples/training/nli/README
   examples/training/quora_duplicate_questions/README

.. toctree::
   :maxdepth: 1
   :caption: Package Reference

   docs/package_reference/SentenceTransformer
   docs/package_reference/util
   docs/package_reference/models
   docs/package_reference/losses
   docs/package_reference/evaluation
   docs/package_reference/datasets