
Releases: UKPLab/sentence-transformers

v1.0.4 - Patch CLIPModel.save

01 Apr 06:35

It was not possible to fine-tune and save the CLIPModel. This release fixes that: CLIPModel can now be saved like any other model by calling model.save(path).
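
A minimal sketch of the workflow (the output path is chosen for illustration):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('clip-ViT-B-32')
# ... fine-tune the model on your data ...
model.save('output/my-clip-model')  # saving now works for CLIPModel as well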

v1.0.3 - Patch util.paraphrase_mining

22 Mar 08:15

v1.0.3 - Patch for util.paraphrase_mining method

v1.0.2 - Patch CLIPModel

19 Mar 21:44

v1.0.2 - Patch for CLIPModel, new Image Examples

  • Bugfix in CLIPModel: Inputs that were too long raised a RuntimeError; they are now truncated.
  • New util function: util.paraphrase_mining_embeddings, to find the most similar embeddings in a matrix (see the sketch after this list)
  • Image Clustering and Duplicate Image Detection examples added: more info
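
A minimal sketch of util.paraphrase_mining_embeddings (model name and sentences chosen for illustration); it returns a list of (score, index1, index2) triplets sorted by decreasing similarity:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('stsb-bert-base')
sentences = ['A man is eating food.', 'A man is eating a piece of bread.', 'The girl is carrying a baby.']
embeddings = model.encode(sentences, convert_to_tensor=True)

# Find the most similar embedding pairs within the matrix
pairs = util.paraphrase_mining_embeddings(embeddings)
for score, i, j in pairs:
    print(f"{sentences[i]} <> {sentences[j]}: {score:.4f}")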

v1.0.0 - Improvements, New Models, Text-Image Models

18 Mar 20:57

This release brings many improvements and new features. The version numbering scheme has also been updated: we now use the format x.y.z, where x denotes major releases, y smaller releases with new features, and z bugfixes.

Text-Image-Model CLIP

You can now encode text and images in the same vector space using the OpenAI CLIP Model. You can use the model like this:

from sentence_transformers import SentenceTransformer, util
from PIL import Image

# Load CLIP model
model = SentenceTransformer('clip-ViT-B-32')

# Encode an image
img_emb = model.encode(Image.open('two_dogs_in_snow.jpg'))

# Encode text descriptions
text_emb = model.encode(['Two dogs in the snow', 'A cat on a table', 'A picture of London at night'])

# Compute cosine similarities
cos_scores = util.cos_sim(img_emb, text_emb)
print(cos_scores)

More Information
IPython Demo
Colab Demo

Examples of how to train the CLIP model on your own data will be added soon.

New Models

New Features

  • The Asym model can now be used as the first model in a SentenceTransformer modules list.
  • Changed sorting when encoding: Previously, sentences were encoded from short to long; now they are encoded from long to short. Out-of-memory errors therefore occur at the start, and the estimate of the encoding duration is more precise.
  • Improved util.semantic_search method: It now uses the much faster torch.topk function. Further, you can define which scoring function should be used.
  • New util methods: util.dot_score computes the dot product of two embedding matrices; util.normalize_embeddings normalizes embeddings to unit length (see the sketch after this list).
  • New parameter for the SentenceTransformer.encode method: normalize_embeddings. If set to True, embeddings are normalized to unit length; in that case the faster util.dot_score can be used instead of util.cos_sim to compute cosine similarity scores.
  • If you specify models.Transformer(do_lower_case=True) when creating a new SentenceTransformer, all input will be lower-cased.
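
A short sketch combining these additions (model name and texts chosen for illustration): with unit-length embeddings the faster util.dot_score can stand in for cosine similarity, and util.semantic_search accepts the scoring function as a parameter:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('stsb-bert-base')

queries = ['A man is eating pasta.']
corpus = ['A man is eating food.', 'A monkey is playing drums.', 'A cheetah chases its prey.']

# normalize_embeddings=True returns unit-length vectors
query_emb = model.encode(queries, convert_to_tensor=True, normalize_embeddings=True)
corpus_emb = model.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)

# For unit-length vectors, the dot product equals the cosine similarity
hits = util.semantic_search(query_emb, corpus_emb, score_function=util.dot_score)
for hit in hits[0]:
    print(corpus[hit['corpus_id']], hit['score'])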

New Examples

Bugfixes

  • The encode method now correctly returns token embeddings if output_value='token_embeddings' is specified (see the sketch after this list)
  • Bugfix of the LabelAccuracyEvaluator
  • Bugfix: tensors are no longer moved to the CPU when encode(sent, convert_to_tensor=True) is specified; they now stay on the GPU
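
A minimal sketch of requesting token embeddings (model name chosen for illustration):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('stsb-bert-base')
# Returns one matrix of token embeddings per input sentence
token_embs = model.encode(['Hello World'], output_value='token_embeddings')
print(token_embs[0].shape)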

Breaking changes:

  • SentenceTransformer.encode method: Removed the deprecated parameters is_pretokenized and num_workers

v0.4.1 - Faster Tokenization & Asymmetric Models

04 Jan 14:04

Refactored Tokenization

  • Faster tokenization speed: Using batched tokenization for training & inference - now, all sentences in a batch are tokenized simultaneously.
  • Usage of the SentencesDataset is no longer needed for training. You can pass your train examples directly to the DataLoader:
from torch.utils.data import DataLoader
from sentence_transformers import InputExample

train_examples = [InputExample(texts=['My first sentence', 'My second sentence'], label=0.8),
    InputExample(texts=['Another pair', 'Unrelated sentence'], label=0.3)]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
  • If you use a custom torch Dataset class: The dataset class must now return InputExample objects instead of tokenized texts (see the sketch after this list)
  • The SentenceLabelDataset class has been updated to the new tokenization flow: It always returns two or more InputExamples with the same label
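
A minimal sketch of a custom Dataset that yields InputExample objects (the class name and data layout are illustrative):

from torch.utils.data import Dataset
from sentence_transformers import InputExample

class PairDataset(Dataset):  # hypothetical helper class for illustration
    def __init__(self, sentence_pairs, labels):
        self.sentence_pairs = sentence_pairs
        self.labels = labels

    def __len__(self):
        return len(self.sentence_pairs)

    def __getitem__(self, idx):
        # Return an InputExample instead of pre-tokenized texts
        return InputExample(texts=list(self.sentence_pairs[idx]), label=self.labels[idx])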

Asymmetric Models
Added a new models.Asym class that allows different encoding of sentences based on a tag (e.g. query vs. paragraph). Minimal example:

from torch import nn
from sentence_transformers import SentenceTransformer, models, InputExample

base_model = 'distilroberta-base'  # any Huggingface transformer model works here
word_embedding_model = models.Transformer(base_model, max_seq_length=250)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
d1 = models.Dense(word_embedding_model.get_word_embedding_dimension(), 256, bias=False, activation_function=nn.Identity())
d2 = models.Dense(word_embedding_model.get_word_embedding_dimension(), 256, bias=False, activation_function=nn.Identity())
asym_model = models.Asym({'QRY': [d1], 'DOC': [d2]})
model = SentenceTransformer(modules=[word_embedding_model, pooling_model, asym_model])

# Your input examples have to look like this:
inp_example = InputExample(texts=[{'QRY': 'your query'}, {'DOC': 'your document text'}], label=1)

# Encoding (note: mixed inputs are not allowed)
model.encode([{'QRY': 'your query1'}, {'QRY': 'your query2'}])

Inputs that have the key 'QRY' will be passed through the d1 dense layer, while inputs with the key 'DOC' will be passed through the d2 dense layer.
More documentation on how to design asymmetric models will follow soon.

New Namespace & Models for Cross-Encoder
Cross-Encoders are now hosted at https://huggingface.co/cross-encoder. Also, new pre-trained models have been added for NLI & QNLI.

Logging
Log messages now use a custom logger from logging, thanks to PR #623. This allows you to choose which log messages you want to see from which components.
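
A sketch of tuning the log level per component with the standard logging module (the sub-logger names are assumed to follow the package layout):

import logging

# Show INFO messages from sentence-transformers as a whole ...
logging.basicConfig(format='%(asctime)s - %(message)s', level=logging.WARNING)
logging.getLogger('sentence_transformers').setLevel(logging.INFO)
# ... but silence a specific component, e.g. the evaluation module
logging.getLogger('sentence_transformers.evaluation').setLevel(logging.ERROR)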

Unit tests
A lot more unit tests have been added, which test the different components of the framework.

v0.4.0 - Upgrade Transformers Version

22 Dec 13:42
  • Updated the dependencies so that the package works with Huggingface Transformers version 4. Sentence-Transformers still works with Huggingface Transformers version 3, but an update to version 4 is recommended. Future changes might break compatibility with Transformers version 3.
  • New naming of pre-trained models. Models are named {task}-{transformer_model}, so 'bert-base-nli-stsb-mean-tokens' becomes 'stsb-bert-base'. Models are still available under their old names, but newer models will follow the updated naming scheme (see the sketch after this list).
  • New application examples for information retrieval and question answering retrieval, together with respective pre-trained models.
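
A small sketch of the renamed model scheme: both identifiers refer to the same pre-trained model, since old names remain available:

from sentence_transformers import SentenceTransformer

# Old naming, still available
model_old = SentenceTransformer('bert-base-nli-stsb-mean-tokens')
# New {task}-{transformer_model} naming
model_new = SentenceTransformer('stsb-bert-base')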

v0.3.9 - Small updates

18 Nov 08:25

This release only includes some smaller updates:

  • The code was tested with transformers 3.5.1; the requirement was updated so that it works with transformers 3.5.1
  • As some parts and models require PyTorch >= 1.6.0, the requirement was updated to require at least PyTorch 1.6.0. Most of the code and models will still work with older PyTorch versions.
  • model.encode() stored the embeddings on the GPU, which required quite a lot of GPU memory when encoding millions of sentences. The embeddings are now moved to the CPU once they are computed.
  • The CrossEncoder class now accepts a max_length parameter to control the truncation of inputs
  • The CrossEncoder predict method now has an apply_softmax parameter that applies softmax on top of a multi-class output (see the sketch below)
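
A minimal sketch of both parameters (the model name, taken from the cross-encoder namespace, and the input pair are chosen for illustration):

from sentence_transformers import CrossEncoder

# Truncate inputs to at most 512 tokens
model = CrossEncoder('cross-encoder/nli-distilroberta-base', max_length=512)

pairs = [('A man is eating pizza', 'A man eats something')]
# apply_softmax turns the multi-class logits into probabilities
scores = model.predict(pairs, apply_softmax=True)
print(scores)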

v0.3.8 - CrossEncoder, Data Augmentation, new Models

19 Oct 14:23
  • Added support for training and using CrossEncoders
  • Added the data augmentation method AugSBERT
  • New models trained on large-scale paraphrase data. They perform much better on internal benchmarks than previous models: distilroberta-base-paraphrase-v1 and xlm-r-distilroberta-base-paraphrase-v1
  • New model for information retrieval trained on MS MARCO: distilroberta-base-msmarco-v1
  • Improved MultipleNegativesRankingLoss: The similarity function can be changed and is now cosine similarity (previously dot product); further, similarity scores can be multiplied by a scaling factor. This allows the usage of NTXentLoss / InfoNCE loss (see the sketch after this list).
  • New MegaBatchMarginLoss, inspired by the ParaNMT paper.
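
A sketch of the updated loss (model name and values chosen for illustration, assuming the keyword names scale and similarity_fct):

from sentence_transformers import SentenceTransformer, losses, util

model = SentenceTransformer('distilroberta-base-paraphrase-v1')
# Cosine similarity (the new default) with a scale factor (1 / temperature);
# this corresponds to the NTXentLoss / InfoNCE formulation
train_loss = losses.MultipleNegativesRankingLoss(model, scale=20.0, similarity_fct=util.pytorch_cos_sim)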

Smaller changes:

  • Updated the InformationRetrievalEvaluator so that it can work with large corpora (millions of entries). Removed the query_chunk_size parameter from the evaluator.
  • The SentenceTransformer.encode method detaches tensors from the compute graph
  • SentenceTransformer.fit() method: The output_path_ignore_not_empty parameter is deprecated; the target folder is no longer required to be empty

v0.3.7 - Upgrade transformers, Model Distillation Example, Multi-Input to Transformers Model

29 Sep 20:17
  • Upgraded the transformers dependency; transformers 3.1.0, 3.2.0 and 3.3.1 are working
  • Added example code for model distillation: Sentence embedding models can be drastically reduced to e.g. only 2-4 layers while keeping 98+% of their performance. The code can be found in examples/training/distillation
  • Transformer models can now accept two inputs ['sentence 1', 'context for sent1'], which are encoded as the two inputs for BERT.

Minor changes:

  • Tokenization in the multi-process encoding setup now happens in the child processes, not in the parent process.
  • Added models.Normalize() to allow normalization of embeddings to unit length (see the sketch below)
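
A minimal sketch that adds models.Normalize() as the last module so the produced sentence embeddings have unit length (the base model is chosen for illustration):

from sentence_transformers import SentenceTransformer, models

word_embedding_model = models.Transformer('distilroberta-base', max_seq_length=256)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
normalize = models.Normalize()

model = SentenceTransformer(modules=[word_embedding_model, pooling_model, normalize])
emb = model.encode(['An example sentence'])  # unit-length embedding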

v0.3.6 - Update transformers to v3.1.0

11 Sep 08:06

Huggingface Transformers version 3.1.0 introduced a breaking change compared to the previous version 3.0.2.

This release fixes the issue so that Sentence-Transformers is compatible with Huggingface Transformers 3.1.0. Note that this and future versions will not be compatible with transformers < 3.1.0.