
Releases: UKPLab/sentence-transformers

v1.0.4 - Patch CLIPModel.save

01 Apr 06:35

It was not possible to fine-tune and save the CLIPModel. This release fixes that: CLIPModel can now be saved like any other model by calling model.save(path).
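
A minimal sketch of the workflow (the output path is chosen for illustration):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('clip-ViT-B-32')
# ... fine-tune the model on your data ...
model.save('output/my-clip-model')  # saving now works for CLIPModel as well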

v1.0.3 - Patch util.paraphrase_mining

22 Mar 08:15

v1.0.3 - Patch for util.paraphrase_mining method

v1.0.2 - Patch CLIPModel

19 Mar 21:44

v1.0.2 - Patch for CLIPModel, new Image Examples

  • Bugfix in CLIPModel: Inputs that were too long raised a RuntimeError; they are now truncated.
  • New util function: util.paraphrase_mining_embeddings, to find the most similar embeddings in a matrix (see the sketch after this list)
  • Image Clustering and Duplicate Image Detection examples added: more info
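
A minimal sketch of util.paraphrase_mining_embeddings (model name and sentences chosen for illustration); it returns a list of (score, index1, index2) triplets sorted by decreasing similarity:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('stsb-bert-base')
sentences = ['A man is eating food.', 'A man is eating a piece of bread.', 'The girl is carrying a baby.']
embeddings = model.encode(sentences, convert_to_tensor=True)

# Find the most similar embedding pairs within the matrix
pairs = util.paraphrase_mining_embeddings(embeddings)
for score, i, j in pairs:
    print(f"{sentences[i]} <> {sentences[j]}: {score:.4f}")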

v1.0.0 - Improvements, New Models, Text-Image Models

18 Mar 20:57

This release brings many improvements and new features. The version numbering scheme has also been updated: we now use the format x.y.z, where x denotes major releases, y smaller releases with new features, and z bugfixes.

Text-Image-Model CLIP

You can now encode text and images in the same vector space using the OpenAI CLIP Model. You can use the model like this:

from sentence_transformers import SentenceTransformer, util
from PIL import Image

# Load CLIP model
model = SentenceTransformer('clip-ViT-B-32')

# Encode an image
img_emb = model.encode(Image.open('two_dogs_in_snow.jpg'))

# Encode text descriptions
text_emb = model.encode(['Two dogs in the snow', 'A cat on a table', 'A picture of London at night'])

# Compute cosine similarities
cos_scores = util.cos_sim(img_emb, text_emb)
print(cos_scores)

More Information
IPython Demo
Colab Demo

Examples of how to train the CLIP model on your own data will be added soon.

New Models

New Features

  • The Asym model can now be used as the first model in a SentenceTransformer modules list.
  • Changed sorting when encoding: Previously, sentences were encoded from short to long; now they are encoded from long to short. Out-of-memory errors therefore occur at the start, and the estimate of the encoding duration is more precise.
  • Improved util.semantic_search method: It now uses the much faster torch.topk function. Further, you can define which scoring function should be used.
  • New util methods: util.dot_score computes the dot product of two embedding matrices; util.normalize_embeddings normalizes embeddings to unit length (see the sketch after this list).
  • New parameter for the SentenceTransformer.encode method: normalize_embeddings. If set to True, embeddings are normalized to unit length; in that case the faster util.dot_score can be used instead of util.cos_sim to compute cosine similarity scores.
  • If you specify models.Transformer(do_lower_case=True) when creating a new SentenceTransformer, all input will be lower-cased.
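
A short sketch combining these additions (model name and texts chosen for illustration): with unit-length embeddings the faster util.dot_score can stand in for cosine similarity, and util.semantic_search accepts the scoring function as a parameter:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('stsb-bert-base')

queries = ['A man is eating pasta.']
corpus = ['A man is eating food.', 'A monkey is playing drums.', 'A cheetah chases its prey.']

# normalize_embeddings=True returns unit-length vectors
query_emb = model.encode(queries, convert_to_tensor=True, normalize_embeddings=True)
corpus_emb = model.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)

# For unit-length vectors, the dot product equals the cosine similarity
hits = util.semantic_search(query_emb, corpus_emb, score_function=util.dot_score)
for hit in hits[0]:
    print(corpus[hit['corpus_id']], hit['score'])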

New Examples

Bugfixes

  • The encode method now correctly returns token embeddings if output_value='token_embeddings' is specified (see the sketch after this list)
  • Bugfix of the LabelAccuracyEvaluator
  • Bugfix: tensors are no longer moved to the CPU when encode(sent, convert_to_tensor=True) is specified; they now stay on the GPU
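
A minimal sketch of requesting token embeddings (model name chosen for illustration):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('stsb-bert-base')
# Returns one matrix of token embeddings per input sentence
token_embs = model.encode(['Hello World'], output_value='token_embeddings')
print(token_embs[0].shape)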

Breaking changes:

  • SentenceTransformer.encode method: Removed the deprecated parameters is_pretokenized and num_workers

v0.4.1 - Faster Tokenization & Asymmetric Models

04 Jan 14:04

Refactored Tokenization

  • Faster tokenization speed: Using batched tokenization for training & inference - now, all sentences in a batch are tokenized simultaneously.
  • Usage of the SentencesDataset is no longer needed for training. You can pass your train examples directly to the DataLoader:
from torch.utils.data import DataLoader
from sentence_transformers import InputExample

train_examples = [InputExample(texts=['My first sentence', 'My second sentence'], label=0.8),
    InputExample(texts=['Another pair', 'Unrelated sentence'], label=0.3)]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
  • If you use a custom torch Dataset class: The dataset class must now return InputExample objects instead of tokenized texts (see the sketch after this list)
  • The SentenceLabelDataset class has been updated to the new tokenization flow: It always returns two or more InputExamples with the same label
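
A minimal sketch of a custom Dataset that yields InputExample objects (the class name and data layout are illustrative):

from torch.utils.data import Dataset
from sentence_transformers import InputExample

class PairDataset(Dataset):  # hypothetical helper class for illustration
    def __init__(self, sentence_pairs, labels):
        self.sentence_pairs = sentence_pairs
        self.labels = labels

    def __len__(self):
        return len(self.sentence_pairs)

    def __getitem__(self, idx):
        # Return an InputExample instead of pre-tokenized texts
        return InputExample(texts=list(self.sentence_pairs[idx]), label=self.labels[idx])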

Asymmetric Models
Added a new models.Asym class that allows different encoding of sentences based on a tag (e.g. query vs. paragraph). Minimal example:

from torch import nn
from sentence_transformers import SentenceTransformer, models, InputExample

base_model = 'distilroberta-base'  # any Huggingface transformer model works here
word_embedding_model = models.Transformer(base_model, max_seq_length=250)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
d1 = models.Dense(word_embedding_model.get_word_embedding_dimension(), 256, bias=False, activation_function=nn.Identity())
d2 = models.Dense(word_embedding_model.get_word_embedding_dimension(), 256, bias=False, activation_function=nn.Identity())
asym_model = models.Asym({'QRY': [d1], 'DOC': [d2]})
model = SentenceTransformer(modules=[word_embedding_model, pooling_model, asym_model])

# Your input examples have to look like this:
inp_example = InputExample(texts=[{'QRY': 'your query'}, {'DOC': 'your document text'}], label=1)

# Encoding (note: mixed inputs are not allowed)
model.encode([{'QRY': 'your query1'}, {'QRY': 'your query2'}])

Inputs that have the key 'QRY' will be passed through the d1 dense layer, while inputs with the key 'DOC' will be passed through the d2 dense layer.
More documentation on how to design asymmetric models will follow soon.

New Namespace & Models for Cross-Encoder
Cross-Encoders are now hosted at https://huggingface.co/cross-encoder. Also, new pre-trained models have been added for NLI & QNLI.

Logging
Log messages now use a custom logger from logging, thanks to PR #623. This allows you to choose which log messages you want to see from which components.
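
A sketch of tuning the log level per component with the standard logging module (the sub-logger names are assumed to follow the package layout):

import logging

# Show INFO messages from sentence-transformers as a whole ...
logging.basicConfig(format='%(asctime)s - %(message)s', level=logging.WARNING)
logging.getLogger('sentence_transformers').setLevel(logging.INFO)
# ... but silence a specific component, e.g. the evaluation module
logging.getLogger('sentence_transformers.evaluation').setLevel(logging.ERROR)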

Unit tests
A lot more unit tests have been added, which test the different components of the framework.

v0.4.0 - Upgrade Transformers Version

22 Dec 13:42
  • Updated the dependencies so that the package works with Huggingface Transformers version 4. Sentence-Transformers still works with Huggingface Transformers version 3, but an update to version 4 is recommended. Future changes might break compatibility with Transformers version 3.
  • New naming of pre-trained models. Models are named {task}-{transformer_model}, so 'bert-base-nli-stsb-mean-tokens' becomes 'stsb-bert-base'. Models are still available under their old names, but newer models will follow the updated naming scheme (see the sketch after this list).
  • New application examples for information retrieval and question answering retrieval, together with respective pre-trained models.
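
A small sketch of the renamed model scheme: both identifiers refer to the same pre-trained model, since old names remain available:

from sentence_transformers import SentenceTransformer

# Old naming, still available
model_old = SentenceTransformer('bert-base-nli-stsb-mean-tokens')
# New {task}-{transformer_model} naming
model_new = SentenceTransformer('stsb-bert-base')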

v0.3.9 - Small updates

18 Nov 08:25

This release only includes some smaller updates:

  • The code was tested with transformers 3.5.1; the requirement was updated so that it works with transformers 3.5.1
  • As some parts and models require PyTorch >= 1.6.0, the requirement was updated to require at least PyTorch 1.6.0. Most of the code and models will still work with older PyTorch versions.
  • model.encode() stored the embeddings on the GPU, which required quite a lot of GPU memory when encoding millions of sentences. The embeddings are now moved to the CPU once they are computed.
  • The CrossEncoder class now accepts a max_length parameter to control the truncation of inputs
  • The CrossEncoder predict method now has an apply_softmax parameter that applies softmax on top of a multi-class output (see the sketch below)
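
A minimal sketch of both parameters (the model name, taken from the cross-encoder namespace, and the input pair are chosen for illustration):

from sentence_transformers import CrossEncoder

# Truncate inputs to at most 512 tokens
model = CrossEncoder('cross-encoder/nli-distilroberta-base', max_length=512)

pairs = [('A man is eating pizza', 'A man eats something')]
# apply_softmax turns the multi-class logits into probabilities
scores = model.predict(pairs, apply_softmax=True)
print(scores)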

v0.3.8 - CrossEncoder, Data Augmentation, new Models

19 Oct 14:23
  • Added support for training and using CrossEncoders
  • Added the data augmentation method AugSBERT
  • New models trained on large-scale paraphrase data. They perform much better on internal benchmarks than previous models: distilroberta-base-paraphrase-v1 and xlm-r-distilroberta-base-paraphrase-v1
  • New model for information retrieval trained on MS MARCO: distilroberta-base-msmarco-v1
  • Improved MultipleNegativesRankingLoss: The similarity function can be changed and is now cosine similarity (previously dot product); further, similarity scores can be multiplied by a scaling factor. This allows the usage of NTXentLoss / InfoNCE loss (see the sketch after this list).
  • New MegaBatchMarginLoss, inspired by the ParaNMT paper.
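
A sketch of the updated loss (model name and values chosen for illustration, assuming the keyword names scale and similarity_fct):

from sentence_transformers import SentenceTransformer, losses, util

model = SentenceTransformer('distilroberta-base-paraphrase-v1')
# Cosine similarity (the new default) with a scale factor (1 / temperature);
# this corresponds to the NTXentLoss / InfoNCE formulation
train_loss = losses.MultipleNegativesRankingLoss(model, scale=20.0, similarity_fct=util.pytorch_cos_sim)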

Smaller changes:

  • Updated the InformationRetrievalEvaluator so that it can work with large corpora (millions of entries). Removed the query_chunk_size parameter from the evaluator.
  • The SentenceTransformer.encode method detaches tensors from the compute graph
  • SentenceTransformer.fit() method: The output_path_ignore_not_empty parameter is deprecated; the target folder is no longer required to be empty

v0.3.7 - Upgrade transformers, Model Distillation Example, Multi-Input to Transformers Model

29 Sep 20:17
  • Upgraded the transformers dependency; transformers 3.1.0, 3.2.0 and 3.3.1 are working
  • Added example code for model distillation: Sentence embedding models can be drastically reduced to e.g. only 2-4 layers while keeping 98+% of their performance. The code can be found in examples/training/distillation
  • Transformer models can now accept two inputs ['sentence 1', 'context for sent1'], which are encoded as the two inputs for BERT.

Minor changes:

  • Tokenization in the multi-process encoding setup now happens in the child processes, not in the parent process.
  • Added models.Normalize() to allow normalization of embeddings to unit length (see the sketch below)
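
A minimal sketch that adds models.Normalize() as the last module so the produced sentence embeddings have unit length (the base model is chosen for illustration):

from sentence_transformers import SentenceTransformer, models

word_embedding_model = models.Transformer('distilroberta-base', max_seq_length=256)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
normalize = models.Normalize()

model = SentenceTransformer(modules=[word_embedding_model, pooling_model, normalize])
emb = model.encode(['An example sentence'])  # unit-length embedding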

v0.3.6 - Update transformers to v3.1.0

11 Sep 08:06

Huggingface Transformers version 3.1.0 introduced a breaking change compared to the previous version 3.0.2.

This release fixes the issue so that Sentence-Transformers is compatible with Huggingface Transformers 3.1.0. Note that this and future versions will not be compatible with transformers < 3.1.0.