
v0.3.3 - Multi-Process Tokenization and Information Retrieval Improvements

@nreimers nreimers released this 06 Aug 08:16

New Functions

  • Multi-process tokenization (Linux only) for the model's encode function, giving a significant speed-up when encoding large sets of sentences
  • Tokenization of datasets for training can now run in parallel (Linux only)
  • New example for Quora Duplicate Questions Retrieval: see the examples folder
  • Many small improvements for training better models for Information Retrieval
  • Fixed LabelSampler (used to create batches with a certain number of matching labels, e.g. for BatchHardTripletLoss) and moved it to DatasetFolder
  • Added new Evaluators for ParaphraseMining and InformationRetrieval
  • evaluation.BinaryEmbeddingSimilarityEvaluator no longer assumes a 50-50 split of the dataset. It now computes the optimal threshold and measures the accuracy
  • model.encode: when the convert_to_numpy parameter is set, the method returns a numpy matrix instead of a list of numpy vectors
  • New function: util.paraphrase_mining to perform paraphrase mining in a corpus. For an example see examples/training_quora_duplicate_questions/
  • New function: util.information_retrieval to perform information retrieval / semantic search in a corpus. For an example see examples/training_quora_duplicate_questions/
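The multi-process tokenization above follows the general pattern of fanning sentence tokenization out over worker processes. A minimal sketch of that pattern with Python's standard multiprocessing module (the whitespace tokenizer here is a toy stand-in for the model's real tokenizer, and the function names are illustrative, not the library's API):

```python
# Sketch: parallel tokenization with a process pool. Fork-based process
# start (the default on Linux) is what makes this cheap, which is why the
# feature is Linux only.
from multiprocessing import Pool

def tokenize(sentence):
    # Toy stand-in for the model's tokenizer
    return sentence.lower().split()

def tokenize_corpus(sentences, processes=4):
    # Distribute sentences over worker processes in chunks
    with Pool(processes=processes) as pool:
        return pool.map(tokenize, sentences, chunksize=256)

if __name__ == "__main__":
    corpus = ["A man eats food.", "The dog runs."] * 1000
    tokens = tokenize_corpus(corpus)
    print(tokens[0])  # ['a', 'man', 'eats', 'food.']
```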
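To illustrate the threshold search that BinaryEmbeddingSimilarityEvaluator now performs instead of assuming a 50-50 split, here is a hypothetical standalone helper (not the library's implementation) that scans candidate thresholds over pair scores and picks the accuracy-optimal one:

```python
import numpy as np

def best_accuracy_threshold(scores, labels):
    """Return (threshold, accuracy) maximizing classification accuracy.

    scores: similarity scores for sentence pairs
    labels: 1 if the pair is a match/paraphrase, 0 otherwise
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    # Sort by score: testing one midpoint between each adjacent pair of
    # scores enumerates every distinct classification outcome.
    order = np.argsort(scores)
    scores, labels = scores[order], labels[order]
    # Start with a threshold below all scores (predict everything positive)
    best_t, best_acc = scores[0] - 1e-9, (labels == 1).mean()
    for i in range(len(scores) - 1):
        t = (scores[i] + scores[i + 1]) / 2  # midpoint threshold
        acc = ((scores > t) == (labels == 1)).mean()
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

t, acc = best_accuracy_threshold([0.1, 0.3, 0.6, 0.8, 0.9], [0, 0, 1, 1, 1])
print(t, acc)  # 0.45 1.0
```

Note the dataset here is deliberately imbalanced-friendly: the threshold adapts to wherever the positive and negative scores separate, rather than assuming half the pairs are positive.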
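The idea behind util.paraphrase_mining and util.information_retrieval can be sketched with plain cosine similarity: score all sentence pairs (or all query-vs-corpus pairs) and keep the top hits. The toy vectors below stand in for embeddings that would normally come from model.encode; the helper names are illustrative, not the library's API:

```python
import numpy as np

def cosine_sim_matrix(a, b):
    # Normalize rows, then a dot product gives pairwise cosine similarity
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def mine_paraphrases(embeddings, top_k=1):
    """Return the top_k highest-scoring sentence pairs as (score, i, j), i < j."""
    sim = cosine_sim_matrix(embeddings, embeddings)
    pairs = [(sim[i, j], i, j)
             for i in range(len(sim)) for j in range(i + 1, len(sim))]
    return sorted(pairs, reverse=True)[:top_k]

def semantic_search(query_emb, corpus_emb, top_k=2):
    """Return indices of the top_k most similar corpus entries per query."""
    sim = cosine_sim_matrix(query_emb, corpus_emb)
    return np.argsort(-sim, axis=1)[:, :top_k]

corpus = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
print(mine_paraphrases(corpus))                          # pair (0, 1) scores highest
print(semantic_search(np.array([[0.1, 1.0]]), corpus))   # corpus entry 2 ranks first
```

The real utilities avoid materializing the full n-by-n similarity matrix for large corpora by chunking; this sketch keeps everything in memory for clarity.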

Breaking Changes

  • The evaluators (like EmbeddingSimilarityEvaluator) no longer accept a DataLoader as argument. Instead, the sentences and scores are passed directly. Old code that uses the previous evaluators must be updated; it can use the class method from_input_examples(). See examples/training_transformers/training_nli.py for how to use the new evaluators.
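A minimal sketch of the new evaluator pattern, showing how a from_input_examples() class method unpacks a list of InputExample objects into the direct sentences-and-scores form. The class names mirror the library, but this standalone code is illustrative only, not the library's implementation:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class InputExample:
    texts: List[str]    # a sentence pair
    label: float = 0.0  # gold similarity score

class EmbeddingSimilarityEvaluator:
    # New style: sentences and scores are passed directly, no DataLoader
    def __init__(self, sentences1, sentences2, scores):
        self.sentences1 = sentences1
        self.sentences2 = sentences2
        self.scores = scores

    @classmethod
    def from_input_examples(cls, examples):
        # Unpack InputExamples into the direct-argument form
        s1 = [ex.texts[0] for ex in examples]
        s2 = [ex.texts[1] for ex in examples]
        scores = [ex.label for ex in examples]
        return cls(s1, s2, scores)

examples = [InputExample(texts=["A man eats.", "Someone is eating."], label=0.9)]
evaluator = EmbeddingSimilarityEvaluator.from_input_examples(examples)
print(evaluator.scores)  # [0.9]
```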