Releases: ddangelov/Top2Vec
updated code documentation
1.0.13 update Top2Vec version
added pre-trained universal sentence encoder and BERT sentence transformer options
Top2Vec now has an option to choose the embedding model, with doc2vec, universal-sentence-encoder, universal-sentence-encoder-multilingual, and distiluse-base-multilingual-cased as the options.
A get_documents_topics method was added.
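As a rough illustration of choosing an embedding model, the sketch below checks a name against the four options listed in this release before building a model. The helpers `check_embedding_model` and `build_model` are hypothetical. `build_model` assumes the top2vec package is installed and that the constructor accepts an `embedding_model` keyword, and it only imports the library when it is actually called.

```python
# The four option names are taken from the release note above.
EMBEDDING_MODELS = (
    "doc2vec",
    "universal-sentence-encoder",
    "universal-sentence-encoder-multilingual",
    "distiluse-base-multilingual-cased",
)


def check_embedding_model(name):
    """Return name unchanged if it is one of the listed options, else raise."""
    if name not in EMBEDDING_MODELS:
        raise ValueError(
            f"unknown embedding model {name!r}; expected one of {EMBEDDING_MODELS}"
        )
    return name


def build_model(documents, embedding_model):
    """Hypothetical helper: train a Top2Vec model with a validated option.

    Assumes the top2vec package is installed; the import is deferred so the
    validation above can be used without it.
    """
    check_embedding_model(embedding_model)
    from top2vec import Top2Vec  # deferred import, assumed available

    return Top2Vec(documents, embedding_model=embedding_model)
```

Validating the name up front means a typo fails immediately rather than after a pre-trained model has started loading.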
added delete_documents methods and bug fixes
Added a method for deleting documents from the model. Fixed a bug when using corpus_file that resulted in documents getting dropped. Fixed a bug when using add_documents and delete_documents that resulted in improper ordering of topic words.
UMAP install bug fix
There was an issue with the UMAP install due to a missing comma in the setup.py file; this has been fixed. An optional min_count parameter has been added; the default is still 50. All words with a total frequency lower than min_count are ignored by the model.
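To make the effect of min_count concrete, here is a minimal sketch of a frequency threshold over a tokenized corpus. The function `filter_vocabulary` is hypothetical and is not the library's implementation; it only illustrates the rule that words with a total frequency below the threshold are dropped.

```python
from collections import Counter


def filter_vocabulary(tokenized_docs, min_count=50):
    """Return the set of words whose total frequency is at least min_count.

    tokenized_docs is an iterable of token lists; counts are summed over
    the whole corpus, not per document.
    """
    totals = Counter(word for doc in tokenized_docs for word in doc)
    return {word for word, count in totals.items() if count >= min_count}
```

On a small corpus the default of 50 can remove most of the vocabulary, so a lower value may be needed there.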
Hierarchical Topic Reduction
Added functionality to perform hierarchical topic reduction. Added the ability to add new documents to an already trained model. Added a use_corpus option, which may lead to faster training with very large datasets in multi-worker environments.
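The release note does not describe how the reduction works. As an illustration only, the sketch below assumes a greedy scheme: the smallest topic is repeatedly merged into its most similar topic (by cosine similarity), and the merged vector is the mean of the two weighted by topic size. The function `reduce_topics` and its return shape are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def reduce_topics(topic_vectors, topic_sizes, num_topics):
    """Greedily merge the smallest topic into its most similar topic.

    Returns (vectors, sizes, groups), where groups[i] lists the original
    topic indices that ended up in reduced topic i.
    """
    if num_topics < 1:
        raise ValueError("num_topics must be at least 1")

    vectors = [list(v) for v in topic_vectors]
    sizes = list(topic_sizes)
    groups = [[i] for i in range(len(vectors))]

    while len(vectors) > num_topics:
        # Pick the topic with the fewest documents.
        small = min(range(len(sizes)), key=sizes.__getitem__)
        # Find the most similar of the remaining topics.
        target = max(
            (i for i in range(len(vectors)) if i != small),
            key=lambda i: cosine(vectors[small], vectors[i]),
        )
        # Merged vector: mean of the two vectors weighted by topic size.
        total = sizes[small] + sizes[target]
        vectors[target] = [
            (sizes[small] * s + sizes[target] * t) / total
            for s, t in zip(vectors[small], vectors[target])
        ]
        sizes[target] = total
        groups[target].extend(groups[small])
        # Drop the topic that was merged away.
        for seq in (vectors, sizes, groups):
            del seq[small]

    return vectors, sizes, groups
```

Keeping the `groups` mapping makes it possible to trace each reduced topic back to the original topics it absorbed.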
Custom document ids, tokenizer input, option to save documents
Added an option for custom document ids; these can be string or int. Added an option to not save documents in the model, which allows the trained model to be used as an index and makes saved models smaller. Added the ability to pass in a custom tokenizer that overrides the default. Added a verbose mode that logs the status of training. Also added the ability to search documents by multiple documents, as well as positive and negative semantic search.
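One plausible way to picture a search by multiple documents with positive and negative examples is to add the vectors of the positive documents, subtract those of the negative documents, and rank the remaining documents by cosine similarity to the result. The sketch below does that over a plain dictionary keyed by custom document ids (string or int). The function `search_by_documents` and the way the query is combined are assumptions for illustration, not the library's implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search_by_documents(doc_vectors, positive_ids, negative_ids=(), num_docs=5):
    """Rank documents toward the positive examples and away from the negative ones.

    doc_vectors maps a document id (str or int) to its vector. The query
    documents themselves are excluded from the results.
    """
    dim = len(next(iter(doc_vectors.values())))
    query = [0.0] * dim
    for doc_id in positive_ids:
        query = [q + x for q, x in zip(query, doc_vectors[doc_id])]
    for doc_id in negative_ids:
        query = [q - x for q, x in zip(query, doc_vectors[doc_id])]

    exclude = set(positive_ids) | set(negative_ids)
    scored = [
        (doc_id, cosine(query, vec))
        for doc_id, vec in doc_vectors.items()
        if doc_id not in exclude
    ]
    # Sort on the score only, so mixed str/int ids are never compared.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:num_docs]
```

Because only the vectors and ids are needed, a search of this kind still works when the documents themselves are not stored with the model.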
Topic size and deduplication
Topic size is defined as the number of document vectors that have the topic as their nearest topic vector. Search by topic has been modified to show only documents that have the topic as their nearest topic, in order to avoid overlapping results from similar topics.
Topic deduplication was added to make topics more robust.
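The topic size definition above can be sketched in a few lines. This is an illustration under the assumption that "nearest" means highest cosine similarity; the helper names are hypothetical and the code is not the library's implementation.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_topic(doc_vector, topic_vectors):
    """Index of the topic vector most similar to doc_vector."""
    return max(
        range(len(topic_vectors)),
        key=lambda t: cosine(doc_vector, topic_vectors[t]),
    )


def topic_sizes(doc_vectors, topic_vectors):
    """Number of documents whose nearest topic is each topic."""
    counts = Counter(nearest_topic(v, topic_vectors) for v in doc_vectors)
    return [counts[t] for t in range(len(topic_vectors))]


def documents_for_topic(doc_vectors, topic_vectors, topic):
    """Indices of documents whose nearest topic is `topic`, most similar first."""
    members = [
        i for i, v in enumerate(doc_vectors)
        if nearest_topic(v, topic_vectors) == topic
    ]
    return sorted(
        members,
        key=lambda i: cosine(doc_vectors[i], topic_vectors[topic]),
        reverse=True,
    )
```

Since every document is assigned to exactly one topic, the results returned for two similar topics cannot overlap, and the sizes sum to the number of documents.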
First Release
Top2Vec initial release.