When I save a model with pytorch serialization and then use the loaded model to transform the training data, the new topic assignments differ from the "old" topic assignments.
Reproduction
from bertopic import BERTopic
import numpy as np

topic_model_new = BERTopic.load("model")

# old topic assignment
new_df = topic_model_new.get_document_info(abstracts)

# new topic assignment
topics, probs = topic_model_new.transform(abstracts, embeddings)

(new_df['Topic'] == np.array(topics)).value_counts()
Thank you for reaching out. This is expected behavior: when you save a model with pytorch serialization, the underlying dimensionality reduction and clustering models are removed from the model. To still allow inference, a different technique is used to assign documents to topics, namely cosine similarity between document embeddings and topic embeddings.
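To make the fallback concrete, here is a minimal sketch of topic assignment via cosine similarity. This is an illustration of the general technique, not BERTopic's exact implementation; the function name and the toy embeddings are made up for the example.

```python
import numpy as np

def assign_topics(doc_embeddings, topic_embeddings):
    """Assign each document to its most similar topic (hypothetical helper)."""
    # Normalize rows so the dot product equals cosine similarity.
    d = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    t = topic_embeddings / np.linalg.norm(topic_embeddings, axis=1, keepdims=True)
    sims = d @ t.T                # shape: (n_docs, n_topics)
    return sims.argmax(axis=1)    # index of the most similar topic per document

# Toy 2-D embeddings: two orthogonal topics, three documents.
docs = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.2]])
topic_embs = np.array([[1.0, 0.0], [0.0, 1.0]])
print(assign_topics(docs, topic_embs))  # [0 1 0]
```

Because this assignment rule differs from the original clustering model's decision boundaries, documents near a boundary can legitimately end up in a different topic than during training.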
Do note that something similar can happen even with pickle serialization, because HDBSCAN uses an approximation during inference that is likely to differ from its results during training.
BERTopic Version
0.16.3