Different topic assignment on training data when using saved model #2140

tmtsmrsl · 2024-09-08T08:19:34Z

Have you searched existing issues? 🔎

I have searched and found no existing issues

Describe the bug

When I save a model with pytorch serialization and then use the loaded model to transform the training data, the new topic assignments differ from the "old" topic assignments.

Reproduction

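import numpy as np
from bertopic import BERTopic

# `abstracts` (the training documents) and `embeddings` (their precomputed embeddings) are defined elsewhere
topic_model_new = BERTopic.load("model")
# old topic assignment (as stored in the saved model)
new_df = topic_model_new.get_document_info(abstracts)
# new topic assignment
topics, probs = topic_model_new.transform(abstracts, embeddings)
# count how many documents keep the same topic
(new_df['Topic'] == np.array(topics)).value_counts()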

Topic
True 1168
False 157
Name: count, dtype: int64

BERTopic Version

0.16.3

MaartenGr · 2024-09-11T11:44:37Z

Thank you for reaching out. This is expected behavior: when you save a model using pytorch serialization, the underlying dimensionality reduction and clustering models are removed from the model. To still support inference, a different technique is used to assign documents to topics, namely cosine similarity between document and topic embeddings.

Do note that something similar can happen even when you use pickle, because HDBSCAN does an approximation during inference and its results are likely to differ from those produced during training.
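
The cosine-similarity assignment mentioned in the reply can be illustrated with a minimal sketch. This is not BERTopic's actual code: the helper names `cosine_similarity` and `assign_topics` and the toy 2-D vectors are hypothetical, and the returned values are positions in the topic list rather than BERTopic topic IDs. It only shows the idea of giving each document the topic whose embedding is closest, which need not match the cluster label produced during training.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def assign_topics(doc_embeddings, topic_embeddings):
    """Return, for each document, the index of the most similar topic embedding."""
    topics = []
    for doc in doc_embeddings:
        sims = [cosine_similarity(doc, topic) for topic in topic_embeddings]
        # argmax over the similarities
        topics.append(max(range(len(sims)), key=sims.__getitem__))
    return topics


# Toy data for illustration only (hypothetical, not from the issue)
topic_embeddings = [[1.0, 0.0], [0.0, 1.0]]
doc_embeddings = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.5]]

assigned = assign_topics(doc_embeddings, topic_embeddings)
print(assigned)  # [0, 1, 0]
```

The third toy document lies close to the boundary between the two topics. A nearest-embedding rule always picks whichever topic is marginally closer, while a density-based clustering step may have labeled such a document differently during training.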

tmtsmrsl added the "bug" label on Sep 8, 2024

MaartenGr removed the "bug" label on Sep 11, 2024
