Different topic assignment on training data when using saved model #2140

Open
1 task done
tmtsmrsl opened this issue Sep 8, 2024 · 1 comment


tmtsmrsl commented Sep 8, 2024

Have you searched existing issues? 🔎

  • I have searched and found no existing issues

Describe the bug

When I save a model with pytorch serialization, load it, and then use it to transform the training data, the new topic assignments differ from the "old" topic assignments stored in the model.

Reproduction

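import numpy as np
from bertopic import BERTopic

# abstracts: the training documents; embeddings: their precomputed embeddings
topic_model_new = BERTopic.load("model")
# old topic assignment (as stored in the loaded model)
new_df = topic_model_new.get_document_info(abstracts)
# new topic assignment (recomputed by transform)
topics, probs = topic_model_new.transform(abstracts, embeddings)
# count how many documents keep the same topic
(new_df['Topic'] == np.array(topics)).value_counts()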

Topic
True 1168
False 157
Name: count, dtype: int64

BERTopic Version

0.16.3

@tmtsmrsl tmtsmrsl added the bug Something isn't working label Sep 8, 2024
@MaartenGr
Owner

Thank you for reaching out. This is expected behavior: when you save a model using pytorch, the underlying dimensionality reduction and clustering models are removed from the model. To still support inference, a different technique is used to assign documents to topics (cosine similarity between document and topic embeddings).

Do note that something similar can happen even when you use pickle, because HDBSCAN does an approximation during inference, so its predictions are likely to differ from its results during training.
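
The cosine-similarity assignment mentioned above can be sketched roughly as follows. This is only an illustration with NumPy and toy data, not BERTopic's actual code: the function name `assign_topics` and the example arrays are hypothetical, and details such as outlier handling are left out.

```python
import numpy as np

def assign_topics(doc_embeddings, topic_embeddings):
    """Assign each document to the topic with the most cosine-similar embedding."""
    # Normalize rows to unit length so the dot product equals cosine similarity.
    docs = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    topics = topic_embeddings / np.linalg.norm(topic_embeddings, axis=1, keepdims=True)
    sim = docs @ topics.T  # shape: (n_docs, n_topics)
    # Index of the best-matching topic and its similarity score per document.
    return sim.argmax(axis=1), sim.max(axis=1)

# Toy data (hypothetical): two topic embeddings and three document embeddings.
topic_embeddings = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_embeddings = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.5]])

assigned, scores = assign_topics(doc_embeddings, topic_embeddings)
print(assigned)
```

Because this nearest-topic rule never consults the original clustering model, a document near a cluster boundary can land in a different topic than the one it received during training, which is consistent with the mismatches reported above.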

@MaartenGr MaartenGr removed the bug Something isn't working label Sep 11, 2024