Nice catch! It indeed seems that the code at `BERTopic/bertopic/_bertopic.py`, line 2967 (commit `d665d3f`) …
Hi MaartenGr,
I have noticed something that I am having trouble understanding and am wondering if you can help. When using guided topic modelling with `seed_topic_list` and pre-calculated embeddings, the `embeddings` object in my environment is actually changed after fitting the model, even though it is not reassigned anywhere in my code (as far as I can tell). I am relatively new to Python (I come from R land), but how is this happening? I would have thought that when calling

`topics, probs = topic_model.fit_transform(docs, embeddings)`

the only objects that could change in my environment were `topic_model`, `topics`, and `probs`.

I'm guessing this happens because guided topic modelling nudges the embeddings in different directions based on the seeds, and it is important to have those changed embeddings if they are later supplied to another function like `visualise_documents`. Is that correct? I have had a look at the code but still don't understand exactly how my `embeddings` object is being changed.

The reason I discovered this is that I was lazily refitting the model without reinitialising `BERTopic()` to investigate the variability in UMAP, expecting it to simply retrain the model from scratch, but I got very different document visualisations after the first and second fittings.

Initial fit:
2nd fit:
(note: this is not the newsgroups data)
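Coming back to how `embeddings` could change without reassignment: I think the key difference from R is that NumPy arrays are passed by reference, so a function can modify the caller's array in place. Here is a minimal sketch of what I suspect happens (the averaging step, and the `fit_like` function itself, are my guesses at how the seeds nudge the embeddings, not BERTopic's actual code):

```python
import numpy as np

def fit_like(embeddings, seed_embedding, seeded_idx):
    # Hypothetical stand-in for fit_transform: nudge the embeddings of
    # documents matching a seed topic toward the seed embedding.
    # Assigning into the array mutates the caller's object in place,
    # so nothing needs to be returned or reassigned.
    embeddings[seeded_idx] = (embeddings[seeded_idx] + seed_embedding) / 2

embeddings = np.array([[1.0, 0.0],
                       [0.0, 1.0]])
before = embeddings.copy()

fit_like(embeddings, seed_embedding=np.array([1.0, 1.0]), seeded_idx=[1])

print(np.array_equal(embeddings, before))  # prints False: mutated in place
```

If that is the mechanism, passing `embeddings.copy()` to `fit_transform` should leave the original array in my environment untouched.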
The second one is actually very pretty, so I am wondering what's going on here. Is each subsequent training causing the document embeddings to converge around the seeds, leading to less local variation, which means UMAP has an easier time evenly distributing the topics among a smaller set of dimensions?
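If my guess about the nudging is right, it would also explain why refits differ: each refit averages the already-nudged embeddings toward the seeds again, so they converge. A toy sketch of that compounding (again assuming simple repeated in-place averaging, which is only my guess at the mechanism):

```python
import numpy as np

seed = np.array([1.0, 1.0])        # made-up seed topic embedding
embedding = np.array([4.0, -2.0])  # made-up document embedding

# Each "refit" averages with the seed again, because fit N+1 receives
# the already-nudged output of fit N. The distance to the seed halves
# every iteration, so the documents collapse toward their seeds.
for n in range(1, 4):
    embedding = (embedding + seed) / 2
    print(n, np.linalg.norm(embedding - seed))
```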
The other curious thing is that the second fitting takes substantially longer to run than the first (2-10x longer with 40k docs).