When you use .update_topics you can specify the types of representations that you want. Since you didn't specify anything, it uses the default c-TF-IDF. Instead, you would have to use it like so:
Have you searched existing issues? 🔎
Describe the bug
Whenever I perform outlier reduction (regardless of the outlier reduction strategy), meaningless feature words such as 'the', 'end', and 'of' frequently appear among the top 10 feature words for each topic. Why is this happening? When I don't perform outlier reduction, the representative feature words are displayed correctly.
Reproduction
from bertopic import BERTopic
from bertopic.representation import MaximalMarginalRelevance
from bertopic.vectorizers import ClassTfidfTransformer
from hdbscan import HDBSCAN
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP

# Step 1 - Embed documents
embedding_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embeddings = embedding_model.encode(data['document'], show_progress_bar=False)

# Step 2 - Reduce dimensionality
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)

# Step 3 - Cluster reduced embeddings
# cluster_model = KMeans(n_clusters=15)
hdbscan_model = HDBSCAN(min_cluster_size=10, metric='euclidean', cluster_selection_method='eom', prediction_data=True, min_samples=5)

# Step 4 - Tokenize topics
vectorizer_model = CountVectorizer(ngram_range=(1, 1), stop_words="english", min_df=2)

# Step 5 - Create topic representation
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)
representation_model = MaximalMarginalRelevance(diversity=0.2, top_n_words=15)
# representation_model = KeyBERTInspired()

topic_model = BERTopic(
    top_n_words=16,
    embedding_model=embedding_model,
    representation_model=representation_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    ctfidf_model=ctfidf_model,
    calculate_probabilities=True,
    nr_topics="auto",
)

topics, probs = topic_model.fit_transform(data['document'], embeddings)
print(topic_model.get_topic_info())

new_topics = topic_model.reduce_outliers(data['document'], topics, strategy="c-tf-idf", threshold=0.1)
new_topics = topic_model.reduce_outliers(data['document'], new_topics, strategy="distributions")
topic_model.update_topics(data['document'], topics=new_topics)
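For context on the first reduction step: the `c-tf-idf` strategy roughly reassigns each outlier document to the topic whose c-TF-IDF vector it is most similar to, keeping it an outlier when no similarity clears `threshold`. A minimal sketch with made-up vectors (illustrative only, not BERTopic's actual implementation):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Made-up c-TF-IDF vectors over a 3-term vocabulary: rows are
# outlier documents and topics respectively (purely illustrative).
doc_vectors = np.array([[0.9, 0.1, 0.0],   # close to topic 0
                        [0.0, 0.0, 1.0]])  # matches no topic
topic_vectors = np.array([[1.0, 0.0, 0.0],
                          [0.0, 1.0, 0.0]])

sims = cosine_similarity(doc_vectors, topic_vectors)
threshold = 0.1

# Reassign each document to its most similar topic, or keep it an
# outlier (-1) when the best similarity does not clear the threshold.
best = sims.argmax(axis=1)
assigned = np.where(sims.max(axis=1) >= threshold, best, -1)
```

The reassignment itself does not touch the topic representations; those only change once `update_topics` recomputes them afterwards.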
BERTopic Version
1.6.1