
outlier reduction #2148

Open · 1 task done
Hanqingxu123 opened this issue Sep 13, 2024 · 2 comments

Labels
bug Something isn't working

Comments

@Hanqingxu123

Have you searched existing issues? 🔎

  • I have searched and found no existing issues

Describe the bug

[Attached image: newplot (3)]
Whenever I perform outlier reduction (regardless of the outlier reduction strategy), meaningless stop words such as 'the', 'end', and 'of' frequently appear among the top 10 feature words for each topic. Why is this happening? When I don't perform outlier reduction, the representative feature words are displayed correctly.

Reproduction

from bertopic import BERTopic
from bertopic.representation import MaximalMarginalRelevance
from bertopic.vectorizers import ClassTfidfTransformer
from hdbscan import HDBSCAN
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP

# Step 1 - Embed documents
embedding_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embeddings = embedding_model.encode(data['document'], show_progress_bar=False)

# Step 2 - Reduce dimensionality
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)

# Step 3 - Cluster reduced embeddings
# cluster_model = KMeans(n_clusters=15)
hdbscan_model = HDBSCAN(min_cluster_size=10, metric='euclidean',
                        cluster_selection_method='eom', prediction_data=True, min_samples=5)

# Step 4 - Tokenize topics
vectorizer_model = CountVectorizer(ngram_range=(1, 1), stop_words="english", min_df=2)

# Step 5 - Create topic representation
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)
representation_model = MaximalMarginalRelevance(diversity=0.2, top_n_words=15)
# representation_model = KeyBERTInspired()

topic_model = BERTopic(
    top_n_words=16,
    embedding_model=embedding_model,
    representation_model=representation_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    ctfidf_model=ctfidf_model,
    calculate_probabilities=True,
    nr_topics="auto",
)
topics, probs = topic_model.fit_transform(data['document'], embeddings)

print(topic_model.get_topic_info())

# Reduce outliers with two chained strategies
new_topics = topic_model.reduce_outliers(data['document'], topics, strategy="c-tf-idf", threshold=0.1)
new_topics = topic_model.reduce_outliers(data['document'], new_topics, strategy="distributions")

topic_model.update_topics(data['document'], topics=new_topics)
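Since calculate_probabilities=True, the probs returned by fit_transform could also drive the probabilities strategy of reduce_outliers; a minimal sketch along the same lines:

new_topics = topic_model.reduce_outliers(data['document'], topics, probabilities=probs, strategy="probabilities")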

BERTopic Version

1.6.1

@Hanqingxu123 Hanqingxu123 added the bug Something isn't working label Sep 13, 2024
@MaartenGr (Owner)

When you use .update_topics, you can specify the types of representations that you want. Since you didn't specify anything, it uses the default c-TF-IDF representation. Instead, you would have to use it like so:

topic_model.update_topics(data['document'], topics=new_topics, representation_model=representation_model, ctfidf_model=ctfidf_model)
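For the same reason, update_topics falls back to a default CountVectorizer when none is given, so it is presumably also worth passing the vectorizer_model (which carries stop_words="english"); a sketch reusing the objects from the reproduction above:

topic_model.update_topics(
    data['document'],
    topics=new_topics,
    vectorizer_model=vectorizer_model,  # carries stop_words="english"
    ctfidf_model=ctfidf_model,
    representation_model=representation_model,
)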

@Hanqingxu123 (Author)

Thanks for your reply!
