
outlier reduction #2148

Open · 1 task done
Hanqingxu123 opened this issue Sep 13, 2024 · 2 comments

Labels
bug Something isn't working

Comments

@Hanqingxu123

Have you searched existing issues? 🔎

  • I have searched and found no existing issues

Describe the bug

[Attached image: newplot (3)]
Whenever I perform outlier reduction (regardless of the outlier reduction strategy), meaningless stop words such as 'the', 'end', and 'of' frequently appear among the top 10 feature words for each topic. Why is this happening? When I don't perform outlier reduction, the representative feature words are displayed correctly.

Reproduction

from bertopic import BERTopic
from bertopic.representation import MaximalMarginalRelevance
from bertopic.vectorizers import ClassTfidfTransformer
from hdbscan import HDBSCAN
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP

# Step 1 - Embed documents
embedding_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embeddings = embedding_model.encode(data['document'], show_progress_bar=False)

# Step 2 - Reduce dimensionality
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)

# Step 3 - Cluster reduced embeddings
# cluster_model = KMeans(n_clusters=15)
hdbscan_model = HDBSCAN(min_cluster_size=10, metric='euclidean',
                        cluster_selection_method='eom', prediction_data=True, min_samples=5)

# Step 4 - Tokenize topics
vectorizer_model = CountVectorizer(ngram_range=(1, 1), stop_words="english", min_df=2)

# Step 5 - Create topic representation
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)
representation_model = MaximalMarginalRelevance(diversity=0.2, top_n_words=15)
# representation_model = KeyBERTInspired()

topic_model = BERTopic(
    top_n_words=16,
    embedding_model=embedding_model,
    representation_model=representation_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    ctfidf_model=ctfidf_model,
    calculate_probabilities=True,
    nr_topics="auto",
)
topics, probs = topic_model.fit_transform(data['document'], embeddings)

print(topic_model.get_topic_info())

# Reduce outliers with two chained strategies
new_topics = topic_model.reduce_outliers(data['document'], topics, strategy="c-tf-idf", threshold=0.1)
new_topics = topic_model.reduce_outliers(data['document'], new_topics, strategy="distributions")

topic_model.update_topics(data['document'], topics=new_topics)
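Since calculate_probabilities=True, the probs returned by fit_transform could also drive the probabilities strategy of reduce_outliers; a minimal sketch along the same lines:

new_topics = topic_model.reduce_outliers(data['document'], topics, probabilities=probs, strategy="probabilities")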

BERTopic Version

1.6.1

@Hanqingxu123 Hanqingxu123 added the bug Something isn't working label Sep 13, 2024
@MaartenGr (Owner)

When you use .update_topics, you can specify the types of representations that you want. Since you didn't specify anything, it uses the default c-TF-IDF representation. Instead, you would have to use it like so:

topic_model.update_topics(data['document'], topics=new_topics, representation_model=representation_model, ctfidf_model=ctfidf_model)
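For the same reason, update_topics falls back to a default CountVectorizer when none is given, so it is presumably also worth passing the vectorizer_model (which carries stop_words="english"); a sketch reusing the objects from the reproduction above:

topic_model.update_topics(
    data['document'],
    topics=new_topics,
    vectorizer_model=vectorizer_model,  # carries stop_words="english"
    ctfidf_model=ctfidf_model,
    representation_model=representation_model,
)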

@Hanqingxu123 (Author)

Thanks for your reply!
