Extracted keyword bigrams/trigrams interpretability #719
Let me start by mentioning that I would suggest upgrading to BERTopic v0.12. Not only does it have significantly more features and more stability, it also allows you to disable lowercasing of the documents, as you can find here. That way, you can do the following instead to make sure it works for German:

```python
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

vectorizer_model = CountVectorizer(lowercase=False)
topic_model = BERTopic(vectorizer_model=vectorizer_model)
```

Then, as you mentioned, you could also work with a ClassTfidfTransformer that reduces the impact of frequent words:

```python
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer

ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)
topic_model = BERTopic(ctfidf_model=ctfidf_model)
```

As a final note, you could look into other tokenizers and pass them into the CountVectorizer instead, like spaCy or SoMaJo. I found those after a quick search, so there might be better options out there.
Maarten, first thanks for the great library!
I would be grateful for advice on how to extract meaningful keywords. In _bertopic.py (v0.9.4), in the function _preprocess_text you lowercase the documents, which loses the case information that German nouns rely on.
Here is my config:
```python
{
    "embedding_model": "paraphrase-multilingual-MiniLM-L12-v2",
    "umap_model": UMAP(n_neighbors=15, n_components=5, metric="cosine", low_memory=False),
    "hdbscan_model": HDBSCAN(min_cluster_size=10, metric="euclidean", prediction_data=True),
    "top_n_words": 10,
    "language": "multilingual",
    "min_topic_size": 20,
    "verbose": True,
    "nr_topics": "auto",
    "calculate_probabilities": True,
    "diversity": 1.0,
}
```
and the vectorizer_model params are:

```python
{
    "min_df": 0.01,
    "max_df": 0.3,
    "ngram_range": (1, 3),
    "lowercase": False,
    "stop_words": [...],  # a list of very common German stopwords, like der, das, ich, du, etc.
}
```
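As a side note on those thresholds, here is a small sketch (toy corpus and values of my own, not from the thread) of how `max_df` filters out overly frequent terms such as German articles before they can pollute the topic keywords:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy German news-like corpus; thresholds adapted to its tiny size
docs = [
    "die Regierung plant neue Steuern",
    "die Opposition kritisiert die Regierung",
    "die Wahl findet im Herbst statt",
    "neue Umfragen vor der Wahl",
]
vectorizer = CountVectorizer(
    ngram_range=(1, 3),  # unigrams, bigrams and trigrams
    max_df=0.5,          # drop terms appearing in more than half the documents
    lowercase=False,
)
vectorizer.fit(docs)
# "die" appears in 3 of 4 documents (75% > 50%), so it is filtered out;
# "Regierung" appears in exactly 2 of 4 (50%, not strictly above), so it stays
print("die" in vectorizer.vocabulary_)
```

Note that `max_df` removes terms whose document frequency is strictly above the threshold, so it complements rather than replaces an explicit stopword list.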
My corpus is in German; assume I am topic modelling general news.
I am passing the text as-is (no lemmatization, no stopword removal), only cleaning it up per the advice on your Tips & Tricks page.
I see the following problems in the extracted keywords (get_topics, get_topic):
What would be the advised strategy to fix these problems? I am thinking about:
Thank you in advance,
Tomasz