Extracted keyword bigrams/trigrams interpretability #719
Let me start by mentioning that I would suggest upgrading to BERTopic v0.12. Not only does it have significantly more features and more stability, it also allows you to disable lowercasing of the documents, as you can find here. That way, you can do the following instead to make sure it works for German:

```python
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

vectorizer_model = CountVectorizer(lowercase=False)
topic_model = BERTopic(vectorizer_model=vectorizer_model)
```

Then, as you mentioned, you could also work with a ClassTfidfTransformer that reduces the impact of frequent words:

```python
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer

ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)
topic_model = BERTopic(ctfidf_model=ctfidf_model)
```

As a final note, you could look into other tokenizers and pass them into the CountVectorizer instead, like spaCy or SoMaJo. I found those after a quick search, so there might be better options out there.
Maarten, first thanks for the great library!
I would be grateful for advice on how to extract meaningful keywords. In _bertopic.py (v0.9.4), in the function _preprocess_text you lowercase the documents, which loses the case information that German nouns rely on.
Here is my config:
```python
{
    "embedding_model": "paraphrase-multilingual-MiniLM-L12-v2",
    "umap_model": UMAP(n_neighbors=15, n_components=5, metric="cosine", low_memory=False),
    "hdbscan_model": HDBSCAN(min_cluster_size=10, metric="euclidean", prediction_data=True),
    "top_n_words": 10,
    "language": "multilingual",
    "min_topic_size": 20,
    "verbose": True,
    "nr_topics": "auto",
    "calculate_probabilities": True,
    "diversity": 1.0,
}
```
and the vectorizer_model params are:

```python
{
    "min_df": 0.01,
    "max_df": 0.3,
    "ngram_range": (1, 3),
    "lowercase": False,
    "stop_words": [...],  # a list of very common German stopwords, like der, das, ich, du, etc.
}
```
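As a side note on those thresholds, here is a small sketch (toy corpus and values of my own, not from the thread) of how `max_df` filters out overly frequent terms such as German articles before they can pollute the topic keywords:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy German news-like corpus; thresholds adapted to its tiny size
docs = [
    "die Regierung plant neue Steuern",
    "die Opposition kritisiert die Regierung",
    "die Wahl findet im Herbst statt",
    "neue Umfragen vor der Wahl",
]
vectorizer = CountVectorizer(
    ngram_range=(1, 3),  # unigrams, bigrams and trigrams
    max_df=0.5,          # drop terms appearing in more than half the documents
    lowercase=False,
)
vectorizer.fit(docs)
# "die" appears in 3 of 4 documents (75% > 50%), so it is filtered out;
# "Regierung" appears in exactly 2 of 4 (50%, not strictly above), so it stays
print("die" in vectorizer.vocabulary_)
```

Note that `max_df` removes terms whose document frequency is strictly above the threshold, so it complements rather than replaces an explicit stopword list.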
My corpus is in German; assume I am topic modelling general news.
I am passing the text as-is (no lemmatization, no stopword removal), only cleaning it up per the advice on your Tips & Tricks page.
I see the following problems in the extracted keywords (get_topics, get_topic):
What would be the advised strategy to fix these problems? I am thinking about:
Thank you in advance,
Tomasz