How to assign documents to the correct topic? #2259

SongLin99 · 2024-12-30T06:44:33Z

Dear developers and anyone who sees this,

After running topic_model.fit_transform, my 10,000 documents were assigned to 10 topics, but some of the assignments do not look reasonable.
For example, document1, "Advances in intracranial electroencephalography (iEEG) and neurophysiology have enabled the study of previously inaccessible brain regions with high fidelity temporal and spatial resolution....", should be assigned to Topic 5 (EEG and BCI).
However, document1 was assigned to Topic 6 (ADHD and cognition).

I think I need to tune some parameters to fix this. Here is what I have tried so far:
1. Maybe the number of UMAP components is too low to tell Topic 3 and Topic 6 apart, so I set n_components to more than 20 (not helpful).
2. I wanted to check the contribution of each word with topic_model.visualize_approximate_distribution, but, as shown in the picture below, the result is different from probs.

How can I get the assignment right?

MaartenGr (Maintainer) · 2025-01-03T07:16:03Z

It's very difficult to say anything without knowing a bit more about your code. Could you share your full code for creating the topic model? Most important are how you initialized BERTopic and any processing you did afterwards. Also, which version of BERTopic are you using?

> I want to check contribution of each words by topic_model.visualize_approximate_distribution, but as shown in picture below, the result is different from probs.

That is to be expected, as this method uses a different technique for calculating the probabilities, hence the "approximate" in the function's name.

If you are missing a particular topic, it might help to tweak the parameters of approximate_distribution a bit, for example by lowering the minimum similarity threshold or widening the window.
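As a rough illustration of that kind of tweak, the sketch below calls approximate_distribution with a wider window and a lower similarity threshold, then lists the strongest topics per document. It assumes a fitted BERTopic model is passed in as topic_model and that column i of the returned distribution corresponds to topic i (outlier topic excluded). The helper name strongest_topics and the values 8 and 0.05 are hypothetical starting points for experimentation, not recommended settings.

```python
def strongest_topics(topic_model, docs, window=8, stride=1,
                     min_similarity=0.05, top_k=3):
    """Return, per document, up to top_k (topic_index, weight) pairs
    taken from the approximate topic distribution."""
    # approximate_distribution returns a pair:
    # (topic distributions, token-level distributions).
    # The second element is not needed here.
    topic_distr, _ = topic_model.approximate_distribution(
        docs,
        window=window,                  # tokens per sliding window
        stride=stride,                  # step between windows
        min_similarity=min_similarity,  # lower = keep weaker topic matches
    )
    results = []
    for row in topic_distr:
        weights = [float(w) for w in row]
        # Rank topic indices by weight, strongest first.
        ranked = sorted(range(len(weights)),
                        key=lambda i: weights[i], reverse=True)
        # Keep only topics that actually received some weight.
        results.append([(i, weights[i]) for i in ranked[:top_k]
                        if weights[i] > 0])
    return results


# Usage with a fitted model (not run here):
# for topic_index, weight in strongest_topics(topic_model, [document1])[0]:
#     print(topic_index, round(weight, 3))
```

If the expected topic only shows up after lowering min_similarity, that suggests the document matches it weakly at the token-window level rather than not at all.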

SongLin99 Jan 3, 2025
Author

Hello MaartenGr,
Thank you for your help.
Of course. I am running topic modeling on 10,000 articles, with all-mpnet-base-v2 as the embedding model. My code is below.

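```python
from bertopic import BERTopic
from hdbscan import HDBSCAN
from umap import UMAP

# UMAP (dimensionality reduction) settings
UMAP_neighbors = 100
UMAP_components = 20
UMAP_min_dist = 0.0
UMAP_metric = 'cosine'

# HDBSCAN (clustering) settings
HDB_size = 200
HDB_samples = 35

umap_model = UMAP(
    n_neighbors=UMAP_neighbors,
    n_components=UMAP_components,
    metric=UMAP_metric,
    min_dist=UMAP_min_dist,
    random_state=40,
)
hdbscan_model = HDBSCAN(
    min_cluster_size=HDB_size,
    min_samples=HDB_samples,
    metric='euclidean',
    cluster_selection_method='eom',
    prediction_data=True,
)

# embedding_model (all-mpnet-base-v2), vectorizer_model, representation_model,
# abstract_list, and embeddings are defined earlier and are not shown here.
topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    representation_model=representation_model,
    top_n_words=20,
    verbose=True,
    calculate_probabilities=True,
    n_gram_range=(1, 3),
    # nr_topics='auto'
)

# Pass precomputed embeddings so the documents are not re-embedded.
topics, probs = topic_model.fit_transform(abstract_list, embeddings)
```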

I would like to know whether my UMAP and HDBSCAN parameters are suitable for this task.
Thank you again!
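For reference, one way to see how close the call was for a single document is to rank that document's row of probs and compare it with the assigned topic. This is only a sketch: it assumes probs is the 2-D array returned by fit_transform with calculate_probabilities=True, and that column i corresponds to topic i (the outlier topic -1 has no column). The names top_topics_from_probs and doc_index are hypothetical and introduced purely for illustration.

```python
def top_topics_from_probs(probs, topics, doc_index, top_k=3):
    """Compare the assigned topic of one document with the topics that
    have the highest probability in its row of probs."""
    row = [float(p) for p in probs[doc_index]]
    # Topic indices sorted by probability, highest first.
    ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:top_k]
    return {
        "assigned_topic": int(topics[doc_index]),
        "top_by_probability": [(i, row[i]) for i in ranked],
    }


# Usage with the fitted model's outputs (not run here):
# print(top_topics_from_probs(probs, topics, doc_index=0))
```

If the expected topic is a close second in this ranking, the document likely sits near the boundary between two clusters rather than being clearly misplaced.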

MaartenGr Jan 8, 2025
Maintainer

Which version of BERTopic are you using? Did you make sure you are using the latest version?
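A quick way to answer the version question is to read the installed package metadata. The sketch below uses only the standard library; the helper name installed_version is hypothetical, and it assumes the package was installed under the distribution name "bertopic".

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package="bertopic"):
    """Return the installed version string of a package, or None if it
    is not installed in the current environment."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


print("BERTopic version:", installed_version())
```

If the reported version is behind the latest release, upgrading with `pip install --upgrade bertopic` and re-running the model is a reasonable first step before tuning parameters.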
