Time series of topics for large texts and a response variable #1258

dragonattheend · 2023-05-12T18:58:18Z

Thank you for making BERTopic available, and for the really helpful website and code examples.

I have a time series of fairly large texts (hundreds to thousands of words each). My dataset has four columns: id, time, text and response.

I've been looking for a way to get topics for each text (one per id and time) and then relate them (their probabilities and the changes in these probabilities) to my response variable. I want to see whether certain topics, and/or changes in the intensity of their discussion over time, are associated with my response variable (and its changes).

Since my texts are large, I split them into sentences. Half of the sentences were assigned to topic -1 (the outlier topic), and I intend to remove them.

Is it okay to group and average the probabilities of the remaining topics? Will it be representative of the whole text?

Is it okay to take time differences between probabilities for each topic?

It is very possible that I'm missing something and there is a simpler approach to what I'm trying to do. Please suggest if so. Thanks again!

MaartenGr · 2023-05-14T09:27:52Z
Maintainer

> Is it okay to group and average the probabilities of the remaining topics? Will it be representative of the whole text?

That generally should be okay if the sub-documents are of a relatively equal size.
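When the sentences differ a lot in length, one option is to weight each sentence by its word count before averaging. The following is only a minimal sketch under stated assumptions: the DataFrame layout, the column names (`n_words`, `p_0`, `p_1`) and the toy values are hypothetical stand-ins for whatever per-sentence probabilities you obtained (for example from a model fitted with `calculate_probabilities=True`).

```python
import pandas as pd

# Hypothetical per-sentence output: one row per sentence, with the parent
# text's id and time, the sentence length, the assigned topic, and one
# probability column per topic. All values here are toy data.
sentences = pd.DataFrame({
    "id":      [1, 1, 1, 1, 2, 2],
    "time":    [1, 1, 1, 2, 1, 1],
    "n_words": [10, 30, 12, 20, 15, 15],
    "topic":   [0, 1, -1, 0, 1, 0],
    "p_0":     [0.8, 0.1, 0.3, 0.6, 0.2, 0.7],
    "p_1":     [0.2, 0.9, 0.3, 0.4, 0.8, 0.3],
})
prob_cols = ["p_0", "p_1"]

# Drop sentences assigned to the outlier topic (-1).
kept = sentences[sentences["topic"] != -1]

# Length-weighted mean per (id, time):
# sum(prob * n_words) / sum(n_words) within each group.
weighted = kept.assign(**{c: kept[c] * kept["n_words"] for c in prob_cols})
grouped = weighted.groupby(["id", "time"])
doc_topics = (
    grouped[prob_cols].sum()
    .div(grouped["n_words"].sum(), axis=0)
    .reset_index()
)
print(doc_topics)
```

If all sentences have roughly the same length, this reduces to the plain average.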

> Is it okay to take time differences between probabilities for each topic?

If I understand you correctly, then I think so, yes, since the dynamic topic modeling in BERTopic does something similar.
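As a rough illustration of taking such time differences, here is a sketch that computes the change in each topic's probability between consecutive time points within each id. The table layout and the column names are assumptions for illustration (one row per id and time, one probability column per topic), and the values are toy data.

```python
import pandas as pd

# Hypothetical document-level topic probabilities, one row per (id, time).
doc_topics = pd.DataFrame({
    "id":   [1, 1, 1, 2, 2],
    "time": [1, 2, 3, 1, 2],
    "p_0":  [0.2, 0.5, 0.4, 0.7, 0.6],
    "p_1":  [0.8, 0.5, 0.6, 0.3, 0.4],
})
prob_cols = ["p_0", "p_1"]

# Sort so that differences are taken between consecutive time points.
doc_topics = doc_topics.sort_values(["id", "time"])

# First difference within each id; the first time point of every id has no
# predecessor, so its difference is NaN.
deltas = doc_topics.groupby("id")[prob_cols].diff().add_prefix("d_")
changes = pd.concat([doc_topics, deltas], axis=1)
print(changes)
```

Grouping by id before differencing keeps the last time point of one id from being subtracted from the first time point of the next.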

> It is very possible that I'm missing something and there is a simpler approach to what I'm trying to do. Please suggest if so. Thanks again!

If you are looking to statistically compare the response variable, then it might be worthwhile to check out this thread, which demonstrates the use of covariate analysis in BERTopic.
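Independently of that thread, a very simple first check is to correlate the topic probabilities (and their changes) with the response (and its changes). The sketch below is not the covariate analysis from the linked thread; the table layout, column names and toy values are assumptions. A plain correlation also ignores the repeated-measures structure of the data, so treat it as exploratory only.

```python
import pandas as pd

# Hypothetical panel: one row per (id, time) with document-level topic
# probabilities and the response. All values are toy data.
panel = pd.DataFrame({
    "id":       [1, 1, 1, 2, 2, 2],
    "time":     [1, 2, 3, 1, 2, 3],
    "p_0":      [0.2, 0.5, 0.4, 0.7, 0.6, 0.9],
    "p_1":      [0.8, 0.5, 0.6, 0.3, 0.4, 0.1],
    "response": [1.4, 2.0, 1.8, 2.4, 2.2, 2.8],
})
prob_cols = ["p_0", "p_1"]
panel = panel.sort_values(["id", "time"])

# Pearson correlation between each topic's probability and the response.
level_corr = panel[prob_cols].corrwith(panel["response"])

# Same check on within-id changes: first differences of the probabilities
# against first differences of the response (NaN rows are skipped).
diffs = panel.groupby("id")[prob_cols + ["response"]].diff()
change_corr = diffs[prob_cols].corrwith(diffs["response"])

print(level_corr)
print(change_corr)
```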

2 replies

dragonattheend May 15, 2023
Author

Thank you, this is very helpful.

Do you think it is okay to run separate models for each id and then try to overlap their results, like here?

MaartenGr May 15, 2023
Maintainer

You could do that but since their representations are meant to be quite different at certain timestamps, I am not sure whether that would work. Definitely worth trying out though.
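One way to try this out is to compare the topics of two separately fitted models by the cosine similarity of their topic embeddings and see which topics line up. The following is a sketch only: `emb_a` and `emb_b` are hypothetical stand-ins for the per-topic embedding matrices of two models (for instance whatever each model exposes as `topic_embeddings_`), and the comparison is only meaningful if both models used the same embedding model.

```python
import numpy as np

# Stand-ins for the topic embeddings of two separately fitted models:
# one row per topic, one column per embedding dimension (toy values).
emb_a = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])
emb_b = np.array([[0.1, 0.9, 0.0],
                  [0.9, 0.1, 0.1],
                  [0.0, 0.0, 1.0]])

def cosine_similarity_matrix(a, b):
    """Return the matrix of cosine similarities between rows of a and b."""
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_norm @ b_norm.T

sim = cosine_similarity_matrix(emb_a, emb_b)

# For each topic in model A, the most similar topic in model B and its score.
best_match = sim.argmax(axis=1)
best_score = sim.max(axis=1)

for topic_a, (topic_b, score) in enumerate(zip(best_match, best_score)):
    print(f"model A topic {topic_a} -> model B topic {topic_b} ({score:.2f})")
```

A low best score for a topic suggests that it has no real counterpart in the other model, which is the situation the caveat above warns about.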
