How does CountVectorizer work in BERTopic? I am also in a dilemma #2226
hari-chalise started this conversation in General
Replies: 3 comments 2 replies
-
The CountVectorizer is used for the clustered data. I would advise looking through the documentation here <https://maartengr.github.io/BERTopic/algorithm/algorithm.html> and highly advise going through the tutorial here <https://www.maartengrootendorst.com/blog/topicbert/>, as I shared before in a previous issue you opened with the same question. This will give you the intuition you need, since you'll implement it yourself.
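For reference, a minimal sketch of how a custom CountVectorizer is passed to BERTopic (the dataset and parameters are illustrative):

```python
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

# Sample documents (the dataset many BERTopic examples use)
docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]

# Stop-word removal is configured here, but it only runs AFTER clustering,
# on the original strings of each cluster; it never touches the embeddings.
vectorizer_model = CountVectorizer(stop_words="english")

topic_model = BERTopic(vectorizer_model=vectorizer_model)
topics, probs = topic_model.fit_transform(docs)
```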
-
Dear MaartenGr/BERTopic, I found and read your documentation, but I am still confused; I have not been able to grasp the ideas from it. My question is: why do we perform tokenization and stop-word removal after clustering, and not in the initial stage? And how does tokenization actually work at this stage? After embedding, dimensionality reduction, and clustering we already have numeric (vector) data, but tokenization is performed on the original text, and stop words are text as well. Please clarify these points for me and, if possible, clarify with an example how clustering and topic representation relate.
My main question is: which data is used in the CountVectorizer step? In your previous response on GitHub you said the CountVectorizer is applied to the whole data, but the clustered data is the numeric data produced after embedding and tokenization. Please clarify this step for me. I have a proposal defense in one day, so I need to get this clear. Finally, please show me with an example what exactly the output of topic modeling is. In my understanding, if there is "football, cricket, table tennis, hockey", then this is represented by a "sports" topic.
With regards
Hari Lal Chalise
MSC.CSIT
TU Nepal
-
Thank you for the clarification.
But if you use the documents after clustering, how are the embeddings used in the whole process? How exactly are these steps interrelated? If the embeddings are used for dimensionality reduction and clustering, when and how are they used in the next step? You say the whole documents are used as the CountVectorizer's input.
Really, thank you Maarten for your quick response; that means a lot to me.
On Tue, Nov 26, 2024 at 8:54 PM Maarten Grootendorst wrote:
I *really* urge you to follow the tutorial previously mentioned here <https://www.maartengrootendorst.com/blog/topicbert/>. Click that link and follow the tutorial. It is a step-by-step guide to implementing the very basics of BERTopic. You can find all the answers there if you run that code and inspect the variables.
I can give a brief answer to the following questions, but you will have to follow the tutorial for the rest:
> My question is: why do we perform tokenization and stop-word removal after clustering, and not in the initial stage?

Because we use embeddings; these are contextual and more accurate if you keep the original text intact.
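As a quick illustration of that point (the model name is just an example), removing stop words before embedding can change the meaning the model sees:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

raw = "I am not a fan of cricket"  # full sentence, context intact
stripped = "fan cricket"           # stop words removed too early: the negation is gone

embeddings = model.encode([raw, stripped])
print(util.cos_sim(embeddings[0], embeddings[1]))  # the two are no longer equivalent
```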
> My main question is: which data is used in the CountVectorizer step?

Each cluster has a number of textual documents (these are strings). All documents (not the numeric values) are passed to the CountVectorizer.
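In other words, the CountVectorizer consumes plain strings; a minimal sketch (not BERTopic's internal code):

```python
from sklearn.feature_extraction.text import CountVectorizer

# The input is plain strings, e.g. the documents that landed in one cluster
cluster_docs = [
    "football match results and scores",
    "the cricket world cup final",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(cluster_docs)  # tokenization happens here, on text
print(vectorizer.get_feature_names_out())
# ['cricket' 'cup' 'final' 'football' 'match' 'results' 'scores' 'world']
```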
> In your previous response on GitHub you said the CountVectorizer is applied to the whole data, but the clustered data is the numeric data produced after embedding and tokenization. Please clarify this step for me.

After clustering, I use the documents themselves, not the embeddings. You can find a nice example of this in the tutorial I shared at the top; please follow it.
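A rough sketch of what that grouping looks like (the labels here are hypothetical; BERTopic's internals differ in detail):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "football match today",
    "cricket world cup",
    "stock prices rose sharply",
    "the market crashed again",
]
labels = [0, 0, 1, 1]  # hypothetical cluster labels from HDBSCAN on reduced embeddings

# Group the ORIGINAL strings by cluster label
docs_per_cluster = {}
for doc, label in zip(docs, labels):
    docs_per_cluster.setdefault(label, []).append(doc)

# One concatenated "document" per cluster, then count words per cluster
joined = [" ".join(docs_per_cluster[label]) for label in sorted(docs_per_cluster)]
bag_of_words = CountVectorizer(stop_words="english").fit_transform(joined)
```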
> Finally, please show me with an example what exactly the output of topic modeling is. In my understanding, if there is "football, cricket, table tennis, hockey", then this is represented by a "sports" topic.

That is correct. You can find more examples throughout the documentation, in the many examples in the README, and in the link I provided you previously.
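Concretely, a fitted model's topics look something like this (reusing the `topic_model` fitted in the sketch near the top of this thread; the scores are illustrative):

```python
# Each topic is a ranked list of (word, c-TF-IDF weight) pairs:
print(topic_model.get_topic(0))
# [('football', 0.042), ('cricket', 0.038), ('tennis', 0.031), ('hockey', 0.027), ...]
# A human reading this ranked word list would label the topic "sports".
```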
-
I have a question. After embedding, the data are converted into vectors, which are then reduced to a lower dimension and clustered. But in the BERTopic documentation, after clustering the CountVectorizer is used for tokenization and stop-word removal. My question is: how is that done? You answered that the CountVectorizer is used on the whole documents, but is the clustered data not used in tokenization? Or how does clustering connect to the next step? I am so confused. @MaartenGr please clarify with an example.
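For anyone landing here with the same confusion, a minimal end-to-end sketch of how the steps hand off to one another (library choices follow the BERTopic docs; parameters are illustrative):

```python
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]

# 1. Embedding: raw text -> high-dimensional vectors (the text is left intact)
embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(docs)

# 2. Dimensionality reduction: vectors -> low-dimensional vectors
reduced = UMAP(n_components=5, metric="cosine").fit_transform(embeddings)

# 3. Clustering: each document gets a cluster label (-1 marks HDBSCAN outliers)
labels = HDBSCAN(min_cluster_size=15).fit_predict(reduced)

# 4. Topic representation: go BACK to the original strings, grouped by label;
#    only now do tokenization and stop-word removal happen
docs_per_topic = {}
for doc, label in zip(docs, labels):
    docs_per_topic.setdefault(label, []).append(doc)

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(" ".join(cluster) for cluster in docs_per_topic.values())
```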