How does CountVectorizer work in BERTopic? I am also in a dilemma #2226
hari-chalise started this conversation in General
Replies: 3 comments 2 replies
-
The CountVectorizer is used for the clustered data. I would advise looking through the documentation here <https://maartengr.github.io/BERTopic/algorithm/algorithm.html> and highly advise going through the tutorial here <https://www.maartengrootendorst.com/blog/topicbert/>, as I shared before in a previous issue you opened with the same question. This will give you the intuition you need, since you'll implement it yourself.
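For reference, a minimal sketch of how a custom CountVectorizer is passed to BERTopic (the dataset and parameters are illustrative):

```python
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

# Sample documents (the dataset many BERTopic examples use)
docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]

# Stop-word removal is configured here, but it only runs AFTER clustering,
# on the original strings of each cluster; it never touches the embeddings.
vectorizer_model = CountVectorizer(stop_words="english")

topic_model = BERTopic(vectorizer_model=vectorizer_model)
topics, probs = topic_model.fit_transform(docs)
```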
-
Dear MaartenGr/BERTopic, I found and read your documentation, but I am still confused; I have not been able to grasp the ideas from it. My question is: why do we perform tokenization and stop-word removal after clustering, and not in the initial stage? And how does tokenization actually work at this stage? After embedding, dimensionality reduction, and clustering we already have numeric (vector) data, but tokenization is performed on the original text, and stop words are text as well. Please clarify these points for me and, if possible, clarify with an example how clustering and topic representation relate.
My main question is: which data is used in the CountVectorizer step? In your previous response on GitHub you said the CountVectorizer is applied to the whole data, but the clustered data is the numeric data produced after embedding and tokenization. Please clarify this step for me. I have a proposal defense in one day, so I need to get this clear. Finally, please show me with an example what exactly the output of topic modeling is. In my understanding, if there is "football, cricket, table tennis, hockey", then this is represented by a "sports" topic.
With regards
Hari Lal Chalise
MSC.CSIT
TU Nepal
-
Thank you for the clarification.
But if you use the documents after clustering, how are the embeddings used in the whole process? How exactly are these steps interrelated? If the embeddings are used for dimensionality reduction and clustering, when and how are they used in the next step? You say the whole documents are used as the CountVectorizer's input.
Really, thank you Maarten for your quick response; that means a lot to me.
On Tue, Nov 26, 2024 at 8:54 PM Maarten Grootendorst wrote:
I *really* urge you to follow the tutorial previously mentioned here <https://www.maartengrootendorst.com/blog/topicbert/>. Click that link and follow the tutorial. It is a step-by-step guide to implementing the very basics of BERTopic. You can find all the answers there if you run that code and inspect the variables.
I can give a brief answer to the following questions, but you will have to follow the tutorial for the rest:
> My question is: why do we perform tokenization and stop-word removal after clustering, and not in the initial stage?

Because we use embeddings; these are contextual and more accurate if you keep the original text intact.
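As a quick illustration of that point (the model name is just an example), removing stop words before embedding can change the meaning the model sees:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

raw = "I am not a fan of cricket"  # full sentence, context intact
stripped = "fan cricket"           # stop words removed too early: the negation is gone

embeddings = model.encode([raw, stripped])
print(util.cos_sim(embeddings[0], embeddings[1]))  # the two are no longer equivalent
```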
> My main question is: which data is used in the CountVectorizer step?

Each cluster has a number of textual documents (these are strings). All documents (not the numeric values) are passed to the CountVectorizer.
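In other words, the CountVectorizer consumes plain strings; a minimal sketch (not BERTopic's internal code):

```python
from sklearn.feature_extraction.text import CountVectorizer

# The input is plain strings, e.g. the documents that landed in one cluster
cluster_docs = [
    "football match results and scores",
    "the cricket world cup final",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(cluster_docs)  # tokenization happens here, on text
print(vectorizer.get_feature_names_out())
# ['cricket' 'cup' 'final' 'football' 'match' 'results' 'scores' 'world']
```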
> In your previous response on GitHub you said the CountVectorizer is applied to the whole data, but the clustered data is the numeric data produced after embedding and tokenization. Please clarify this step for me.

After clustering, I use the documents themselves, not the embeddings. You can find a nice example of this in the tutorial I shared at the top; please follow it.
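A rough sketch of what that grouping looks like (the labels here are hypothetical; BERTopic's internals differ in detail):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "football match today",
    "cricket world cup",
    "stock prices rose sharply",
    "the market crashed again",
]
labels = [0, 0, 1, 1]  # hypothetical cluster labels from HDBSCAN on reduced embeddings

# Group the ORIGINAL strings by cluster label
docs_per_cluster = {}
for doc, label in zip(docs, labels):
    docs_per_cluster.setdefault(label, []).append(doc)

# One concatenated "document" per cluster, then count words per cluster
joined = [" ".join(docs_per_cluster[label]) for label in sorted(docs_per_cluster)]
bag_of_words = CountVectorizer(stop_words="english").fit_transform(joined)
```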
> Finally, please show me with an example what exactly the output of topic modeling is. In my understanding, if there is "football, cricket, table tennis, hockey", then this is represented by a "sports" topic.

That is correct. You can find more examples throughout the documentation, in the many examples in the README, and in the link I provided you previously.
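Concretely, a fitted model's topics look something like this (reusing the `topic_model` fitted in the sketch near the top of this thread; the scores are illustrative):

```python
# Each topic is a ranked list of (word, c-TF-IDF weight) pairs:
print(topic_model.get_topic(0))
# [('football', 0.042), ('cricket', 0.038), ('tennis', 0.031), ('hockey', 0.027), ...]
# A human reading this ranked word list would label the topic "sports".
```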
-
I have a question. After embedding, the data are converted into vectors, which are then reduced to a lower dimension and clustered. But in the BERTopic documentation, after clustering the CountVectorizer is used for tokenization and stop-word removal. My question is: how is that done? You answered that the CountVectorizer is used on the whole documents, but is the clustered data not used in tokenization? Or how does clustering connect to the next step? I am so confused. @MaartenGr please clarify with an example.
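For anyone landing here with the same confusion, a minimal end-to-end sketch of how the steps hand off to one another (library choices follow the BERTopic docs; parameters are illustrative):

```python
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]

# 1. Embedding: raw text -> high-dimensional vectors (the text is left intact)
embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(docs)

# 2. Dimensionality reduction: vectors -> low-dimensional vectors
reduced = UMAP(n_components=5, metric="cosine").fit_transform(embeddings)

# 3. Clustering: each document gets a cluster label (-1 marks HDBSCAN outliers)
labels = HDBSCAN(min_cluster_size=15).fit_predict(reduced)

# 4. Topic representation: go BACK to the original strings, grouped by label;
#    only now do tokenization and stop-word removal happen
docs_per_topic = {}
for doc, label in zip(docs, labels):
    docs_per_topic.setdefault(label, []).append(doc)

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(" ".join(cluster) for cluster in docs_per_topic.values())
```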