Dear Author, hello! I am a master's student using BERTopic for bibliometric analysis, on the theme of understanding how AI can help the development of medicine. Two views are on the table:
- My supervisor's view: I should obtain only 6-8 larger topics, so that they can be better understood.
- My view: I should obtain more small topics (20+) and then summarize them into 6-7 large topics through hierarchical modeling or manual analysis.
Dear developers and researchers with bibliometric experience, could you please take a look and say which approach is more accurate? Thank you! Example of a big topic: "Medical Image Processing".
Let me start by saying that I'm certainly not going to go between your supervisor and yourself. I don't think that's a particularly helpful way of approaching a problem like this. Rather, I believe it's best to highlight the different perspectives you can take and evaluate them according to the standards/metrics/philosophy that fits best with the particular use case or application.

That said, there are two things mentioned.

The number and size of topics

There is something to be said for both views. If you directly get 6-7 large topics, then they are typically modeled directly by the underlying clustering (which is not the case with the hierarchical approach). Moreover, from a psychological perspective, having fewer topics sometimes helps readers/users understand more easily what the corpus is about. Information overload is a real thing that shouldn't be underestimated.

Generating smaller topics first and then using hierarchical topic modeling is a nice strategy for getting multi-level topics. It gives you a more thorough understanding of high-level topics by looking at the smaller topics they contain. Because the merging strategy simply picks the best available merge at each step (even when all choices are poor), you cannot be certain that the 6-7 topics you obtain this way are of the same quality as those you would get by directly modeling 6-7 topics. You can, however, more easily identify when the model should stop merging topics. This is an important feature, as 6-7 topics might not be ideal. What about 15? Or 45? With hierarchical topic modeling, you can identify this more easily.

Outliers

Documents not being classified isn't inherently good or bad; again, it depends on both the use case and your perspective. From a topic representation perspective (i.e., what are the names of the topics?), I see no problem with ignoring part of the corpus, since we are only interested in representing the topics and naming them.
For topic assignment (i.e., which document belongs to which topic), I can definitely imagine wanting to reduce the number of outliers, especially if those documents are quite important. But here's the thing: when you fit BERTopic, the outlier documents do not influence the topic representations that are created. As such, you can easily just reduce outliers after training with `reduce_outliers` without changing the topics themselves.