-
Notifications
You must be signed in to change notification settings - Fork 780
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add custom labels to get topic tree #2125
base: master
Are you sure you want to change the base?
Add custom labels to get topic tree #2125
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! The code looks solid and I left minor comments.
One other thing, would you mind expanding the tests at tests/test_variations/test_hierarchy
(test_tree
) a bit to include the custom_labels=True
?
if isinstance(custom_labels, str): | ||
for topic, kws_info in self.topic_aspects_[custom_labels].items(): | ||
label = "_".join([kw[0] for kw in kws_info[:5]]) # displaying top 5 kws | ||
left_labels[topic] = label |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add a check for when a user attempts to use a specific aspect but it is not found in self.topic_aspects_
? I can imagine that a user (myself included) would mistype the name and the resulting error might not be entirely clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes sense, I can add a check here. One thing I noticed when looking through how custom labels were handled in the plotting functions is that they all follow a pretty similar pattern:
BERTopic/bertopic/plotting/_topics.py
Line 67 in 510c15e
if isinstance(custom_labels, str): |
BERTopic/bertopic/plotting/_heatmap.py
Line 99 in 510c15e
if isinstance(custom_labels, str): |
BERTopic/bertopic/plotting/_datamap.py
Line 123 in 510c15e
if isinstance(custom_labels, str): |
if isinstance(custom_labels, str): |
BERTopic/bertopic/plotting/_documents.py
Line 139 in 510c15e
if isinstance(custom_labels, str): |
Maybe a more generic and reusable function can be created for this, which would check if the user specified aspect exists in self.topic_aspects_
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, it makes no sense to do that check here but not at all of those other instances. For now, I'm alright either way. I can imagine wanting to stick with what I already use everywhere else for consistency as this might be a bit out of the scope of this particular PR. So it's up to you.
label = "_".join([kw[0] for kw in kws_info[:5]]) # displaying top 5 kws | ||
left_labels[topic] = label | ||
elif self.custom_labels_ is not None and custom_labels: | ||
left_labels = {topic_id: label for topic_id, label in enumerate(self.custom_labels_, -self._outliers)} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just wanted to mention that I am going to steal that enumerate trick. That is so much nicer coding than how I have approached it thus far.
left_labels = {} | ||
if isinstance(custom_labels, str): | ||
for topic, kws_info in self.topic_aspects_[custom_labels].items(): | ||
label = "_".join([kw[0] for kw in kws_info[:5]]) # displaying top 5 kws |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that some topic aspects might not have 5 keywords but actually a single label. In those cases, the kws_info[0]
is likely to be filled with a label (or even a summary) and each instance in kws_info[1:]
will be an empty string. As a result, you might get the following label: "artificial intelligence____".
What does this PR do?
get_topic_tree
function currently does not accept custom_labels as a parameter whereas the other visualization methods do. Therefore this change aims to fix that issue so custom labels can be accepted and make the visualization more intuitive.I really enjoyed looking through the codebase here. My first PR here so still learning the project structure, would appreciate any feedback. Thanks!
Fixes # #2123
Before submitting