
[Topic Models] Understand various shades of “Other/None of the Above” #1876

Open
jucor opened this issue Jan 21, 2025 · 4 comments
Labels
feature-request For new feature suggestions

Comments

@jucor
Contributor

jucor commented Jan 21, 2025

Most topic classifications, including those produced by humans, have an “Other” class, which humans understand as “None of the above”. Some topic models explicitly model one or several “noise” classes.
When we use a topic model library, we will want to ensure that the “Other” class matches our human users' expectations. In particular, some libraries use “Other” as a catch-all that includes both “none of the above” and “algorithm could not determine”. While this can seem a subtle nuance, there is a real difference:

  • “None of the above” is a property of the comment: the classifier looked at all the above and said “this does not belong to any of them”. For an automated classifier, that can mean “I’m sure that it does not belong”.
  • “Algorithm could not determine” is a property of the algorithm. It could, for example, mean that a mixture algorithm failed to converge, that the algorithm cannot handle some of the text in the comment, or that there was an unexpected error.

That level of detail then allows us to better understand what our failure cases are, what we cover, and what might be missing from those topic classifications. It becomes even more important when the “Other” category is large.
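To make the distinction concrete, here is a minimal sketch (not the actual API of any topic-model library; all names are illustrative) of a classification result type that keeps the two shades of “Other” separate, using a toy keyword matcher as the classifier:

```python
# Sketch: keep "none of the above" (a fact about the comment) distinct from
# "algorithm could not determine" (a failure of the algorithm).
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional

class Outcome(Enum):
    TOPIC = auto()               # matched one of the known topics
    NONE_OF_THE_ABOVE = auto()   # examined and fits none of the topics
    UNDETERMINED = auto()        # the algorithm could not produce an answer

@dataclass
class ClassificationResult:
    outcome: Outcome
    topic: Optional[str] = None  # set only when outcome is TOPIC
    error: Optional[str] = None  # set only when outcome is UNDETERMINED

def classify(comment: str, topics: list[str]) -> ClassificationResult:
    """Toy keyword classifier with explicit, separate failure modes."""
    if not comment.strip():
        # The algorithm cannot handle empty or deleted content: UNDETERMINED.
        return ClassificationResult(Outcome.UNDETERMINED, error="empty comment")
    for topic in topics:
        if topic.lower() in comment.lower():
            return ClassificationResult(Outcome.TOPIC, topic=topic)
    # The comment was examined and matched nothing: NONE_OF_THE_ABOVE.
    return ClassificationResult(Outcome.NONE_OF_THE_ABOVE)
```

A real model would replace the keyword loop, but the point is the return type: downstream reporting can then count the two shades separately instead of lumping both into one “Other” bucket.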

@cianbrassilg

cianbrassilg commented Jan 22, 2025

Great points here @jucor - this is an area we've been looking to improve recently also. We've also been considering that:

  • If the user is providing the categories, then a larger other/none category may make sense, depending on the categories provided, if many comments don't logically fit.
  • If the model is generating the categories, then we should likely expect significantly fewer 'other'-categorized comments.

Curious if in your testing you've come across typical comment types that fit the "algorithm could not determine" type category you mentioned?

@jucor
Contributor Author

jucor commented Jan 22, 2025

Hi @cianbrassilg !
[Note: shall we move the discussion specific to Jigsaw's sensemaking-tools into the issue opened in the repository specific to that product/library ? here: Jigsaw-Code/sensemaking-tools#10 ]

Great points here @jucor - this is an area we've been looking to improve recently also.

Terrific :)

We've also been considering that:

  • If the user is providing the categories, then a larger other/none category may make sense, depending on the categories provided, if many comments don't logically fit.

That makes sense to me! Might be worth then detecting it and issuing a warning at the end, sort of "You might want to re-run with automatic discovery of categories".

  • If the model is generating the categories, then we should likely expect significantly fewer 'other'-categorized comments.

Agreed.

Curious if in your testing you've come across typical comment types that fit the "algorithm could not determine" type category you mentioned?

Yes, for example when I ran the BG2018-short "2018 BG with vote tallies (filtered) - comments-with-votes-small" example spreadsheet provided by @metasoarous: more than 70% of the comments ended up as "algorithm could not determine". I suspect (but did not verify) that this is because the spreadsheet had a lot of comments that were not filtered out but whose content had been deleted.

I also remember @DZNarayanan mentioning that "Other" is often pretty big, and someone on your team mentioning that it's often the biggest category. So as we investigate why, I think ruling out "algorithm could not determine" would be the first thing to check (and since doing that automatically is just a code change, it would be easier than doing it manually).

@DZNarayanan
Collaborator

Hi @cianbrassilg,

It would be helpful if you could modify the code to mark which statements in the "Other" category fall under "algorithm could not determine" and which fall under "none of the above". That will make it easier to find patterns and then figure out how to reduce the size of the "Other" category.

Thanks.
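[Editor's sketch of the marking suggested above, with hypothetical label strings; any per-comment tagging scheme the library actually uses would work the same way. It splits an "Other" bucket into its two shades and flags it when it is large, echoing the warning idea discussed earlier in the thread:]

```python
# Sketch: summarize per-comment labels, keeping the two shades of "Other"
# separate, and warn when their combined share is large.
from collections import Counter

NONE_OF_THE_ABOVE = "none_of_the_above"           # the comment fits no topic
UNDETERMINED = "algorithm_could_not_determine"    # the algorithm failed

def summarize_other(labels: list[str], warn_threshold: float = 0.3) -> dict:
    """Count each shade of 'Other' and flag the bucket when it is large."""
    counts = Counter(labels)
    total = len(labels) or 1  # avoid division by zero on an empty run
    other_share = (counts[NONE_OF_THE_ABOVE] + counts[UNDETERMINED]) / total
    return {
        "none_of_the_above": counts[NONE_OF_THE_ABOVE],
        "undetermined": counts[UNDETERMINED],
        "other_share": other_share,
        # A large "Other" might suggest re-running with automatic
        # discovery of categories, as proposed earlier in the thread.
        "warn": other_share > warn_threshold,
    }
```

With the shades separated like this, a large "undetermined" count points at an algorithmic problem to fix, while a large "none_of_the_above" count points at missing categories.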

@cianbrassilg

cianbrassilg commented Jan 22, 2025

shall we move the discussion specific to Jigsaw's sensemaking-tools into the issue opened in the repository specific to that product/library ?
@jucor Yes, sounds good! We can continue there 👍

mark which statements in the "Other" category fall under "algorithm could not determine” and which ones "none of the above."
@DZNarayanan Good suggestion, will discuss with the team here also
