[Topic Models] Understand various shades of “Other/None of the Above” #1876
Great points here @jucor - this is an area we've been looking to improve recently also. We've also been considering that:
Curious if in your testing you've come across typical comment types that fit the "algorithm could not determine" category you mentioned?
Hi @cianbrassilg !
Terrific :)
That makes sense to me! Might be worth then detecting it and issuing a warning at the end, sort of "You might want to re-run with automatic discovery of categories".
Agreed.
Yes. For example, when I ran the BG2018-short "2018 BG with vote tallies (filtered) - comments-with-votes-small" example spreadsheet provided by @metasoarous, more than 70% of the comments ended up as "algorithm could not determine". I suspect (but did not verify) that this is because the spreadsheet had a lot of comments that were not filtered out but whose content had been deleted. I also remember @DZNarayanan mentioning that "Other" is often pretty big, and one of your team mentioning that it's often the biggest category. So as we investigate why, I think ruling out "algorithm could not determine" would be the first thing to check for (and since doing it automatically is just a code change, that would be easier than doing it manually).
Hi @cianbrassilg, it would be helpful if you could modify the code to mark which statements in the "Other" category fall under "algorithm could not determine" and which fall under "none of the above". That will make it easier to find patterns and then figure out how to reduce the size of the "Other" category. Thanks.
Most topic classifications, including those produced by humans, have an "Other" class, which humans intend as "None of the above". Some topic models explicitly model one or several "noise" classes.
When we use a topic model library, we will want to ensure that its "Other" class matches our human users' expectations. In particular, some libraries use "Other" as a catch-all that covers both "none of the above" and "algorithm could not determine". While this can seem a subtle nuance, there is a real difference: "none of the above" means the comment was understood but fits none of the listed topics, whereas "algorithm could not determine" means the model lacked the signal to make any confident assignment at all.
That level of detail then allows us to better understand what are our failure cases, what we cover and what might be missing from those topic classifications. It becomes even more important if the “Other” category is large.
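To make the distinction concrete, here is a minimal sketch of how a classification step could separate the two cases. This is an illustrative assumption, not the code of any specific library: the `topic_scores` input, the explicit `"Other"` class, and the `CONFIDENCE_THRESHOLD` value are all hypothetical.

```python
# Hypothetical sketch: splitting the "Other" bucket into
# "none of the above" vs. "algorithm could not determine".
# Assumes the model returns a score per candidate topic, with "Other"
# as an explicitly modelled noise class (as some topic models do).

CONFIDENCE_THRESHOLD = 0.5  # illustrative cutoff, not from any library


def classify_with_other(topic_scores: dict[str, float]) -> str:
    """Return a topic label, distinguishing the two flavours of "Other".

    topic_scores maps each candidate topic (possibly including "Other")
    to the model's score for this comment.
    """
    if not topic_scores:
        # No scores at all, e.g. empty or deleted comment text.
        return "algorithm could not determine"
    best_topic, best_score = max(topic_scores.items(), key=lambda kv: kv[1])
    if best_score < CONFIDENCE_THRESHOLD:
        # Scores exist but none is confident: the model cannot decide.
        return "algorithm could not determine"
    if best_topic == "Other":
        # The model confidently places the comment outside all topics.
        return "none of the above"
    return best_topic
```

Labelling comments this way at classification time would let us report the two sub-categories separately and check, for datasets like the one above, how much of "Other" is really just undetermined input.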