Top words for early 2022 are all Vietnamese? #78

Open
pgulley opened this issue Jun 24, 2024 · 3 comments
Labels
bug (Something isn't working), question (Further information is requested)

@pgulley
Member

pgulley commented Jun 24, 2024

image

@philbudne noted, while investigating the status of re-indexing data from 2022, that the top terms for a query from 2022-01-01 to 2022-12-31 seem to be populated entirely with Vietnamese words, despite Vietnamese not being among the top 10 languages represented!

@pgulley pgulley added the bug (Something isn't working) and question (Further information is requested) labels Jun 24, 2024
@pgulley pgulley self-assigned this Jun 26, 2024
@pgulley
Member Author

pgulley commented Jul 3, 2024

The lack of Vietnamese stopwords might be part of the issue, but it probably doesn't fully explain this.

@pgulley pgulley added this to the July milestone Jul 3, 2024
@pgulley pgulley moved this from Todo to Investigating in Ingest + Index Infrastructure Jul 3, 2024
@pgulley pgulley modified the milestones: 2 - July, 3 - August Jul 31, 2024
@pgulley pgulley modified the milestones: 3 - August, 4 - September Aug 28, 2024
@philbudne
Contributor

It may just be because I run queries against all stories when checking the progress of historical backfills, and we have some REALLY spammy .vn sources!!
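
As a rough illustration of the point above, here is a minimal sketch of how one repetitive source can dominate raw term counts when every story is included. The sample stories, the `domain` and `text` field names, and both helper functions are hypothetical, made up for this example; they are not taken from the indexer or from mc-providers.

```python
from collections import Counter

# Hypothetical sample data; the "domain" and "text" fields are assumptions.
stories = [
    {"domain": "news.example.com", "text": "election results announced today"},
    {"domain": "daily.example.org", "text": "election coverage continues"},
    # One spammy source repeating the same few words many times.
    {"domain": "spam.example.vn", "text": "và của là " * 50},
]


def term_counts_by_domain(stories):
    """Total number of whitespace-separated tokens contributed by each domain."""
    totals = Counter()
    for story in stories:
        totals[story["domain"]] += len(story["text"].lower().split())
    return totals


def top_terms(stories, n=3):
    """The n most frequent tokens across all stories, with no filtering."""
    counts = Counter()
    for story in stories:
        counts.update(story["text"].lower().split())
    return [term for term, _ in counts.most_common(n)]


if __name__ == "__main__":
    print(term_counts_by_domain(stories).most_common(1))
    print(top_terms(stories))
```

Grouping token totals by domain like this is one cheap way to check whether a handful of sources accounts for most of the top-term mass before looking at language-level causes.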

@philbudne
Contributor

philbudne commented Nov 3, 2024

Was traipsing thru mc-providers and noticed that there is no vi_stop_words.txt file in https://github.com/mediacloud/mc-providers/tree/main/mc_providers/language/
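
To make the effect of a missing stopword list concrete, here is a minimal sketch, assuming top terms are filtered against a per-language stopword set and that a language with no list gets no filtering at all. The `STOPWORDS` dictionary and the `top_terms` helper are hypothetical stand-ins for illustration only; this is not how mc-providers is known to load or apply its stopword files.

```python
from collections import Counter

# Hypothetical in-memory stand-in for per-language stopword files.
# There is deliberately no "vi" entry, mirroring the missing file.
STOPWORDS = {
    "en": {"the", "of", "and"},
}


def top_terms(tokens, language, n=2):
    """Most frequent tokens after removing that language's stopwords.

    If the language has no stopword list, nothing is filtered out.
    """
    stop = STOPWORDS.get(language, set())
    counts = Counter(t for t in tokens if t not in stop)
    return [term for term, _ in counts.most_common(n)]


en_tokens = "the cat and the hat of the cat".split()
vi_tokens = "và của và là của và bóng đá".split()

if __name__ == "__main__":
    print(top_terms(en_tokens, "en"))  # function words filtered out
    print(top_terms(vi_tokens, "vi"))  # function words dominate
```

With no Vietnamese list, common function words such as "và" and "của" rise to the top, while the English sample surfaces content words. This only shows how a missing list could skew results; it does not show that the missing file is the whole explanation here.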

Projects
Status: Investigating