
find/fix cases where source with pub_state isn't in appropriate state collection (US only) #8


Open
rahulbot opened this issue Dec 6, 2024 · 4 comments · May be fixed by #17

rahulbot commented Dec 6, 2024

We have at least one source that has pub_state=US-MA (Athol Daily News) but is not included in the Massachusetts, United States - State & Local collection (I just fixed this manually). I imagine this is either (a) an oversight from when it was created or (b) an error from the massive media merge.

It wouldn't be too hard to run a script to find and list these for manual fixing, something like:

build a manual lookup table that maps from state code to collection
for each media source with a pub_state in the 50 US options (US-MA, US-AK, etc):
    check the lookup table for the collection it _should_ be in
    if the source isn't in that collection, add it to a list with the source id and collection id
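
The steps above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the collection IDs in the lookup table are placeholders, and each source is assumed to be a plain dict with `id`, `pub_state`, and `collection_ids` keys. How those records are actually fetched is left out.

```python
# Hypothetical lookup table: state code -> ID of that state's
# "State & Local" collection. The IDs are placeholders, not real ones.
STATE_TO_COLLECTION = {
    "US-MA": 1001,
    "US-AK": 1002,
    # ... one entry per US state
}


def find_missing_memberships(sources, state_to_collection):
    """Return (source_id, collection_id) pairs for sources that have a
    pub_state but are not in the matching state collection."""
    missing = []
    for source in sources:
        expected = state_to_collection.get(source.get("pub_state"))
        if expected is None:
            continue  # no pub_state, or a state code not in the table
        if expected not in source.get("collection_ids", []):
            missing.append((source["id"], expected))
    return missing


# Made-up records in the assumed shape (only the first name is from this issue).
sources = [
    {"id": 1, "name": "Athol Daily News", "pub_state": "US-MA", "collection_ids": []},
    {"id": 2, "name": "Example Gazette", "pub_state": "US-MA", "collection_ids": [1001]},
    {"id": 3, "name": "No State Paper", "pub_state": None, "collection_ids": []},
]

print(find_missing_memberships(sources, STATE_TO_COLLECTION))
```

Returning plain `(source_id, collection_id)` pairs keeps the output easy to dump to a CSV for review before anything is changed.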

Once we review that list it could be processed automatically by another script to make fixes (if that is possible via API).

rahulbot added the enhancement label Dec 6, 2024

rahulbot commented Dec 6, 2024

Slight overlap with topic of #2, but not quite close enough to tackle together.

rahulbot commented

I saw the first pass results. This is far more than I had hoped -- my initial thought was to review and then automatically add them. However, we can't review 1300+ manually.

Can you add stories-per-week to this CSV so that we can prioritize by volume of content?


m453h commented Apr 2, 2025

@rahulbot I have updated the script and this is the output


rahulbot commented Apr 2, 2025

Moving to me for the next step. We will manually review the 205 sources that have more than one story per week and mark whether the correct_collection_name is right or not. If that all looks good I'll bump it back over to the tech team to run the fix in batch.

updated_sources_not_in_any_collection_with_stories.csv
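
As a rough illustration of the prioritization step, the snippet below filters a CSV down to sources with more than one story per week and sorts them by volume. It is a sketch under assumptions: only `correct_collection_name` is a column name taken from this thread; `source_id`, `name`, and `stories_per_week` are guessed headers, and the sample rows are made up.

```python
import csv
import io


def rows_to_review(csv_text, min_stories_per_week=1.0):
    """Return rows above the volume threshold, highest volume first."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [
        row for row in reader
        if float(row["stories_per_week"] or 0) > min_stories_per_week
    ]
    rows.sort(key=lambda row: float(row["stories_per_week"]), reverse=True)
    return rows


# Made-up sample data; the column names other than
# correct_collection_name are assumptions.
sample = """source_id,name,correct_collection_name,stories_per_week
1,Paper A,"Massachusetts, United States - State & Local",12.5
2,Paper B,"Alaska, United States - State & Local",0.4
3,Paper C,"Massachusetts, United States - State & Local",3
"""

for row in rows_to_review(sample):
    print(row["source_id"], row["stories_per_week"], row["correct_collection_name"])
```

For the real file, the same function could be fed the text of the attached CSV once the actual header names are confirmed.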

rahulbot assigned rahulbot and unassigned m453h Apr 2, 2025
pgulley added the Paused and data-quality labels and removed the enhancement label Apr 9, 2025