This repository has been archived by the owner on Dec 14, 2023. It is now read-only.

evaluate and improve feed_seeker module #394

Closed
hroberts opened this issue May 31, 2018 · 3 comments
Comments

@hroberts
Contributor

The goal of feed discovery is to find the smallest possible set of feeds that return all of the stories from a given media source. So if a source has a single 'all stories' feed, we would ideally return just that feed. If a source has no 'all stories' feeds but has a bunch of feeds for independent sections, we should return all of those feeds. Getting a set of feeds that represents all stories for a source is the priority over returning the minimal number of feeds, but if there are shortcuts that return a single encompassing feed with high accuracy, we should use them.

We have a new RSS feed discovery module that begins work in this direction, but its developer has left. The code is mostly functional, but it is not clear to us how well it works, either in the sense of running without runtime errors or in the sense of producing an accurate list of RSS feeds for a given site.

The new code is here:

https://github.com/mitmedialab/feed_seeker

The first task is to verify that the module works as documented and to fix it where it doesn't. We tried to use it recently and discovered that some sites made it run basically forever and that the timeout parameter does not in fact make it time out. The Daily Mail, for example, ran for four hours before we had to kill it.
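Until the timeout parameter is fixed inside feed_seeker itself, a hard cutoff can be imposed from the outside. A minimal sketch, assuming a Unix system and that the call runs in the main thread (SIGALRM restrictions):

```python
import signal

class HardTimeout(Exception):
    """Raised when the wrapped call exceeds its time budget."""

def run_with_timeout(fn, seconds, *args, **kwargs):
    # Interrupt fn after `seconds` via SIGALRM. Unix-only, main thread only,
    # whole-second resolution -- but unlike a cooperative timeout parameter,
    # it fires even while fn is blocked inside network I/O or parsing.
    def _alarm(signum, frame):
        raise HardTimeout(f"call exceeded {seconds}s")
    previous = signal.signal(signal.SIGALRM, _alarm)
    signal.alarm(seconds)
    try:
        return fn(*args, **kwargs)
    finally:
        signal.alarm(0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, previous)
```

For a site like the Daily Mail this would cap a discovery call at, say, five minutes: `run_with_timeout(discover, 300, 'http://dailymail.co.uk')`, where `discover` stands in for whatever feed_seeker entry point is being exercised.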

The second task is to develop a training and evaluation set that lets us validate the accuracy of the feed discovery. To generate the set, manually search for the feeds of a random sample of 50 sources in each of the following collections:

https://sources.mediacloud.org/#/collections/58722749
https://sources.mediacloud.org/#/collections/9272347
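Drawing the 50-source sample can be done reproducibly once the source URLs for a collection are in hand (e.g. exported from the sources tool above). A small sketch; the seed value is an arbitrary choice:

```python
import random

def sample_sources(source_urls, n=50, seed=394):
    # Fixed seed so the evaluation set is reproducible across runs;
    # sorted output makes the sample easy to diff and review.
    rng = random.Random(seed)
    return sorted(rng.sample(source_urls, min(n, len(source_urls))))
```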

When manually searching for feeds, I have found the best strategies, in decreasing effectiveness, to be:

  • search Google for the site name plus 'rss', e.g. 'nytimes.com rss'
  • search for the feed on feedly.com
  • manually look around the site for the RSS feed

After generating that evaluation set, run the feed_seeker code to generate a set of feeds for each of the same sources. Record how long it takes feed_seeker to discover the feeds for each media source.
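Recording per-source runtimes is a one-liner around the discovery call; a sketch, where `discover_fn` stands in for the feed_seeker entry point actually used:

```python
import time

def timed_discovery(discover_fn, source_url):
    # Returns (feeds, elapsed_seconds). discover_fn is assumed to yield
    # or return feed URLs for a source homepage.
    start = time.monotonic()
    feeds = list(discover_fn(source_url))
    return feeds, time.monotonic() - start
```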

Then do a best-effort comparison of the manual vs. feed_seeker feeds for each source. For each source, indicate whether the feed_seeker feeds contain all, most, some, or none of the stories for the source. If there is no feed for the source, indicate 'no feed'. Separately indicate whether the feed_seeker results included feeds that return stories that do not belong to the source (for example, feed_seeker run on nytimes.com returning a wapo RSS feed). Generate precision and recall numbers for each of the above collections based on this evaluation.
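One way to turn those judgments into numbers, treating the manually found feeds as ground truth (an assumption; near-duplicate feed URLs would need normalization first):

```python
def precision_recall(discovered, ground_truth):
    # Set-based precision/recall over feed URLs for one source.
    # Precision penalizes extra/irrelevant feeds; recall penalizes misses.
    discovered, ground_truth = set(discovered), set(ground_truth)
    true_positives = len(discovered & ground_truth)
    precision = true_positives / len(discovered) if discovered else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall
```

Collection-level numbers would then be averages of these per-source values.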

After generating the initial accuracy metrics, try to improve feed_seeker to do a better job of discovering feeds. A couple of things that I suspect will improve the feed_seeker performance are:

  • try to guess using url semantics first instead of last (e.g. for nytimes.com guess nytimes.com/rss) and just use that single feed if it is parseable and is not empty (or maybe play with requiring some minimum number of stories to assume that it is a full feed).

  • use feedly.com to search for possible rss feeds. Last time I checked, feedly provided this functionality in their free, unauthenticated api. And I commonly find in feedly rss feeds that are neither published anywhere I can see on a given site nor found by the 'rss nytimes.com' google search. I assume that's because feedly is basically a crowd-sourced discovery platform -- all it requires is one person to find the rss feed, and then it is in the system for everyone to find.
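The first suggestion (guess common feed paths before spidering) can be sketched as below. The path list and the minimum-story threshold are illustrative guesses, and the feed fetcher is injected so the logic is testable without network access; in practice something like feedparser would fill that role:

```python
# Common feed locations, tried in rough order of likelihood; this list
# is an illustrative guess, not exhaustive.
COMMON_FEED_PATHS = ("/rss", "/feed", "/feeds", "/rss.xml", "/atom.xml", "/index.xml")

def guess_feed(base_url, fetch_entries, min_stories=5):
    """Return the first guessed feed URL carrying at least min_stories items.

    fetch_entries(url) should return the feed's entry list, or raise /
    return [] when the URL is not a usable feed. min_stories=5 is an
    arbitrary floor for "probably a full feed", per the issue's suggestion
    of requiring some minimum number of stories.
    """
    base = base_url.rstrip("/")
    for path in COMMON_FEED_PATHS:
        try:
            entries = fetch_entries(base + path)
        except Exception:
            continue  # dead URL or unparseable feed; try the next guess
        if len(entries) >= min_stories:
            return base + path
    return None  # no guess worked; fall back to spidering the site
```

A feedly lookup could be bolted on as another candidate generator, but the details of feedly's unauthenticated search API are not verified here.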

@rahulbot
Contributor

rahulbot commented Jun 1, 2018

A few notes:

  1. The work to generate a list of 50 random sources and manually discover all the RSS feeds you can has already been done, by Anissa. The source data and her results are captured on validate new feed_scraper #333.
  2. The task about checking feeds for "all, most, some, or none" of the stories is a little unclear to me. My guess is that it means:
    a. pick a specific day and pull the static feed files via both manual discovery and feed_seeker
    b. make two lists of story URLs for each site (one for URLs from manually discovered RSS feeds and one for URLs from feed_seeker-discovered feeds)
    c. compare the overlap of those two lists for each source
  3. Shouldn't this be an issue on the feed_seeker repo, as it doesn't use any media cloud data (except the static source lists already generated)?
  4. The final step should be to write up the evaluation as a blog post. This will be useful as a concrete output to respond to / comment on, and also as something we can refer people to going forward when they ask us about how we do/validate RSS discovery.
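The overlap comparison in step 2c reduces to a coverage ratio over story URLs, bucketed into the issue's categories. A sketch; the 0.75 cutoff for "most" is an illustrative assumption:

```python
def coverage_label(manual_story_urls, seeker_story_urls):
    # Fraction of manually discovered stories also reachable via the
    # feed_seeker feeds, bucketed into all / most / some / none.
    manual = set(manual_story_urls)
    if not manual:
        return "no feed"  # no ground-truth stories for this source
    covered = len(manual & set(seeker_story_urls)) / len(manual)
    if covered == 1.0:
        return "all"
    if covered >= 0.75:  # threshold is an arbitrary choice
        return "most"
    if covered > 0.0:
        return "some"
    return "none"
```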

@hroberts
Contributor Author

hroberts commented Jun 1, 2018

Thanks for the comments.

I have added a 'fix the timeout issue' task and an 'evaluate and improve accuracy' task to the feed_seeker issues board. The second task addresses your comments. Most importantly, it makes clear that I strongly think best-guess estimates of all/most/some/none are the right way to do this. Otherwise, I think it would become an endless task (story matching is hard!).

@hroberts hroberts closed this as completed Jun 1, 2018
@rahulbot
Contributor

rahulbot commented Jun 1, 2018
