This repository has been archived by the owner on Dec 14, 2023. It is now read-only.

evaluate and improve feed_seeker module #394

Closed
hroberts opened this issue May 31, 2018 · 3 comments
Comments

@hroberts
Contributor

The goal of feed discovery is to find the smallest possible set of feeds that return all of the stories from a given media source. So if a source has a single 'all stories' feed, we would ideally return just that feed. If a source has no 'all stories' feeds but has a bunch of feeds for independent sections, we should return all of those feeds. Getting a set of feeds that represents all stories for a source is the priority over returning the minimal number of feeds, but if there are shortcuts that return a single encompassing feed with high accuracy, we should use them.

We have a new RSS feed discovery module that begins work in this direction, but its developer has left. The code is mostly functional, but it is not clear to us how well it works, either in the sense of running without runtime errors or in the sense of producing an accurate list of RSS feeds for a given site.

The new code is here:

https://github.com/mitmedialab/feed_seeker

The first task is to verify that the module works as documented and to fix it where it doesn't. We tried to use it recently and discovered that some sites made it run basically forever and that the timeout parameter does not in fact make it time out. The Daily Mail, for example, ran for four hours before we had to kill it.
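Until the timeout parameter is fixed inside feed_seeker itself, a hard cutoff can be imposed from the outside. A minimal sketch, assuming a Unix system and that the call runs in the main thread (SIGALRM restrictions):

```python
import signal

class HardTimeout(Exception):
    """Raised when the wrapped call exceeds its time budget."""

def run_with_timeout(fn, seconds, *args, **kwargs):
    # Interrupt fn after `seconds` via SIGALRM. Unix-only, main thread only,
    # whole-second resolution -- but unlike a cooperative timeout parameter,
    # it fires even while fn is blocked inside network I/O or parsing.
    def _alarm(signum, frame):
        raise HardTimeout(f"call exceeded {seconds}s")
    previous = signal.signal(signal.SIGALRM, _alarm)
    signal.alarm(seconds)
    try:
        return fn(*args, **kwargs)
    finally:
        signal.alarm(0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, previous)
```

For a site like the Daily Mail this would cap a discovery call at, say, five minutes: `run_with_timeout(discover, 300, 'http://dailymail.co.uk')`, where `discover` stands in for whatever feed_seeker entry point is being exercised.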

The second task is to develop a training and evaluation set that lets us validate the accuracy of the feed discovery. To generate the set, manually search for the feeds of a random sample of 50 sources in each of the following collections:

https://sources.mediacloud.org/#/collections/58722749
https://sources.mediacloud.org/#/collections/9272347
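Drawing the 50-source sample can be done reproducibly once the source URLs for a collection are in hand (e.g. exported from the sources tool above). A small sketch; the seed value is an arbitrary choice:

```python
import random

def sample_sources(source_urls, n=50, seed=394):
    # Fixed seed so the evaluation set is reproducible across runs;
    # sorted output makes the sample easy to diff and review.
    rng = random.Random(seed)
    return sorted(rng.sample(source_urls, min(n, len(source_urls))))
```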

When manually searching for feeds, I have found the best strategies, in decreasing effectiveness, to be:

  • search Google for the site name plus 'rss', e.g. 'nytimes.com rss'
  • search for the feed on feedly.com
  • manually look around the site for the RSS feed

After generating that evaluation set, run the feed_seeker code to generate a set of feeds for each of the same sources. Record how long it takes feed_seeker to discover the feeds for each media source.
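Recording per-source runtimes is a one-liner around the discovery call; a sketch, where `discover_fn` stands in for the feed_seeker entry point actually used:

```python
import time

def timed_discovery(discover_fn, source_url):
    # Returns (feeds, elapsed_seconds). discover_fn is assumed to yield
    # or return feed URLs for a source homepage.
    start = time.monotonic()
    feeds = list(discover_fn(source_url))
    return feeds, time.monotonic() - start
```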

Then do a best-effort comparison of the manual vs. feed_seeker feeds for each source. For each source, indicate whether the feed_seeker feeds contain all, most, some, or none of the stories for the source. If there is no feed for the source, indicate 'no feed'. Separately indicate whether the feed_seeker results included feeds that return stories that do not belong to the source (for example, feed_seeker run on nytimes.com returning a wapo RSS feed). Generate precision and recall numbers for each of the above collections based on this evaluation.
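One way to turn those judgments into numbers, treating the manually found feeds as ground truth (an assumption; near-duplicate feed URLs would need normalization first):

```python
def precision_recall(discovered, ground_truth):
    # Set-based precision/recall over feed URLs for one source.
    # Precision penalizes extra/irrelevant feeds; recall penalizes misses.
    discovered, ground_truth = set(discovered), set(ground_truth)
    true_positives = len(discovered & ground_truth)
    precision = true_positives / len(discovered) if discovered else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall
```

Collection-level numbers would then be averages of these per-source values.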

After generating the initial accuracy metrics, try to improve feed_seeker to do a better job of discovering feeds. A couple of things that I suspect will improve the feed_seeker performance are:

  • try to guess using url semantics first instead of last (e.g. for nytimes.com guess nytimes.com/rss) and just use that single feed if it is parseable and is not empty (or maybe play with requiring some minimum number of stories to assume that it is a full feed).

  • use feedly.com to search for possible rss feeds. Last time I checked, feedly provided this functionality in their free, unauthenticated api. And I commonly find in feedly rss feeds that are neither published anywhere I can see on a given site nor found by the 'rss nytimes.com' google search. I assume that's because feedly is basically a crowd-sourced discovery platform -- all it requires is one person to find the rss feed, and then it is in the system for everyone to find.
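The first suggestion (guess common feed paths before spidering) can be sketched as below. The path list and the minimum-story threshold are illustrative guesses, and the feed fetcher is injected so the logic is testable without network access; in practice something like feedparser would fill that role:

```python
# Common feed locations, tried in rough order of likelihood; this list
# is an illustrative guess, not exhaustive.
COMMON_FEED_PATHS = ("/rss", "/feed", "/feeds", "/rss.xml", "/atom.xml", "/index.xml")

def guess_feed(base_url, fetch_entries, min_stories=5):
    """Return the first guessed feed URL carrying at least min_stories items.

    fetch_entries(url) should return the feed's entry list, or raise /
    return [] when the URL is not a usable feed. min_stories=5 is an
    arbitrary floor for "probably a full feed", per the issue's suggestion
    of requiring some minimum number of stories.
    """
    base = base_url.rstrip("/")
    for path in COMMON_FEED_PATHS:
        try:
            entries = fetch_entries(base + path)
        except Exception:
            continue  # dead URL or unparseable feed; try the next guess
        if len(entries) >= min_stories:
            return base + path
    return None  # no guess worked; fall back to spidering the site
```

A feedly lookup could be bolted on as another candidate generator, but the details of feedly's unauthenticated search API are not verified here.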

@rahulbot
Contributor

rahulbot commented Jun 1, 2018

A few notes:

  1. The work to generate a list of 50 random sources and manually discover all the RSS feeds you can has already been done, by Anissa. The source data and her results are captured on validate new feed_scraper #333.
  2. The task about checking feeds for "all, most, some, or none" of the stories is a little unclear to me. My guess is that it means:
    a. pick a specific day and pull the static feed files via both manual discovery and feed_seeker
    b. make two lists of story URLs for each site (one for URLs from manually discovered RSS feeds and one for URLs from feed_seeker-discovered feeds)
    c. compare the overlap of those two lists for each source
  3. Shouldn't this be an issue on the feed_seeker repo, as it doesn't use any media cloud data (except the static source lists already generated)?
  4. The final step should be to write up the evaluation as a blog post. This will be useful as a concrete output to respond to / comment on, and also as something we can refer people to going forward when they ask us about how we do/validate RSS discovery.
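The overlap comparison in step 2c reduces to a coverage ratio over story URLs, bucketed into the issue's categories. A sketch; the 0.75 cutoff for "most" is an illustrative assumption:

```python
def coverage_label(manual_story_urls, seeker_story_urls):
    # Fraction of manually discovered stories also reachable via the
    # feed_seeker feeds, bucketed into all / most / some / none.
    manual = set(manual_story_urls)
    if not manual:
        return "no feed"  # no ground-truth stories for this source
    covered = len(manual & set(seeker_story_urls)) / len(manual)
    if covered == 1.0:
        return "all"
    if covered >= 0.75:  # threshold is an arbitrary choice
        return "most"
    if covered > 0.0:
        return "some"
    return "none"
```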

@hroberts
Contributor Author

hroberts commented Jun 1, 2018

Thanks for the comments.

I have added a 'fix the timeout issue' task and an 'evaluate and improve accuracy' task to the feed_seeker issues board. The second task addresses your comments. Most importantly, it makes clear that I strongly think best-guess estimates of all/most/some/none are the right way to do this. Otherwise, I think it would become an endless task (story matching is hard!).

@hroberts hroberts closed this as completed Jun 1, 2018
@rahulbot
Contributor

rahulbot commented Jun 1, 2018
