evaluate and improve accuracy of feed discovery #3

hroberts · 2018-06-01T14:29:25Z

This task is to develop a training and evaluation set that lets us validate the accuracy of the feed discovery, and then to improve the heuristics of the feed_seeker module to increase that accuracy.

The goal of feed discovery is to find the smallest possible set of feeds that return all of the stories from a given media source. So if a source has a single 'all stories' feed, we would ideally return just that feed. If a source has no 'all stories' feeds but has a bunch of feeds for independent sections, we should return all of those feeds. Getting a set of feeds that represents all stories for a source is the priority over returning the minimal number of feeds, but if there are shortcuts that return a single encompassing feed with high accuracy, we should use them.

To judge the accuracy of the feed discovery process, use this set of manually discovered feeds:

mediacloud/backend#333

Run the feed_seeker code to generate a set of feeds for each of the sources in the above set. Record how long it takes feed_seeker to discover the feeds for each media source.
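A minimal sketch of the timing step, in case it helps. The `discover` argument is a stand-in for whatever feed_seeker entry point we end up calling (assumed here to be a function that takes a source url and returns or yields feed urls); `time_discovery` itself is a hypothetical helper name, not something that exists in the module.

```python
import time
from typing import Callable, Dict, Iterable, List, Tuple


def time_discovery(
    source_urls: Iterable[str],
    discover: Callable[[str], Iterable[str]],
) -> Dict[str, Tuple[List[str], float]]:
    """Run `discover` on each source; record the feeds found and elapsed seconds.

    `discover` is an assumption: any callable mapping a source url to feed urls.
    """
    results: Dict[str, Tuple[List[str], float]] = {}
    for url in source_urls:
        start = time.perf_counter()
        try:
            # list() forces a lazy generator to run to completion inside the timer
            feeds = list(discover(url))
        except Exception:
            # Treat a failed discovery as "no feeds found" but still record the time
            feeds = []
        elapsed = time.perf_counter() - start
        results[url] = (feeds, elapsed)
    return results
```

The per-source results can then be dumped to a csv next to the manually discovered feeds for the comparison step.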

Then do a best-effort comparison of the manual vs. feed_seeker feeds for each source. For each source, indicate whether the feed_seeker feeds contain all, most, some, or none of the stories for the source. If there is no feed for the source, indicate 'no feed'. Separately indicate whether the feed_seeker results included feeds that return stories that do not belong to the source (for example, if running feed_seeker on nytimes.com returns a wapo rss feed). Generate precision and recall numbers for each of the above collections based on this evaluation.
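One possible way to turn the eyeball ratings into numbers, as a sketch only. The mapping is an assumption, not something settled in this issue: recall counts a source as covered when its rating is 'all' or 'most' (among sources that actually have a feed), and precision counts a source as clean when feed_seeker returned feeds and none of them belong to another source. `SourceEval` and `precision_recall` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class SourceEval:
    source: str
    coverage: str            # 'all', 'most', 'some', 'none', or 'no feed'
    returned_feeds: bool     # did feed_seeker return any feeds for this source?
    has_foreign_feeds: bool  # did any returned feed belong to another source?


def precision_recall(
    evals: Iterable[SourceEval],
    covered: Tuple[str, ...] = ("all", "most"),  # assumed threshold for "covered"
) -> Tuple[float, float]:
    """Return (precision, recall) under the assumed mapping described above."""
    evals = list(evals)
    # Recall is only defined over sources that have a feed to find
    with_feed = [e for e in evals if e.coverage != "no feed"]
    # Precision is only defined over sources where feed_seeker returned something
    returned = [e for e in evals if e.returned_feeds]
    recall = (
        sum(e.coverage in covered for e in with_feed) / len(with_feed)
        if with_feed else 0.0
    )
    precision = (
        sum(not e.has_foreign_feeds for e in returned) / len(returned)
        if returned else 0.0
    )
    return precision, recall
```

Changing `covered` to `("all",)` gives a stricter recall number, which may be worth reporting alongside the looser one.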

Just use a best guess eyeball estimate to determine the all/most/some/none score for each source. We don't need to try to directly compare lists of individual stories. Just eyeball the set of stories in the manually discovered feeds vs. the stories in the feed_seeker return feeds and make your best estimate about the feed_seeker coverage.

After generating the initial accuracy metrics, try to improve feed_seeker to do a better job of discovering feeds. A couple of things that I suspect will improve the feed_seeker performance are:

try to guess using url semantics first instead of last (e.g. for nytimes.com guess nytimes.com/rss) and just use that single feed if it is parseable and is not empty (or maybe play with requiring some minimum number of stories to assume that it is a full feed).
use feedly.com to search for possible rss feeds. Last time I checked, feedly provided this functionality in their free, unauthenticated api. And I commonly find rss feeds in feedly that are neither published anywhere I can see on a given site nor found by the 'rss nytimes.com' google search. I assume that's because feedly is basically a crowd-sourced discovery platform: all it requires is one person to find the rss feed, and then it is in the system for everyone to find.
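A rough sketch of the guess-first idea, under stated assumptions. The candidate paths in `GUESS_PATHS` are illustrative guesses beyond the /rss example above, `count_stories` is a hypothetical callable that fetches a url and returns the number of parsed stories (0 if it is not a parseable feed), and `fallback` stands in for the existing feed_seeker discovery. None of these names come from the module itself.

```python
from typing import Callable, Iterable, List
from urllib.parse import urljoin

# Hypothetical candidate paths; only "/rss" comes from the example above
GUESS_PATHS = ("/rss", "/feed", "/rss.xml", "/feed.xml")


def discover_feeds(
    source_url: str,
    count_stories: Callable[[str], int],
    fallback: Callable[[str], Iterable[str]],
    min_stories: int = 1,
) -> List[str]:
    """Try guessed feed urls first; fall back to full discovery if none qualifies."""
    for path in GUESS_PATHS:
        candidate = urljoin(source_url, path)
        try:
            n = count_stories(candidate)
        except Exception:
            # Unreachable or unparseable candidate: move on to the next guess
            continue
        if n >= min_stories:
            # Assume a guessed feed with enough stories is the 'all stories' feed
            return [candidate]
    return list(fallback(source_url))
```

Raising `min_stories` is the knob for the "minimum number of stories" experiment; the evaluation above should show whether a single guessed feed really covers the whole source.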

The weakness of this approach is that we will be overfitting the heuristics to the particular set of 50 feeds above. To get a true accuracy score, we'll need to repeat the evaluation process with a new set of randomly sampled sources. Let's just do the initial evaluation and improvements first and then consider whether we want to do another full evaluation.

hroberts · 2018-06-01T14:30:45Z

The ultimate output of this work should be:

improvements to the feed_seeker module to improve its accuracy and
a blog post describing the evaluation process and results.

You will want to take notes throughout the process to make it easier to write the blog post at the end.

rahulbot mentioned this issue Jun 1, 2018

evaluate and improve feed_seeker module mediacloud/backend#394

Closed

pypt mentioned this issue Mar 25, 2019

GSoC 2019 ideas mediacloud/backend#568

Closed

pushshift self-assigned this May 7, 2019

pushshift added the good first issue Good for newcomers label May 7, 2019
