evaluate and improve accuracy of feed discovery #3

Open
hroberts opened this issue Jun 1, 2018 · 1 comment
Labels
good first issue Good for newcomers

Comments

@hroberts
Collaborator

hroberts commented Jun 1, 2018

This task is to develop a training and evaluation set that lets us validate the accuracy of feed discovery, and then to improve the heuristics of the feed_seeker module to raise that accuracy.

The goal of feed discovery is to find the smallest possible set of feeds that return all of the stories from a given media source. So if a source has a single 'all stories' feed, we would ideally return just that feed. If a source has no 'all stories' feeds but has a bunch of feeds for independent sections, we should return all of those feeds. Getting a set of feeds that represents all stories for a source is the priority over returning the minimal number of feeds, but if there are shortcuts that return a single encompassing feed with high accuracy, we should use them.
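To make the "smallest set of feeds that covers all stories" goal concrete, here is a minimal sketch that treats it as a set cover problem and solves it greedily. It is only an illustration: the function name and the idea of having a set of story ids per feed are assumptions, feed_seeker does not work this way today, and a greedy pick is an approximation that is not guaranteed to be minimal.

```python
def smallest_covering_feeds(feed_stories):
    """Greedy set cover over a dict of feed url -> set of story ids.

    Hypothetical helper for illustration; not part of feed_seeker.
    """
    # Every story that appears in at least one feed must end up covered.
    uncovered = set().union(*feed_stories.values())
    chosen = []
    while uncovered:
        # Pick the feed that covers the most still-uncovered stories.
        best = max(feed_stories, key=lambda f: len(feed_stories[f] & uncovered))
        chosen.append(best)
        uncovered -= feed_stories[best]
    return chosen


# An 'all stories' feed wins over the section feeds it subsumes.
feeds = {
    "https://example.com/rss": {1, 2, 3, 4},
    "https://example.com/politics/rss": {1, 2},
    "https://example.com/sports/rss": {3, 4},
}
print(smallest_covering_feeds(feeds))
```

With only the two section feeds available, the same function returns both of them, which matches the "return all of those feeds" case above.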

To judge the accuracy of the feed discovery process, use this set of manually discovered feeds:

mediacloud/backend#333

Run the feed_seeker code to generate a set of feeds for each of the sources in the above set. Record how long it takes feed_seeker to discover the feeds for each media source.
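A rough sketch of the timing run, in case it helps. The discovery function is passed in as a parameter so the harness does not depend on any particular feed_seeker call; `time_discovery` and the shape of the result dict are made up for this example.

```python
import time


def time_discovery(sources, discover):
    """Run discover(url) for each source and record feeds, seconds, and errors.

    `discover` is any callable returning an iterable of feed urls
    (hypothetical interface for this sketch).
    """
    results = {}
    for source_url in sources:
        start = time.perf_counter()
        try:
            feeds = list(discover(source_url))
            error = None
        except Exception as exc:
            # Keep going so one bad source does not abort the whole run.
            feeds, error = [], repr(exc)
        results[source_url] = {
            "feeds": feeds,
            "seconds": time.perf_counter() - start,
            "error": error,
        }
    return results
```

To run it against the real module you would pass feed_seeker's discovery function as `discover`. I believe that is `generate_feed_urls`, but treat that name as an assumption and check the module first.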

Then do a best effort comparison of the manual vs. feed_seeker feeds for each source. For each source, indicate whether the feed_seeker feeds contain all, most, some, or none of the stories for the source. If there is no feed for the source, indicate 'no feed'. Separately indicate whether the feed_seeker results included feeds that return stories that do not belong to the source (for example, if running feed_seeker on nytimes.com returns a wapo rss feed). Generate precision and recall numbers for each of the above collections based on this evaluation.

Just use a best guess eyeball estimate to determine the all/most/some/none score for each source. We don't need to try to directly compare lists of individual stories. Just eyeball the set of stories in the manually discovered feeds vs. the stories in the feed_seeker return feeds and make your best estimate about the feed_seeker coverage.
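One possible way to turn the eyeball labels into numbers is sketched below. The definitions are my assumptions, not something this issue fixes: recall is the share of sources with a feed whose coverage was 'all' or 'most', precision is the share of those sources with no foreign feeds in the results, and 'no feed' sources are left out of both denominators. The function name and record layout are hypothetical.

```python
from collections import Counter


def summarize(evaluations, covered_labels=("all", "most")):
    """Summarize per-source labels.

    Each evaluation is a dict with:
      'coverage': 'all' | 'most' | 'some' | 'none' | 'no feed'
      'foreign':  True if feed_seeker returned a feed from another source
    """
    counts = Counter(e["coverage"] for e in evaluations)
    # Assumption: sources with no feed at all are excluded from both metrics.
    scored = [e for e in evaluations if e["coverage"] != "no feed"]
    if not scored:
        return {"counts": counts, "recall": None, "precision": None}
    covered = sum(1 for e in scored if e["coverage"] in covered_labels)
    clean = sum(1 for e in scored if not e["foreign"])
    return {
        "counts": counts,
        "recall": covered / len(scored),
        "precision": clean / len(scored),
    }
```

Whether 'most' should count as covered is a judgment call, so `covered_labels` is left adjustable.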

After generating the initial accuracy metrics, try to improve feed_seeker to do a better job of discovering feeds. A couple of things that I suspect will improve the feed_seeker performance are:

  • try to guess using url semantics first instead of last (e.g. for nytimes.com guess nytimes.com/rss) and just use that single feed if it is parseable and not empty (or maybe play with requiring some minimum number of stories before assuming that it is a full feed).

  • use feedly.com to search for possible rss feeds. Last time I checked, feedly provided this functionality in their free, unauthenticated api. And I commonly find rss feeds in feedly that are neither published anywhere I can see on a given site nor found by the 'rss nytimes.com' google search. I assume that's because feedly is basically a crowd-sourced discovery platform: all it requires is one person to find the rss feed, and then it is in the system for everyone to find.
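A minimal sketch of the first idea (guess by url semantics, accept the feed only if it parses and has enough stories). The list of paths, the function names, the `min_entries` default, and the injected `fetch` callable are all assumptions for illustration; none of this is existing feed_seeker code.

```python
import xml.etree.ElementTree as ET

# Hypothetical list of common feed paths; tune against the evaluation set.
COMMON_FEED_PATHS = ("/rss", "/feed", "/rss.xml", "/feed.xml", "/atom.xml")


def candidate_feed_urls(domain):
    """Build guessed feed urls for a bare domain such as 'nytimes.com'."""
    base = "https://" + domain.strip("/")
    return [base + path for path in COMMON_FEED_PATHS]


def count_feed_entries(xml_text):
    """Count RSS <item> and Atom <entry> elements; 0 if the text is not XML."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return 0
    # Drop any '{namespace}' prefix so namespaced Atom entries are counted too.
    return sum(
        1 for el in root.iter() if el.tag.rsplit("}", 1)[-1] in ("item", "entry")
    )


def guess_full_feed(domain, fetch, min_entries=10):
    """Return the first guessed url whose body has at least min_entries stories.

    `fetch(url)` should return the response text, or None on failure.
    """
    for url in candidate_feed_urls(domain):
        body = fetch(url)
        if body and count_feed_entries(body) >= min_entries:
            return url
    return None
```

Passing `fetch` in keeps the heuristic testable without network access; in practice it would wrap whatever http client the module already uses.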

The weakness of this approach is that we will be overfitting the heuristics to the particular set of 50 feeds above. To get a true accuracy score, we'll need to repeat the evaluation process with a new set of randomly sampled sources. Let's just do the initial evaluation and improvements first and then consider whether we want to do another full evaluation.

@hroberts
Collaborator Author

hroberts commented Jun 1, 2018

The ultimate output of this work should be:

  • improvements to the feed_seeker module to improve its accuracy and
  • a blog post describing the evaluation process and results.

You will want to take notes throughout the process to make it easier to write the blog post at the end.

2 participants