GitHub - mrisdal/bs-news-scraper: Code and documentation for the Getting Real about Fake News Kaggle dataset.

Overview

This repository contains documentation for the Getting Real about Fake News dataset published on Kaggle.

Dataset Description

The latest hot topic in the news is fake news, and many are wondering what data scientists can do to detect it and stymie its viral spread. This dataset is only a first step in understanding and tackling this problem. It contains text and metadata scraped from 244 websites tagged as "bullshit" by the BS Detector Chrome Extension by Daniel Sieradski.

Warning: I did not modify the list of news sources from the BS Detector so as not to introduce my (useless) layer of bias; I'm not an authority on fake news. There may be sources whose inclusion you disagree with. It's up to you to decide how to work with the data and how you might contribute to "improving it". The labels of "bs" and "junksci", etc. do not constitute capital "t" Truth. If there are other sources you would like to include, start a discussion. If there are sources you believe should not be included, start a discussion or write a kernel analyzing the data. Or take the data and do something else productive with it. Kaggle's choice to host this dataset is not meant to express any particular political affiliation or intent.

The dataset contains text and metadata from 244 websites and represents 12,999 posts in total from the past 30 days. The data was pulled using the webhose.io API; because it comes from their crawler, not all websites identified by the BS Detector are present in this dataset. Each website was labeled as documented by the BS Detector. Data sources that were missing a label were simply assigned a label of "bs". There are (ostensibly) no genuine, reliable, or trustworthy news sources represented in this dataset (so far), so don't trust anything you read.
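
The labeling rule described above (use the BS Detector label for a site, and fall back to "bs" when a source has no label) can be sketched roughly as follows. This is only an illustration in Python, not the code in get_news_data.R; the field names `site_url` and `type`, the function `label_posts`, and the example domains are all assumptions made for the sketch.

```python
DEFAULT_LABEL = "bs"  # fallback label named in the dataset description


def label_posts(posts, site_labels):
    """Return copies of the posts with a site-level label attached.

    `posts` is a list of dicts with a hypothetical "site_url" key.
    `site_labels` maps a site to its BS Detector label; a missing or
    empty label falls back to DEFAULT_LABEL.
    """
    labeled = []
    for post in posts:
        label = site_labels.get(post["site_url"]) or DEFAULT_LABEL
        # Copy the post so the caller's input is left unchanged.
        labeled.append({**post, "type": label})
    return labeled


# Hypothetical example data, not taken from the dataset.
site_labels = {"example-a.test": "junksci", "example-b.test": ""}
posts = [
    {"site_url": "example-a.test", "title": "Post 1"},
    {"site_url": "example-b.test", "title": "Post 2"},
    {"site_url": "example-c.test", "title": "Post 3"},
]

labeled = label_posts(posts, site_labels)
print([p["type"] for p in labeled])  # ['junksci', 'bs', 'bs']
```

Note that the label is attached per website, not per post, so every post from a given source carries the same label regardless of its content.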

Fake news in the news

For inspiration, I've included some (presumably non-fake) recent stories covering fake news in the news. This is a sensitive, nuanced topic and if there are other resources you'd like to see included here, please leave a suggestion. From defining fake, biased, and misleading news in the first place to deciding how to take action (a blacklist is not a good answer), there's a lot of information to consider beyond what can be neatly arranged in a CSV file.

Improvements

If you have suggestions for improvements or would like to contribute, please let me know. The most obvious extension is to include data from "real" news sites. Even a list of authentic news sources would be helpful to the project. I'd be happy to include any contributions in future versions of the dataset.

Acknowledgements

Thanks to Anthony for pointing me to Daniel Sieradski's BS Detector. Thank you to Daniel Nouri for encouraging me to add a disclaimer to the dataset's page.

Folders and files (1 commit)

.gitignore
README.md
bs_news_urls.json
get_news_data.R
