Recommending News Articles to Twitter Users based on their Tweets

karstenkreis/ReadLikeYouTweet

Read Like You Tweet

A New York Times Article Recommendation System Based On Your Twitter Timeline

Overview (a very detailed description of the recommender can be found in underthehood.ipynb)

This was my final project for General Assembly's Data Science class in New York City in summer 2015. It hasn't been updated or modified since then, but I try to keep the app running. The idea is to recommend New York Times articles to Twitter users based on their tweets. The model was built as follows:

  • I downloaded over 100,000 article snippets from the New York Times Article Search API and categorized them according to their sections
  • I vectorized the text and created text features with a term frequency-inverse document frequency vectorizer
  • I trained a multiclass Logistic Regression classifier to identify the classes
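
The offline steps above could be sketched roughly as follows with scikit-learn. This is a minimal illustration, not the code from underthehood.ipynb: the tiny in-line corpus, the section labels, the short stop-word list, and the hyperparameters are all assumptions chosen to keep the example self-contained.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical stand-in for the downloaded article snippets and their sections.
snippets = [
    "The team won the championship game after a late goal",
    "The striker scored twice and the coach praised the defense",
    "Stocks fell as investors worried about interest rates",
    "The central bank raised interest rates to fight inflation",
    "The senate passed the bill after a long debate",
    "The president vetoed the bill and congress will vote again",
]
sections = ["Sports", "Sports", "Business", "Business", "Politics", "Politics"]

# TF-IDF features followed by a multiclass logistic regression.
# The stop-word list is an illustrative assumption, not the project's list.
model = make_pipeline(
    TfidfVectorizer(stop_words=["a", "and", "after", "as", "the", "to"]),
    LogisticRegression(max_iter=1000),
)
model.fit(snippets, sections)

# Predict the section of an unseen piece of text.
predicted = model.predict(["The bank cut interest rates"])[0]
print(predicted)
```

Wrapping both steps in one pipeline means the same vectorizer is applied at training and prediction time, which matters later when tweets are fed into the model.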

The steps above happened "offline". Just as the words in an article indicate the section it belongs to, the same words in a user's tweets are likely to indicate that the user is interested in news from that section. The trained model can therefore be used to predict a Twitter user's interests.

The program/website does the following:

  • A Twitter user provides her or his Twitter handle
  • With the Twitter API the 100 latest tweets are downloaded
  • These tweets are processed and vectorized in the same way as the article data and fed into the Logistic Regression model
  • This should, hopefully, yield the category the user may want to read news from
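
The tweet processing step might look something like the sketch below. The exact cleaning rules used in the project are not described here, so the regular expressions and the helper names `clean_tweet` and `tweets_to_document` are assumptions for illustration. The idea is to turn the downloaded tweets into one plain-text document that can be passed to the same vectorizer the articles went through.

```python
import re

# Patterns for things that carry little topical information (assumed rules).
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")


def clean_tweet(text: str) -> str:
    """Lowercase a tweet and strip URLs, @mentions, and non-letter characters."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = NON_LETTER_RE.sub(" ", text)
    return " ".join(text.split())  # collapse repeated whitespace


def tweets_to_document(tweets: list[str]) -> str:
    """Join the cleaned tweets into a single document, skipping empty ones."""
    return " ".join(filter(None, (clean_tweet(t) for t in tweets)))


document = tweets_to_document([
    "Great win by @Yankees tonight! https://t.co/abc #Baseball",
    "http://example.com",
])
print(document)
```

The resulting string would then be vectorized and classified exactly like an article snippet.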

The final step:

  • Connect to the New York Times Top Stories API
  • Fetch the top story articles from the section which was predicted by the classifier. This usually yields 30 articles from this section
  • Calculate the Jaccard distance between these articles and the user's tweets
  • Recommend the closest article to the Twitter user
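
The final ranking step can be illustrated with a small pure-Python sketch. The Jaccard distance between two word sets A and B is 1 - |A ∩ B| / |A ∪ B|. The tokenizer and the `title` and `snippet` dictionary keys below are assumptions for the example and do not reflect the actual Top Stories API response format.

```python
import re


def tokenize(text: str) -> set[str]:
    """Split text into a set of lowercase words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard_distance(a: set[str], b: set[str]) -> float:
    """Return 1 - |a & b| / |a | b|; two empty sets count as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def closest_article(tweet_text: str, articles: list[dict]) -> dict:
    """Pick the article whose snippet is nearest to the tweets."""
    tweet_tokens = tokenize(tweet_text)
    return min(
        articles,
        key=lambda art: jaccard_distance(tweet_tokens, tokenize(art["snippet"])),
    )


# Hypothetical articles from the predicted section.
articles = [
    {"title": "A", "snippet": "The bank raised rates to fight inflation"},
    {"title": "B", "snippet": "The team won the game"},
]
best = closest_article("rates inflation bank economy", articles)
print(best["title"])
```

Because the distance only looks at shared vocabulary, longer article text (for example full articles instead of snippets) would give it more words to match against.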

Possible further modifications

There are many possible improvements and extensions:

  • Try to fit a stronger model, possibly using other classifiers
  • Use dimensionality reduction or clustering techniques to gain further insights and/or reduce features
  • Predict several probable labels and do not only recommend from one section but from several probable ones
  • Scrape full articles with web scraping tools like beautifulsoup instead of using only headlines, snippets and keywords. This might help both when training the model and when calculating the Jaccard distances
  • Include newspapers other than the New York Times, both for model training and for recommendation (for example the Guardian, which also has a great API framework)
  • Check where the user comes from (UK, US, Australia) and recommend either from NYT/Guardian US, Guardian UK, or Guardian Australia
  • Extend the system beyond English-speaking Twitter users and English-language newspaper articles
  • Try to get even more user information, for example from Facebook, LinkedIn, etc., to make even better recommendations
  • Also note that in order to stay accurate the system needs to be retrained every now and then with more recent NYT article data (it has never been retrained since 2015).

Files

Note that I did not upload the actual datasets, the pickled logistic regression model, the pickled TF-IDF vectorizer, or the pickled stopwords (the website needs the stopwords to be pickled as well). However, the code can be used to download the data again and retrain the models.

Furthermore, note that the code requires API keys for all of the APIs involved.

Have fun!

Karsten
