Read Like You Tweet

A New York Times Article Recommendation System Based On Your Twitter Timeline

http://readlikeyoutweet.herokuapp.com/

Overview (a very detailed description of the recommender can be found in underthehood.ipynb)

This was my final project for General Assembly's Data Science class in New York City in summer 2015. It hasn't been updated or modified since then, but I try to keep the app running. The idea is to recommend New York Times articles to Twitter users based on their tweets. The recommender was built as follows:

I downloaded over 100,000 article snippets from the New York Times Article Search API and categorized them by section
I vectorized the text and created text features with a term frequency-inverse document frequency (tf-idf) vectorizer
I trained a multiclass Logistic Regression classifier to predict the section an article belongs to
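
The offline steps above can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the code from src/algorithm.py: the toy snippets, the two section labels, and the vectorizer and classifier settings are all assumptions made for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the downloaded snippets and their section labels
# (the real training set has over 100,000 snippets and more sections).
snippets = [
    "The team won the championship game in overtime",
    "The striker scored twice in the final match",
    "Stocks fell as the central bank raised interest rates",
    "The company reported higher quarterly profits and revenue",
]
sections = ["Sports", "Sports", "Business", "Business"]

# tf-idf features feeding a Logistic Regression classifier.
# The settings here are illustrative, not the tuned ones from the project.
model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(max_iter=1000),
)
model.fit(snippets, sections)

# Predict the section of an unseen piece of text.
predicted = model.predict(["The bank raised rates and stocks fell"])[0]
print(predicted)
```

Wrapping the vectorizer and classifier in one pipeline means any new text is transformed with exactly the vocabulary learned during training.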

The above happened "offline". Just as the words in an article indicate the section it belongs to, the same words in a user's tweets are likely to indicate that the user is interested in news from that section. The trained model can therefore be used to predict a Twitter user's interests.

The program/website does the following:

A Twitter user provides her or his Twitter handle
The user's 100 latest tweets are downloaded via the Twitter API
These tweets are processed and vectorized in the same way as the article data and fed into the Logistic Regression model
This should, hopefully, yield the section the user wants to read news from
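
As a rough illustration of the tweet processing step, the sketch below cleans raw tweets and joins them into one document that could be passed to the trained vectorizer. The function names `clean_tweet` and `tweets_to_document`, the cleaning rules, and the choice to concatenate all tweets into a single document are assumptions for this example; the actual preprocessing in src/predictor.py may differ.

```python
import re

def clean_tweet(text):
    """Lowercase a tweet and strip URLs, @mentions, and non-letter characters."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # drop links
    text = re.sub(r"@\w+", " ", text)          # drop mentions
    text = re.sub(r"[^a-z\s]", " ", text)      # drop '#', digits, punctuation
    return " ".join(text.split())              # collapse whitespace

def tweets_to_document(tweets):
    """Join the cleaned tweets into one string, to be vectorized like an article."""
    return " ".join(clean_tweet(t) for t in tweets)

tweets = [
    "Great game by @nyknicks tonight! https://t.co/abc123",
    "Can't wait for the #NBA finals",
]
document = tweets_to_document(tweets)
print(document)
```

The resulting string would then go through the same tf-idf vectorizer that was fitted on the article data, so that tweets and articles share one feature space.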

The final step:

Connect to the New York Times Top Stories API
Fetch the top stories from the section predicted by the classifier. This usually yields 30 articles
Calculate the Jaccard distance between these articles and the user's tweets
Recommend the closest article to the Twitter user
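
The Jaccard step can be sketched as below. The distance between two token sets A and B is 1 - |A ∩ B| / |A ∪ B|, so texts sharing more words are closer. The whitespace tokenizer, the helper names, and the sample texts are assumptions for illustration; the project's own tokenization and stopword handling are not reproduced here.

```python
def tokens(text):
    """Split text into a set of lowercase word tokens (naive whitespace split)."""
    return set(text.lower().split())

def jaccard_distance(a, b):
    """Return 1 - |A & B| / |A | B| for two token sets."""
    union = a | b
    if not union:
        # Convention chosen for this sketch: two empty texts are maximally distant.
        return 1.0
    return 1.0 - len(a & b) / len(union)

def closest_article(tweet_text, articles):
    """Return the article whose tokens have the smallest Jaccard distance to the tweets."""
    tweet_tokens = tokens(tweet_text)
    return min(articles, key=lambda art: jaccard_distance(tweet_tokens, tokens(art)))

articles = [
    "knicks win playoff game in overtime",
    "new museum exhibit opens downtown",
]
best = closest_article("what a game by the knicks in overtime", articles)
print(best)
```

In this toy case the first article shares four tokens with the tweet text out of ten distinct tokens overall, giving a distance of 0.6, while the second shares none and has distance 1.0.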

Possible further modifications

There are many possible improvements and extensions:

Try to fit a stronger model, possibly using other classifiers
Use dimensionality reduction or clustering techniques to gain further insights and/or reduce features
Predict several probable labels and do not only recommend from one section but from several probable ones
Scrape whole articles with a web scraping tool like Beautiful Soup instead of using only headlines, snippets and keywords. This might help both when training the model and when calculating the Jaccard distances
Include newspapers other than the New York Times, both for model training and for recommendation (for example the Guardian, which also has a great API framework)
Check where the user comes from (UK, US, Australia) and recommend either from NYT/Guardian US, Guardian UK, or Guardian Australia
Extend the system beyond targeting only English twitterers and recommending only English newspaper articles
Try to get even more user information, for example from Facebook, LinkedIn, etc., to make even better recommendations
Also note that in order to stay accurate the system needs to be retrained every now and then with more recent NYT article data (it has never been retrained since 2015).
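
The suggestion above to recommend from several probable sections could look roughly like this. The sketch assumes class probabilities are available, for example from scikit-learn's `predict_proba` together with the classifier's `classes_` attribute; the function name `top_sections`, the cutoff `min_prob`, and the sample numbers are hypothetical.

```python
def top_sections(labels, probabilities, k=3, min_prob=0.1):
    """Return up to k section labels, most probable first, skipping unlikely ones."""
    ranked = sorted(zip(labels, probabilities), key=lambda pair: pair[1], reverse=True)
    return [label for label, prob in ranked[:k] if prob >= min_prob]

# Made-up probabilities for one user, in the same order as the labels.
labels = ["Arts", "Business", "Sports", "World"]
probabilities = [0.05, 0.30, 0.45, 0.20]

chosen = top_sections(labels, probabilities, k=3)
print(chosen)
```

Top stories could then be fetched for every section in `chosen` and ranked together by Jaccard distance.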

Files

src/articles.py: Downloads the training data via the New York Times Article Search API
src/datapreparation.py: Cleans the data
src/algorithm.py: Parametrizes the tf-idf vectorizer and fits the Logistic Regression model
src/predictor.py: Connects to Twitter and New York Times Top Stories API and recommends articles to Twitter users (needs Twitter handle as command line input)
underthehood.ipynb: Discusses the engine in detail and shows a few data and model visualizations as well as numbers
readlikeyoutweet_schematic.png: Schematic visualization of the recommender's workflow
website/...: Website code to run the model on the web as a Heroku app using Flask (http://readlikeyoutweet.herokuapp.com/)

Note that I did not upload the actual datasets, the pickled Logistic Regression model, the pickled tf-idf vectorizer, or the pickled stopwords (the website also needs the stopwords to be pickled). However, the code can be used to download the data again and refit the models.

Furthermore, note that the code requires API keys for all of the APIs involved.

Have fun!

Karsten
