Retweetability Prediction

This work was submitted in partial fulfillment of the requirements for a Master's program at Columbia University.

Original Thesis Title: The Use of Domain-Specific Sentiment Analysis on Predicting Information Diffusion in Online Social Networks

Short Summary

In this project, I built machine learning models (logistic regression, k-nearest neighbors, random forest, and XGBoost) that incorporate domain-specific sentiment analysis to classify whether a message is retweetable or not. The best model reached an AUC of 0.84.

Previous Work

This project was inspired by previous research projects I was involved in. I presented findings from that work at the following national social science conferences:

March 2018, the Society for Personality and Social Psychology Annual Meeting
May 2019, the Association for Psychological Science Annual Convention

Motivation

As online social networks have become a major platform for sharing one’s opinions, there is a growing need for accurate predictive models of information diffusion. Drawing on insights from social science research (see Previous Work for more details), I set out to build a prediction model that incorporates a number of social factors associated with sharing political content in online social networks.

In addition, continuing my previous work on domain-specific sentiment analysis, I compared the performance of models that use different dictionaries:

General, binary sentiment-based model (positive or negative words)
Moral-emotional sentiment-based model (moral-emotional words such as peace and punish)
Outrage-fear sentiment-based model
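
The dictionary-based approach listed above can be sketched as counting how many tokens in a tweet appear in a given word list. This is a minimal illustration under stated assumptions, not the thesis code: the tiny word lists, the tokenizer, and the function name `dictionary_score` are hypothetical, and only `peace` and `punish` come from the description above. An outrage-fear dictionary would be scored the same way with its own word list.

```python
import re

# Hypothetical miniature dictionaries; real lexicons contain many more entries.
DICTIONARIES = {
    "general": {"good", "great", "bad", "terrible"},
    "moral_emotional": {"peace", "punish"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def dictionary_score(text, words):
    """Return the number of tokens in `text` that appear in `words`."""
    return sum(1 for token in tokenize(text) if token in words)


# Made-up tweet text for illustration.
tweet = "We must punish corruption and work for peace. Great news!"
scores = {name: dictionary_score(tweet, words) for name, words in DICTIONARIES.items()}
print(scores)
```

Each score can then serve as one sentiment feature per tweet, so swapping the dictionary changes the feature while leaving the rest of the model unchanged.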

Data

The dataset consists of tweets scraped from the accounts of all 100 U.S. Senators during the year leading up to the 2016 U.S. election, from November 2015 to October 2016 (n = 99,750).

Features

The number of followers
URL and Media Attachment
Political ideology scores
Sentiment Analysis
Gender
Social support

Target

The main task was to classify whether a tweet is retweetable or not. To define a retweetable message, I used the median retweet count in the dataset, which was 6. Thus, tweets retweeted 6 times or more were classified as 'retweetable', and all others were classified as 'not retweetable'.
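
The labeling rule above can be sketched in a few lines. This is a minimal illustration with made-up retweet counts; the function name `label_retweetable` is an assumption and does not come from the thesis code.

```python
from statistics import median


def label_retweetable(retweet_counts, threshold=None):
    """Label each count 1 (retweetable) if it is >= threshold, else 0.

    When no threshold is given, the median of the counts is used.
    """
    if threshold is None:
        threshold = median(retweet_counts)
    return [1 if count >= threshold else 0 for count in retweet_counts]


# Made-up retweet counts for illustration; their median happens to be 6.
counts = [0, 2, 5, 6, 9, 40, 120]
labels = label_retweetable(counts)
print(labels)  # [0, 0, 0, 1, 1, 1, 1]
```

Splitting at the median keeps the two classes roughly balanced, which makes AUC and accuracy easier to interpret.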

Heatmap for the Correlation Matrix for All Features

Feature Importance Plot for the Best Model

importance_type='weight'

Performance

Among all models, XGBoost with the outrage-fear dictionary achieved the highest AUC: 0.8393.
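
For readers unfamiliar with the metric, AUC can be read as the probability that a randomly chosen retweetable tweet receives a higher predicted score than a randomly chosen non-retweetable one, with ties counted as half. The sketch below computes it by direct pairwise comparison. The labels, scores, and the function name `auc` are illustrative assumptions; in practice a library routine such as scikit-learn's `roc_auc_score` would normally be used.

```python
def auc(labels, scores):
    """Compute ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Made-up labels (1 = retweetable) and predicted scores for illustration.
y_true = [1, 0, 1, 0, 1, 0]
y_score = [0.9, 0.3, 0.6, 0.7, 0.8, 0.2]
result = auc(y_true, y_score)
print(result)
```

In this toy example, 8 of the 9 positive-negative pairs are ordered correctly, so the AUC is 8/9.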
