Skip to content

Latest commit

 

History

History
134 lines (86 loc) · 7.47 KB

README.md

File metadata and controls

134 lines (86 loc) · 7.47 KB

Sentiment Search: Make the Internet your Focus Group

Visualizing the sentiment of the Internet across multiple media platforms.

GitHub repo: https://github.com/farisdurrani/sentiment-search
Implemented: Fall 2022

How to Run

Prerequisites:

  1. At least 16 GB RAM (recommended 24 GB)
  2. NodeJS v18
  3. NPM v9: npm install npm@latest -g
  4. Python 3.10
  5. Download the reduced_data dataset from Kaggle, unzipping it, and putting all the files inside /backend

Start both the backend and frontend apps once inside /Project:

Note the different backend commands for Mac/Linux and Windows

Action Command
Install / Update Frontend Packages (one-time) npm run installf
Install / Update Backend Packages (one-time) (linux/mac) npm run installb-mac
Install / Update Backend Packages (one-time) (windows) npm run installb-win
Running the Frontend npm run startf
Running the Backend (linux/mac) npm run startb-mac
Running the Backend (windows) npm run startb-win

Note It may take a minute for the app to show the graphs to start showing on each page load, and a few minutes for the backend to fully start. The message None of PyTorch, TensorFlow >= 2.0, or Flax have been found... can be safely ignored in anticipation of * Debugger PIN: ... which indicates the backend has fully started. If the frontend starts first before the backend, refresh the page so the frontend calls the backend again.

Description and Architecture

Using D3.js and Apexcharts, we compared the changing sentiments over time on over 2.3 million posts from multiple media platforms ranging from January 2015 to November 2022.

Research deliverables

  1. Project report
  2. Poster
  3. Poster presentation

Data Collection

We collected over 2.3 million posts, amounting to 190 GB before being processed down to 1.8 GB, from Facebook, Reddit, The New York Times, The Guardian, CNN, and Twitter ranging from January 2015 to November 2022 including by using the platforms' API and web scraping.

Some of those data collection scripts are located in /Data Collection, while the scripts for collecting posts from CNN and The Guardian are located in:

  1. The Guardian News Article Collector by @farisdurrani
  2. CNN Web Scraper by @farisdurrani

The sentiments were calculated using vaderSentiment~=3.3.2, run by the script make_sentiment.py.

Frontend

Implemented in /frontend, the frontend is created on React.JS, and the visualizations by D3.js and Apexcharts.

Demo video:

demo.mp4

Search bar

The page (localhost:3000 by default) opens with a comfortable search bar where you can type in any search terms to find out the sentiment of those keywords, along with optional options like start and end dates as well as the media platform.

Sentiment Legend

A simple legend shows up underneath, denoting that bars and areas colored red have negative sentiment posts, i.e., posts that are unpleasant, depressing, or harmful. Positive sentiments are blue.

Timeline

The main graph is the Timeline, which shows the count of posts per day over time, and the color of the bars denoting the sentiments. The user can hover over each bar to find out the sentiment and count of posts of that date.

We also show several "significant events" (of our own choosing from Wikipedia) that contain those keywords in circles. These circles have the same height and color as the bar on that date. This allows the user to correlate certain significant change in sentiment with a significant event.

In this example, 1/12/2021 has a great deal of fairly negative posts which correlates with President Trump's second impeachment.

Frequency Chart

As the user hovers over the Timeline above, this frequency chart also updates to show the breakdown of posts on that date.

Bag of Words

We also provide a bubble chart to show the most frequent words in the time period as well as the mean sentiment of the posts containing those words.

Fraction of Sentiment Polarities

The last pie chart derives from the bubble chart above to show the percentage of words that are positve negative (sentiment < 1), neutral (0), and (sentiment > 1).

Backend

Implemented in /backend, the backend of our app is a Python Flask app (app.py with support from api.py, live at localhost:8000 by default) that converts the CSV files containing the data to a pandas dataframe which it uses to return the value queried by the front end as the user changes parameters like keywords and date ranges.

The queries are as follows (a more detailed description can be found in BACKEND_API:

  • / OR /test: Given no parameters, return a simple Hello World message to indicate the backend is alive

  • /api/getSummary: Given some filter parameters, give the summary of sentiment data. For each platform of each date, give its mean sentiment, as well as the mean sentiment of that date and the count of posts on that date.

  • /api/getBodyText: Given a list of post IDs, return the full post data for each post

  • /api/getBagOfWords: Given some filter parameters, return the number of occurrences across all posts of the most common words, along with the mean sentiment of all posts containing that word.

  • /api/getPlatformFrequencies: Given filter parameters, return the count and mean sentiment of posts on each platform

Next Steps

Our project could be improved in the future by making it update in real time as people post to the sites that we have analyzed. Another limitation to achieving this is the speed at which we wanted the tool to update. Our method of collecting the bag of words for the bubble chart is inefficient compared to perhaps using a vectorized approach.

Authors

  1. Faris Durrani @farisdurrani
  2. Justin Zandstra @justinzandstra
  3. Lakshmisree Iyengar @lakshmisree-iitk
  4. Nemath Ahmed @nemathahmed
  5. RenChu Wang @rentruewang
  6. Shuyan Lin @shuyanl915

License

Sentiment Search is MIT licensed, as found in the LICENSE file.

The software's documentation is Creative Commons licensed, as found in the LICENSE-docs file.