Analysis of a customer review dataset from Amazon. ETL process is performed using PySpark and is connected to an AWS RDS instance. Pandas is used to determine if there is any bias towards favorable reviews from "Amazon Vine" members.

Mishkanian/Amazon_Vine_Analysis

Amazon Vine Analysis

Project Overview

The purpose of this project is to analyze a customer review dataset from Amazon. PySpark is used for the ETL process: the dataset is extracted, transformed, and loaded into PostgreSQL tables on an AWS RDS instance, which is managed through pgAdmin. Finally, Pandas is used to determine whether there is any bias towards favorable reviews from "Amazon Vine" members in the dataset.

Data Source

The dataset is extracted from Amazon's US Reviews Dataset. From this list, the Video Game dataset is chosen for analysis. Click here to download the full Video Game dataset.

What is Amazon Vine?

Amazon Vine is a program launched by Amazon.com that allows manufacturers and publishers to receive reviews for their products from a filtered group of Amazon customers, called "Vine Voices." These Vine Voices are chosen based on several criteria, including their total number of reviews and helpfulness of reviews. In exchange for free products, these Vine Voices are required to publish a review. Amazon's Vine help guide states that "Voices are not paid" and that Amazon welcomes an "honest opinion about the product."

Extract, Transform, Load (ETL)

Using Amazon_Reviews_ETL.ipynb, the video game dataset is extracted into the following DataFrames: customers_df, products_df, review_id_df, and vine_df. After connecting to the AWS RDS instance, each of these DataFrames is written to its existing table in pgAdmin. The password and URL used to configure the RDS connection have been hidden for security, so you will need to supply your own values in this section.
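One way to keep the password and URL out of the notebook is to read them from environment variables. The following is a minimal sketch, not the notebook's actual code: the function name build_jdbc_settings, the environment variable names, and the placeholder values are all hypothetical, and it assumes the RDS instance runs PostgreSQL on the default port.

```python
import os


def build_jdbc_settings(env=os.environ):
    """Build a JDBC URL and connection properties from environment variables.

    The variable names (RDS_HOST, RDS_PORT, RDS_DB, RDS_USER, RDS_PASSWORD)
    are hypothetical; use whatever names suit your setup.
    """
    host = env["RDS_HOST"]
    port = env.get("RDS_PORT", "5432")  # 5432 is the default PostgreSQL port
    database = env["RDS_DB"]
    url = f"jdbc:postgresql://{host}:{port}/{database}"
    properties = {
        "user": env["RDS_USER"],
        "password": env["RDS_PASSWORD"],
        "driver": "org.postgresql.Driver",
    }
    return url, properties


# Placeholder values for illustration only (not real credentials).
example_env = {
    "RDS_HOST": "example-host.rds.amazonaws.com",
    "RDS_DB": "example_db",
    "RDS_USER": "example_user",
    "RDS_PASSWORD": "example_password",
}
jdbc_url, jdbc_properties = build_jdbc_settings(example_env)
print(jdbc_url)

# In the notebook, a Spark DataFrame could then be written with something like:
# vine_df.write.jdbc(url=jdbc_url, table="vine_table", mode="append",
#                    properties=jdbc_properties)
```

The write mode "append" in the final comment is an assumption based on the tables already existing; adjust it to match how your tables were created.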

postgres_table

For the purpose of this project, only the vine_table is necessary, which is exported from pgAdmin as vine.csv (Download the vine.zip file here).

Determining Review Bias

To determine if there is any review bias, Pandas is used to filter and create new DataFrames. This portion of the analysis is found in Vine_Review_Analysis.ipynb.

The vine.csv file is read in as a DataFrame:

vine_df

In the first filter (Filter #1), vine_df is filtered to show only rows where the number of total votes is greater than or equal to 20. This selects reviews that are more likely to be helpful and avoids division-by-zero errors in the next step. The result is saved as a new DataFrame.

first_filter
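As a rough illustration of this first filter, the sketch below uses a small made-up DataFrame in place of vine.csv. The column names (review_id, star_rating, helpful_votes, total_votes, vine) and the name first_filter_df are assumptions and may differ from those in the notebook.

```python
import pandas as pd

# Tiny made-up sample; the real data comes from vine.csv.
# Column names are assumed, not taken from the notebook.
vine_df = pd.DataFrame({
    "review_id": ["R1", "R2", "R3", "R4"],
    "star_rating": [5, 3, 4, 5],
    "helpful_votes": [18, 2, 5, 30],
    "total_votes": [20, 4, 25, 31],
    "vine": ["N", "N", "Y", "Y"],
})

# Filter #1: keep only reviews with at least 20 total votes.
first_filter_df = vine_df[vine_df["total_votes"] >= 20]
print(first_filter_df)
```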

A second filter (Filter #2) is then applied to the result of Filter #1 to create a new DataFrame containing all rows where the number of helpful votes divided by the number of total votes is greater than or equal to 50%.

second filter
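The helpful-vote ratio filter could look something like the following sketch. The sample rows are made up, and the column names and DataFrame names are assumptions rather than the notebook's own.

```python
import pandas as pd

# Made-up rows standing in for the output of Filter #1
# (every row already has total_votes >= 20, so the division is safe).
first_filter_df = pd.DataFrame({
    "review_id": ["R1", "R3", "R4"],
    "helpful_votes": [18, 5, 30],
    "total_votes": [20, 25, 31],
})

# Filter #2: keep reviews where helpful votes make up at least 50% of total votes.
helpful_ratio = first_filter_df["helpful_votes"] / first_filter_df["total_votes"]
second_filter_df = first_filter_df[helpful_ratio >= 0.5]
print(second_filter_df)
```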

Finally, two more DataFrames are created to split Filter #2 into reviews written as part of the Vine program (paid) and reviews that are not part of the Vine program (unpaid). After creating these final DataFrames, the following metrics are determined:

  • The total number of reviews.
  • The number of 5-star reviews.
  • The percentage of 5-star reviews (Paid and Unpaid).
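The three metrics above can be sketched as a small helper applied to each group. This is an illustrative version only: the helper name five_star_summary, the column names, and the "Y"/"N" encoding of the vine column are assumptions, and the sample data is made up.

```python
import pandas as pd


def five_star_summary(df):
    """Return the total review count, 5-star count, and 5-star percentage."""
    total = len(df)
    five_star = int((df["star_rating"] == 5).sum())
    # Guard against an empty group to avoid dividing by zero.
    percent = 100 * five_star / total if total else float("nan")
    return {"total": total, "five_star": five_star, "percent": percent}


# Made-up rows standing in for the output of Filter #2.
second_filter_df = pd.DataFrame({
    "star_rating": [5, 4, 5, 2, 5, 1],
    "vine": ["Y", "Y", "N", "N", "N", "N"],
})

# Split into Vine (paid) and non-Vine (unpaid) reviews.
paid_df = second_filter_df[second_filter_df["vine"] == "Y"]
unpaid_df = second_filter_df[second_filter_df["vine"] == "N"]

paid = five_star_summary(paid_df)
unpaid = five_star_summary(unpaid_df)
print(paid)
print(unpaid)
```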

Results

For the Video Game dataset:

  • There are only 94 Vine reviews.
    • 48 Vine reviews gave 5-stars.
    • Approximately 51.06% of Vine reviews were 5-stars.
  • There are 40,471 non-Vine reviews.
    • 15,663 non-Vine reviews gave 5-stars.
    • Approximately 38.70% of non-Vine reviews were 5-stars.
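The percentages follow directly from the counts listed above. As a quick check, using only those four numbers:

```python
# Counts taken from the results above.
vine_total, vine_five_star = 94, 48
non_vine_total, non_vine_five_star = 40_471, 15_663

# Percentage of 5-star reviews in each group, formatted to two decimals.
vine_pct = f"{100 * vine_five_star / vine_total:.2f}"
non_vine_pct = f"{100 * non_vine_five_star / non_vine_total:.2f}"

print(f"Vine: {vine_pct}%")
print(f"Non-Vine: {non_vine_pct}%")
```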

Summary

Based on this analysis, there appears to be a positivity bias among Video Game reviews in the Vine program. While only 38.70% of regular reviews gave 5-stars, 51.06% of Vine reviews gave 5-stars.

However, it should be noted that this dataset does not represent a single product. It contains a wide variety of hardware, software, and accessories for different video game consoles. Because of this variety, the analysis cannot be applied to individual products, only to the dataset as a whole. Furthermore, of the 40,565 reviews analyzed, only 0.23% are Vine reviews. This small number of reviews is not enough to sway the overall rating of products listed for sale on Amazon.
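To illustrate how little the Vine reviews move the overall figure, the sketch below compares the 5-star percentage with and without them, using only the counts reported in the results. It is simple arithmetic, not a statistical significance test.

```python
# Counts taken from the results section.
vine_total, vine_five_star = 94, 48
non_vine_total, non_vine_five_star = 40_471, 15_663

all_total = vine_total + non_vine_total

# Share of all analyzed reviews that come from the Vine program.
vine_share_pct = 100 * vine_total / all_total

# 5-star percentage with Vine reviews included versus excluded.
combined_pct = 100 * (vine_five_star + non_vine_five_star) / all_total
non_vine_pct = 100 * non_vine_five_star / non_vine_total

print(f"Vine share of reviews: {vine_share_pct:.2f}%")
print(f"5-star rate, all reviews: {combined_pct:.2f}%")
print(f"5-star rate, non-Vine only: {non_vine_pct:.2f}%")
```

Including the Vine reviews raises the overall 5-star rate from about 38.70% to about 38.73%, a shift of roughly 0.03 percentage points.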

Recommendations for Further Analysis

One additional analysis that could further test for positivity bias is to compare the average star rating of Vine reviews with the average star rating of non-Vine reviews. If Vine reviewers have a higher average star rating than non-Vine reviewers, this might be an indication of positivity bias.
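A comparison of average ratings might be sketched as follows. The data here is made up purely to show the mechanics, and the column names and "Y"/"N" values for the vine column are assumptions.

```python
import pandas as pd

# Made-up sample; in practice this would be the filtered review data.
reviews_df = pd.DataFrame({
    "star_rating": [5, 4, 3, 5, 2, 4],
    "vine": ["Y", "Y", "N", "N", "N", "N"],
})

# Average star rating for Vine ("Y") and non-Vine ("N") reviews.
mean_rating_by_group = reviews_df.groupby("vine")["star_rating"].mean()
print(mean_rating_by_group)
```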

Author: Michael Mishkanian
For all questions and inquiries, please contact me on LinkedIn.
