Skip to content

Project presents complete analysis of the BeerAdvocate dataset performed using Python and data visualization techniques. The description contains questions related to the dataset and their answers.

Notifications You must be signed in to change notification settings

patrick-bedkowski/BeerAdvocate-Data-analysis

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

16 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Data analysis of a data set - BeerAdvocate


Table of Contents

  1. Tackled questions
    1. Preface
    2. Question 1
    3. Question 2
  2. Worth noting

Tackled questions

  1. Which brewery produces the strongest beers by abv?
  2. If you had to pick 3 beers to recommend to someone, how would you approach the problem?

Preface

Firstly, 68'136 rows in the dataset contain at least one NULL value. That's ~ 4.29% of all reviews. I'm removing this data for further discussion.


Question 1

Which brewery produces the strongest beers by abv?

To find an answer to this question, lets begin with looking at a bar plot of abv value distribution among all available beers.

breweries_abv_count

One can see that the vast majority of beers have less than 10% abv. Furthermore, beers that have 20% or more abv are not visible on the plot. From the following QUERY, one can find out that there are only 18 of these beers!

SELECT count(DISTINCT beer_name) FROM beer_reviews where beer_abv > 20

It might not be clear how to answer question of which brewery produces the strongest beers. But let's take a look at which brewery has the highest mean of abv among produced beers.

breweries_abv_count

One can see that the Schorschbräu brewery has the highest average of abv values among its beers. It also produces most number of beers among shown breweries.
On the other hand, one cannot simply choose other breweries that produce the strongest beers, because some of them produce only 1 beer. See that next brewery with the highest average ratio of abv is Shoes Brewery, it only produces 1 beer.

The next step that could lead us to the answer might be looking at the median of beers produced by breweries. It is 4, because of that it seems resonable to choose only these breweries that produce more or equal to 4 beers.

Let's create the same plot bar again, but taking into account the previous assumption.

breweries_abv_count_mean

Now that we know the abv mean and how many beers are produced by each brewery, we can try to answer the given question.

Which brewery produces the strongest beers by abv?

  1. Schorschbräu - is at the top because of the overwhelming abv mean, even though they only make 10 beers.

  2. AleSmith Brewing Company - is another brewery worth considering. They make over 50 beers which is almost 3 times more than a mean of number of beers produced by these top 10. Its mean abv sits strong in the top 4.


SQL Query verification

To confirm the obtained results, one can run queries:

  • How many distinct beers does each brewery produce
SELECT COUNT ( DISTINCT beer_beerid ) FROM beer_reviews WHERE brewery_name="<BREWERY_NAME>" AND (brewery_id IS NOT NULL) AND (brewery_name IS NOT NULL) AND (review_time IS NOT NULL) AND (review_overall IS NOT NULL) AND (review_aroma IS NOT NULL) AND (review_appearance IS NOT NULL) AND (review_profilename IS NOT NULL) AND (beer_style IS NOT NULL) AND (review_palate IS NOT NULL) AND (review_taste IS NOT NULL) AND (beer_name IS NOT NULL) AND (beer_abv IS NOT NULL) AND (beer_beerid IS NOT NULL);
  • Number of breweries
SELECT COUNT ( DISTINCT brewery_name ) FROM beer_reviews WHERE (brewery_id IS NOT NULL) AND (brewery_name IS NOT NULL) AND (review_time IS NOT NULL) AND (review_overall IS NOT NULL) AND (review_aroma IS NOT NULL) AND (review_appearance IS NOT NULL) AND (review_profilename IS NOT NULL) AND (beer_style IS NOT NULL) AND (review_palate IS NOT NULL) AND (review_taste IS NOT NULL) AND (beer_name IS NOT NULL) AND (beer_abv IS NOT NULL) AND (beer_beerid IS NOT NULL);



Question 2


If you had to pick 3 beers to recommend to someone, how would you approach the problem?


Since this database contains over a million reviews from 32'908 users, it might be hard to approach this problem with traditional way. I am going to use matrix factorization. To apply this method, I need to filter reviewers that have posted little reviews.

See that most users posted less that 10 reviews.

number_of_reviews

I will break down the solution into the following parts:

  1. Create a data stating how many times each beer have received maximum overall score. Bypassing false data from the database (*),
  2. Find 3 beers which have the most number of maximum overall reviews and are of different beer style.

Solution

exercise_2



(*) Found problems: When the following query is run:

SELECT review_appearance, review_aroma, review_palate, review_taste, review_overall FROM beer_reviews where review_time = 1235954167

One can see that the review of review_time = 1235954167 has review_overall of 5.0, but all the partial reviews are in range 2.0 - 3.0. This indicates that dataset contains false data.

To eliminate this problem I filter the data indecies which each partial review is smaller that stated constant value (in this case 4.0).
To apply those changes in mysql query, one can run:

SELECT review_appearance, review_aroma, review_palate, review_taste, review_overall
FROM beer_reviews WHERE review_appearance >=4 AND review_aroma>=4 AND review_palate>=4
AND review_taste>=4 GROUP BY review_time
HAVING AVG(review_appearance + review_aroma + review_palate + review_taste) >= 4

Worth noting

I. There might be corrupted data where a brewery with the same name is identified by different brewery_id

About

Project presents complete analysis of the BeerAdvocate dataset performed using Python and data visualization techniques. The description contains questions related to the dataset and their answers.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages