Data analysis of a data set - BeerAdvocate

Tackled questions

Which brewery produces the strongest beers by abv?
If you had to pick 3 beers to recommend to someone, how would you approach the problem?

Preface

Firstly, 68'136 rows in the dataset contain at least one NULL value. That's ~ 4.29% of all reviews. I'm removing this data for further discussion.

Question 1

Which brewery produces the strongest beers by abv?

To find an answer to this question, lets begin with looking at a bar plot of abv value distribution among all available beers.

One can see that the vast majority of beers have less than 10% abv. Furthermore, beers that have 20% or more abv are not visible on the plot. From the following QUERY, one can find out that there are only 18 of these beers!

SELECT count(DISTINCT beer_name) FROM beer_reviews where beer_abv > 20

It might not be clear how to answer question of which brewery produces the strongest beers. But let's take a look at which brewery has the highest mean of abv among produced beers.

One can see that the Schorschbräu brewery has the highest average of abv values among its beers. It also produces most number of beers among shown breweries.
On the other hand, one cannot simply choose other breweries that produce the strongest beers, because some of them produce only 1 beer. See that next brewery with the highest average ratio of abv is Shoes Brewery, it only produces 1 beer.

The next step that could lead us to the answer might be looking at the median of beers produced by breweries. It is 4, because of that it seems resonable to choose only these breweries that produce more or equal to 4 beers.

Let's create the same plot bar again, but taking into account the previous assumption.

Now that we know the abv mean and how many beers are produced by each brewery, we can try to answer the given question.

Which brewery produces the strongest beers by abv?

Schorschbräu - is at the top because of the overwhelming abv mean, even though they only make 10 beers.
AleSmith Brewing Company - is another brewery worth considering. They make over 50 beers which is almost 3 times more than a mean of number of beers produced by these top 10. Its mean abv sits strong in the top 4.

SQL Query verification

To confirm the obtained results, one can run queries:

How many distinct beers does each brewery produce

SELECT COUNT ( DISTINCT beer_beerid ) FROM beer_reviews WHERE brewery_name="<BREWERY_NAME>" AND (brewery_id IS NOT NULL) AND (brewery_name IS NOT NULL) AND (review_time IS NOT NULL) AND (review_overall IS NOT NULL) AND (review_aroma IS NOT NULL) AND (review_appearance IS NOT NULL) AND (review_profilename IS NOT NULL) AND (beer_style IS NOT NULL) AND (review_palate IS NOT NULL) AND (review_taste IS NOT NULL) AND (beer_name IS NOT NULL) AND (beer_abv IS NOT NULL) AND (beer_beerid IS NOT NULL);

Number of breweries

SELECT COUNT ( DISTINCT brewery_name ) FROM beer_reviews WHERE (brewery_id IS NOT NULL) AND (brewery_name IS NOT NULL) AND (review_time IS NOT NULL) AND (review_overall IS NOT NULL) AND (review_aroma IS NOT NULL) AND (review_appearance IS NOT NULL) AND (review_profilename IS NOT NULL) AND (beer_style IS NOT NULL) AND (review_palate IS NOT NULL) AND (review_taste IS NOT NULL) AND (beer_name IS NOT NULL) AND (beer_abv IS NOT NULL) AND (beer_beerid IS NOT NULL);

Question 2

If you had to pick 3 beers to recommend to someone, how would you approach the problem?

Since this database contains over a million reviews from 32'908 users, it might be hard to approach this problem with traditional way. I am going to use matrix factorization. To apply this method, I need to filter reviewers that have posted little reviews.

See that most users posted less that 10 reviews.

I will break down the solution into the following parts:

Create a data stating how many times each beer have received maximum overall score. Bypassing false data from the database (*),
Find 3 beers which have the most number of maximum overall reviews and are of different beer style.

Solution

(*) Found problems: When the following query is run:

SELECT review_appearance, review_aroma, review_palate, review_taste, review_overall FROM beer_reviews where review_time = 1235954167

One can see that the review of review_time = 1235954167 has review_overall of 5.0, but all the partial reviews are in range 2.0 - 3.0. This indicates that dataset contains false data.

To eliminate this problem I filter the data indecies which each partial review is smaller that stated constant value (in this case 4.0).
To apply those changes in mysql query, one can run:

SELECT review_appearance, review_aroma, review_palate, review_taste, review_overall
FROM beer_reviews WHERE review_appearance >=4 AND review_aroma>=4 AND review_palate>=4
AND review_taste>=4 GROUP BY review_time
HAVING AVG(review_appearance + review_aroma + review_palate + review_taste) >= 4

Worth noting

I. There might be corrupted data where a brewery with the same name is identified by different brewery_id

Name		Name	Last commit message	Last commit date
Latest commit History 16 Commits
images		images
README.md		README.md
main.py		main.py
plotter.py		plotter.py
recommender.py		recommender.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Data analysis of a data set - BeerAdvocate

Table of Contents

Tackled questions

Preface

Question 1

Which brewery produces the strongest beers by abv?

Which brewery produces the strongest beers by abv?

SQL Query verification

Question 2

If you had to pick 3 beers to recommend to someone, how would you approach the problem?

Solution

Worth noting

About

Uh oh!

Releases

Packages

Languages

patrick-bedkowski/BeerAdvocate-Data-analysis

Folders and files

Latest commit

History

Repository files navigation

Data analysis of a data set - BeerAdvocate

Table of Contents

Tackled questions

Preface

Question 1

Which brewery produces the strongest beers by abv?

Which brewery produces the strongest beers by abv?

SQL Query verification

Question 2

If you had to pick 3 beers to recommend to someone, how would you approach the problem?

Solution

Worth noting

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages