---
title: "Make Sure to Drop a Like—The Effect of Consumer Rating on Application Success"
author: "Oleg Ananyev, Oren Carmeli, Romain Hardy, Sam Rosenberg"
date: "Fall 2021"
output:
  bookdown::pdf_document2:
    toc: false
urlcolor: blue
header-includes:
  - \usepackage{amsmath}
  - \usepackage{amssymb}
  - \usepackage{amsthm}
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(warning = FALSE, message = FALSE, fig.pos = 'H')
options(tinytex.verbose = TRUE)
```
\tableofcontents
\section{Introduction}
\par
In 2020, Statista [reported](https://www.statista.com/statistics/271644/worldwide-free-and-paid-mobile-app-store-downloads/) a total of 220 billion mobile application downloads globally. Another [study](https://buildfire.com/app-statistics/) found that approximately 70% of all digital media spending in the United States goes toward mobile applications. As digital applications become the primary media through which organizations engage with their customers, it becomes increasingly valuable to understand the drivers that lead to application success.
\par
The Google Play Store is one of the major hubs for Android mobile phone and tablet applications. Users can download any application on the Google Play Store for personal consumption across a wide range of categories. Making an application stand out in a sea of thousands of competing applications is no trivial task, however. To succeed, developers must carefully consider factors such as price, application size, and genre. Another variable that may be critical to an application's success is consumer rating. Today, smart algorithms play a key role in suggesting applications to consumers, creating a feedback loop that propels certain applications towards success and leaves others behind. An understanding of the relationship between consumer rating and application success would be incredibly valuable to developers seeking to create the next viral application.
\par
The following study is a causal analysis of the relationship between consumer rating and application success. Using a data set scraped directly from the Google Play Store, we will build linear models that assess the importance (or lack of importance) of consumer rating on application success, with additional variables such as price, application size, and application category serving as controls. If our study supports the existence of a causal pathway, it would signal application developers to invest heavily in improving their ratings, for instance by interviewing customers and implementing new features. Our study may also be of interest to Google and other companies seeking to improve consumer engagement with their applications.
\par
Given our prior beliefs about what factors motivate individuals to download applications, we believe there are omitted variables not included in our data set that may influence application success. These include product brand awareness, application store rankings, and the total addressable market, among others. In spite of these limitations, our study should provide useful insight into the causal factors of application success.
\par
The paper is structured as follows. Section 2 outlines our research question, and Section 3 describes the causal theory we will use to contextualize our analysis. Section 4 describes the data we leverage for our models and Section 5 discusses our research design. Section 6 highlights our exploratory data analysis, followed by our statistical models in Section 7 and results in Section 8. Section 9 considers the limitations of our models, and Section 10 presents our conclusions.
\section{Research Question}
The goal of this study is to assess the causal factors of application success. Specifically, we seek to determine if there is a statistically significant relationship between consumer rating and application success. In the ensuing sections, we will answer the following question:
>\textit{Does having a higher consumer rating score lead to more downloads for Google Play Store applications?}
\section{Causal Theory}
Before we can discuss our data and research design, we must first describe the causal model which will serve as the reference point for our analysis. We identify five factors that bear causal influence on the success of an application. These are (1) consumer rating, (2) price, (3) category, (4) age, and (5) size. Our proposed causal graph is shown below.
![A hypothetical causal diagram for Google Play Store applications.](./pictures/simple_model.PNG){#image1 .class width=50%}
\subsubsection{Application Success}
Our objective is to assess whether consumer rating has a positive effect on application success. There are different ways in which we could operationalize success; we choose to use raw download count as a surrogate, since it directly measures the number of consumers that made the decision to download an application. Although this choice ignores other potential aspects of success, such as revenue and social impact, it is effective in its simplicity and appropriate for an exploratory study.
\subsubsection{Consumer Rating}
The leading independent variable of interest is consumer rating. We hypothesize that higher ratings improve application success, because an application that has been highly rated is one that has been deemed worthwhile by other users. When consumers decide which applications to download, they will likely trust the opinions of their peers and select those with positive reviews. Conversely, we expect applications with negative ratings to have less success, since other users have judged them poorly. There should not be causal pathways leading from consumer rating to any of the other explanatory variables, since they are determined during the development phase of an application whereas consumer rating is decided once an application is published to the Google Play Store.
\par
Although we have not included it in the diagram, there is the possibility that a reverse causal pathway exists from application success to consumer rating. Successful applications are those that are enjoyable to a large number of consumers. Therefore, it is possible that users will rate applications with high download counts higher than applications with low download counts because they are primed to believe that such applications are better—otherwise, the successful applications would not have received so many downloads. Generally, we expect the reverse pathway to be weaker than the forward one. If rating and success have positive effects on one another, however, our models may suffer from positive feedback. We will discuss the implications of this effect in Section 9.
\subsubsection{Price}
We expect price to have a causal effect on application success. Generally, we believe that free applications will have higher download counts due to the lack of a monetary barrier. Of course, this may not always be the case, as paid applications could offer better features and more desirable experiences, thus encouraging downloads. We also anticipate a causal pathway from price to consumer rating. In particular, we believe that consumers are more likely to rate paid applications positively than they are to rate free applications positively, since the former type offers better features.
\subsubsection{Category}
\par
The category that an application belongs to is likely to affect its success. Certain categories of applications appeal to broad audiences and are more likely to find success than applications which appeal to smaller subsets of consumers. This interaction may not necessarily be so straightforward, however. If an application belongs to a popular category, then it also has to compete with other applications in the same category, which may in fact be detrimental to its success. Meanwhile, applications belonging to niche categories could have a greater chance of achieving success simply because they have fewer competitors. Overall, we expect that the most successful applications will belong to popular categories, but that moderately successful applications will be spread across different categories.
\par
Category is also a predictor of consumer rating. Due to stylistic and functional differences between application categories, it is likely that they are reviewed against different criteria. For example, a consumer reviewing a mobile game may place emphasis on the graphics, the fluidity of the controls, and the balance of the game mechanics, among other factors. A lifestyle application, on the other hand, will probably be judged on completely different features, such as ease of use, relevance in everyday life, and usefulness. If review criteria depend on application category, then consumer ratings assuredly do too. It is difficult to predict in advance which categories are positively or negatively associated with consumer ratings, though we expect categories with narrower consumer bases to receive harsher ratings. Categories of applications that can lead users to feel frustrated, such as games, social media, and dating applications, may also receive more negative reviews on average.
\subsubsection{Age}
\par
If we are to interpret application success in terms of the number of downloads an application accumulates, then the age of an application is necessarily an influential factor. Applications can only receive more downloads as time passes, so the longer an application remains on the Google Play Store, the more downloads it is likely to have. As an example of why this is important, consider two applications: one that was uploaded one year ago, and one that was uploaded one week ago. The older application receives 100 downloads a month for the whole year while the newer application receives 1,000 downloads in one week. If we were to only compare the raw download counts (1,200 to 1,000), it would seem as if the older application is more successful. By bringing age into the analysis, we are able to compare the applications by download rate instead of count, thus realizing that the newer application is more successful than its older counterpart.
\par
We do not expect age to have causal effects on the other explanatory variables, though it might be possible to argue that consumer rating and price are affected by age. Unless our analysis clearly demonstrates otherwise, we will assume these effects are negligible.
\subsubsection{Size}
\par
We foresee two opposing sides to the relationship between application size and success. The first is a negative effect; given the limited space available on mobile and tablet devices, users may be more likely to download smaller applications. Alternatively, application size could be an indicator of production quality, in which case users may prefer larger applications over smaller ones. We hypothesize that the first effect takes precedence.
\subsubsection{Epsilon}
Epsilon represents variables which may influence application success but are independent from the other variables in our causal graph. For instance, these could include geographical location or the time of day at which a download occurs. Crucially, we assume that there do not exist any directed paths from epsilon to any of the five explanatory variables. This guarantees independence and is necessary for ordinary least-squares regression to be valid.
\section{Data}
To answer our research question, we will leverage public data about applications available on the Google Play Store. This data was randomly scraped from the Google Play Store interface and uploaded to [Kaggle](https://www.kaggle.com/lava18/google-play-store-apps) in 2019. It contains key information about sampled applications, such as downloads, file size, consumer rating, category, and price. In total, the data contains records of about 10,000 applications.
\par
For the modeling phase, we will use ordinary least-squares (OLS) regression. OLS regression is the plug-in estimator for the best linear predictor of a dependent random variable given the joint distribution of a set of independent random variables. Although OLS regression is often used with the goal of making predictions on new data, we will instead use it to answer a causal question about the relationship between variables. By interpreting model coefficients within the context of our causal theory, we will develop a statistically valid argument that addresses the research question.
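Concretely, given an outcome vector $Y$ and a design matrix $X$, OLS produces the plug-in coefficient estimates
$$\hat{\beta} = \arg\min_{\beta} \lVert Y - X\beta \rVert_2^2 = (X^\top X)^{-1} X^\top Y,$$
which estimate the coefficients of the best linear predictor of $Y$ given $X$.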
\par
In our case, the dependent variable is application success and the independent variables are the predictors included in our causal model—consumer rating, price, category, age, and size. Unfortunately, these variables do not map exactly onto the fields in our data set, so we will need to make certain approximations. The features that most closely correspond to our dependent and independent variables are listed below.
\subsection{Dependent Variable}
* Application Success — `installs` (the accumulated number of downloads since the application was uploaded to the Google Play Store)
\subsection{Independent Variables}
* Consumer Rating — `rating` (the average consumer rating of an application, out of 5)
* Price — `price` (the price of the application) and `type` (the price type of the application, free or paid)
* Category — `category` (the category tagged for the application, e.g. Lifestyle, Game) and `content_rating` (the official content rating given to the application, e.g. Teen, Everyone, Mature 17+)
* Age — `current_version` (the current version of the application) and `last_updated` (the number of years since the application was last updated)
* Size — `size` (the download size of the application, in units of MB)
\section{Research Design}
\par
We seek to test our hypothesis that positive consumer ratings lead to greater success for Google Play Store applications. To moderate and refine our analysis, we will incorporate four additional control variables: price, category, age, and download size. Price lets us differentiate between paid and free applications, while category lets us differentiate between different genres and target audiences. Age is included to account for the fact that applications which have been available in the store for a long time have an innate advantage over applications which were uploaded recently; with age as a variable, we can directly compare applications which were uploaded simultaneously. Finally, we include download size as it could be an indicator of production quality.
\par
Our data set offers a cross-sectional view of Google Play Store applications in 2019. Since not every feature maps directly to the variables we have defined in our causal framework, we make certain approximations, as listed above. Although these mappings are sometimes imperfect, we believe they are sufficient for a meaningful analysis.
\par
Before we proceed with building statistical models, we will conduct a thorough exploratory analysis of the data. We will note important patterns and trends in the data set, filter problematic entries and outliers, and justify necessary variable transformations. From there, we will build three models of increasing complexity and interpret the model coefficients, verify underlying assumptions, and discuss possible limitations. The first model will estimate how `installs` depends on `rating`, and will serve as a baseline for further analysis. The second model introduces control variables from our causal theory which we hypothesize have an effect on success and consumer ratings. The third and final model explores interactions between ratings and other explanatory variables. As justification for adding specific covariates, we will provide visualizations and conduct statistical tests that demonstrate their significance.
\section{Exploratory Data Analysis}
Prior to exploring the data, we create rules to filter and clean records based on logical conditions. This procedure involves removing duplicate records, removing records with null review counts, and removing records with consumer ratings greater than 5. Since our research question focuses on consumer rating, we elect to keep only records with valid values for that field. Intuitively, we do not believe that consumer rating is a suitable predictor when an application has fewer than 100 ratings in total, so we only keep applications that meet this threshold. Although this step removes almost 25% of the data, it is acceptable given that the initial data set contains approximately 10,000 records.
\par
After these initial operations, the cleaned data set contains 7,226 records (distinct applications) and 24 metadata columns, 11 of which we constructed ourselves. We split the exploration into two sections based on variable type: numeric and categorical.
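For illustration, these rules might be implemented as in the sketch below. The block is not evaluated; the actual cleaning logic lives in the sourced `get_clean_dataset()` helper, and the `raw_apps` object and column names are assumptions.
```r
# Illustrative sketch only -- the real cleaning happens in get_clean_dataset().
# `raw_apps` is a hypothetical data frame of scraped Play Store records.
library(dplyr)
apps_clean <- raw_apps %>%
  distinct(app, .keep_all = TRUE) %>%  # remove duplicate records
  filter(!is.na(reviews)) %>%          # remove null review counts
  filter(rating <= 5) %>%              # remove invalid consumer ratings
  filter(reviews >= 100)               # require at least 100 ratings
```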
```{r environment set-up, echo = FALSE}
install.packages("DT")
install.packages("GGally", repos = "https://ftp.osuosl.org/pub/cran/")
install.packages("moments", repos = "https://ftp.osuosl.org/pub/cran/")
install.packages("corrplot", repos = "http://cran.us.r-project.org")
install.packages("equatiomatic")
library(corrplot)
library(data.table)
library(DT)
library(equatiomatic)
library(GGally)
library(ggplot2)
library(kableExtra)
library(knitr)
library(latex2exp)
library(lmtest)
library(lubridate)
library(moments)
library(sandwich)
library(stargazer)
library(tidyverse)
# Pull in functions
source('./functions/get_robust_se.R')
source('./functions/get_clean_dataset.R')
source('./functions/eda_calculate_stats_by_group.R')
source('./functions/eda_build_quantile_table.R')
# Knitr options
knitr::opts_chunk$set(echo = TRUE)
```
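The helper functions sourced above are not reproduced in this document. For context, a minimal robust-standard-error helper consistent with how `get_robust_se` is used in Section 8 might look like the following sketch (the repository's actual implementation may differ):
```r
# Minimal sketch (assumption): heteroskedasticity-robust standard errors
# for a fitted lm object, computed with the sandwich package.
get_robust_se <- function(model) {
  sqrt(diag(sandwich::vcovHC(model, type = "HC1")))
}
```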
```{r load and clean data, echo = FALSE}
data_clean <- get_clean_dataset(minimum_review_count = 0)
```
\subsection{Numeric Variables}
We want to understand the underlying distribution of each numeric variable as well as examine the correlations and covariances among them. The distributions are used to measure the quality of the data, identify outliers, and evaluate the need for variable transformations. In addition, understanding the correlations between variables helps us to highlight those that might explain the variance of our dependent variable, and to quantify the level of collinearity between different features.
\subsubsection{Distributions}
``` {r distribution summary, results = "asis", echo = FALSE}
numeric_cols <- c(
"installs",
"size",
"reviews",
"rating",
"price",
"current_version"
)
stargazer(
data_clean[, numeric_cols],
column.sep.width = "3pt",
font.size = "small",
header = FALSE,
title = "Distribution statistics for numeric variables"
)
```
After the cleaning step, all feature values appear valid; there are no negative values, null values, or values close to infinity. Notably, only two fields contain zero values, `price` and `current_version`. This will be important when deciding whether or not to apply logarithmic transformations. We also note that we have treated `current_version` (the version number of an application) as a metric variable. Although this is technically incorrect, we feel it is justified given that this feature is ordinal in scale and has consistent intervals.
\par
From the differences between their medians and means, `size` and `reviews` appear to have approximately normal distributions, while the others have strong right or left skews. Since we expect a causal pathway to exist from `installs` to `reviews` (i.e. high install counts lead to high review counts), we do not include `reviews` as a predictor in our analysis.
**Application Success**
```{r application success overview, echo = FALSE}
max_installs <- max(data_clean$installs, na.rm = TRUE)
med_installs <- median(data_clean$installs, na.rm = TRUE)
skew_installs <- skewness(data_clean$installs, na.rm = TRUE)
```
\par
Application success is measured using the `installs` feature, which represents the raw download count of an application. It is important to note here that `installs` is a binned feature. The bins start at 1 and scale upwards logarithmically: 1+, 5+, 10+, 50+, 100+, 500+, etc. For example, a value of 100+ means that the application has between 100 and 499 downloads. In the cleaning step, we remove the + sign and convert `installs` to a metric variable. This conversion is valid because the feature is ordinal and there is a measurable distance between bins. Given that the bin widths scale logarithmically, the distances between bins are consistent on a logarithmic scale. Although there is precision error due to the fact that binning obscures the true download count, we believe that `installs` can be treated as a metric variable in practice. The raw distribution has a strong right skew of $\tilde{\mu}_3 = `r signif(skew_installs, 3)`$; the maximum is `r signif(max_installs, 3)` and the median is `r signif(med_installs, 3)`. Unsurprisingly, applying a logarithmic transformation causes the resulting distribution to resemble a normal distribution, which makes it an appropriate transformation for future modeling.
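As an illustration of this conversion (not evaluated here; the actual logic lives in `get_clean_dataset()`, and the example values are hypothetical):
```r
# Illustrative sketch: convert binned install labels to a metric variable.
installs_raw <- c("100+", "5,000+", "1,000,000+")
installs_num <- as.numeric(gsub("[+,]", "", installs_raw))  # strip "+" and ","
log_installs <- log10(installs_num)  # base-10 log transform used for modeling
```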
``` {r distribution plots (installs), fig.show = "hold", out.width = "50%", echo = FALSE}
ggplot(data = data_clean, aes(x = installs)) +
geom_histogram() +
geom_density(aes(y=0.5*..count..)) +
geom_vline(xintercept = median(data_clean$installs)) +
xlab("Installs") +
ylab("Frequency") +
ggtitle("Raw distribution of install counts")
ggplot(data = data_clean, aes(x = log_installs)) +
geom_histogram() +
geom_density(aes(y=0.5*..count..)) +
geom_vline(xintercept = median(data_clean$log_installs)) +
xlab(TeX("$\\log_{10}(Installs)$")) +
ylab("Frequency") +
ggtitle("Log-distribution of install counts")
```
**Consumer Rating**
```{r consumer rating overview, echo = FALSE}
max_rating <- max(data_clean$rating, na.rm = TRUE)
med_rating <- median(data_clean$rating, na.rm = TRUE)
skew_rating <- skewness(data_clean$rating, na.rm = TRUE)
```
\par
Consumer rating is measured using the `rating` feature, which represents the average consumer rating of an application. This feature appears to have a distribution that is approximately normal, but negatively skewed, with a long left tail ($\tilde{\mu}_3 = `r signif(skew_rating, 3)`$) and most of its mass at larger values. Average ratings range between 0 and 5, with a median of `r signif(med_rating, 3)`. Applying a logarithmic transformation only weakly affects the distribution, so we opt to keep the raw feature instead.
``` {r distribution plots (rating), fig.show = "hold", out.width = "50%", echo = FALSE}
ggplot(data = data_clean, aes(x = rating)) +
geom_histogram() +
geom_density(aes(y=0.5*..count..)) +
geom_vline(xintercept = median(data_clean$rating)) +
xlab("Rating") +
ylab("Frequency") +
ggtitle("Raw distribution of consumer ratings")
ggplot(data = data_clean, aes(x = log_rating)) +
geom_histogram() +
geom_density(aes(y=0.5*..count..)) +
geom_vline(xintercept = median(data_clean$log_rating)) +
xlab(TeX("$\\log_{10}(Rating)$")) +
ylab("Frequency") +
ggtitle("Log-distribution of consumer ratings")
```
**Size**
```{r size overview, echo = FALSE}
max_size <- max(data_clean$size, na.rm = TRUE)
med_size <- median(data_clean$size, na.rm = TRUE)
skew_size <- skewness(data_clean$size, na.rm = TRUE)
```
\par
Download size is measured using the `size` feature. `size` has a distribution that is slightly closer to normal than `installs`, though it still has a noticeable right tail ($\tilde{\mu}_3 = `r signif(skew_size, 3)`$). The maximum download size is `r signif(max_size, 3)` MB, and the median is `r signif(med_size, 3)` MB. Applying a logarithmic transformation shifts the distribution closer to normality. Although the log-distribution fluctuates near the middle, we still find it more appropriate than the raw distribution for modeling.
``` {r distribution plots (size), fig.show = "hold", out.width = "50%", echo = FALSE}
ggplot(data = data_clean, aes(x = size)) +
geom_histogram() +
geom_density(aes(y=0.5*..count..)) +
geom_vline(xintercept = median(data_clean$size)) +
xlab("Size") +
ylab("Frequency") +
ggtitle("Raw distribution of download sizes")
ggplot(data = data_clean, aes(x = log_size)) +
geom_histogram() +
geom_density(aes(y=0.5*..count..)) +
geom_vline(xintercept = median(data_clean$log_size)) +
xlab(TeX("$\\log_{10}(Size)$")) +
ylab("Frequency") +
ggtitle("Log-distribution of download sizes")
```
**Price**
```{r price overview, echo = FALSE}
max_price <- max(data_clean$price, na.rm = TRUE)
med_price <- median(data_clean$price, na.rm = TRUE)
skew_price <- skewness(data_clean$price, na.rm = TRUE)
```
\par
Application price is measured using the `price` feature. In total, `r signif(mean(data_clean$is_free), 3) * 100.0`% of applications in the data set are free to download; the median price is `r signif(med_price, 3)`, and the skew is $\tilde{\mu}_3 = `r signif(skew_price, 3)`$. As a result, we believe that neither the raw distribution nor the log-distribution of `price` is desirable for modeling, as both deviate too far from normality. Instead, we will transform `price` into an indicator variable, `is_free`, that takes value 1 for free applications and 0 for paid applications.
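The derivation of this indicator is straightforward (a sketch; the actual code lives in `get_clean_dataset()`):
```r
# Sketch: flag free applications (1 = free, 0 = paid).
data_clean$is_free <- as.integer(data_clean$price == 0)
```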
``` {r distribution plots (price), fig.show = "hold", out.width = "50%", echo = FALSE}
ggplot(data = data_clean, aes(x = price)) +
geom_histogram() +
geom_density(aes(y=0.5*..count..)) +
geom_vline(xintercept = median(data_clean$price)) +
xlab("Price") +
ylab("Frequency") +
ggtitle("Raw distribution of prices")
ggplot(data = data_clean, aes(x = log_price)) +
geom_histogram() +
geom_density(aes(y=0.5*..count..)) +
geom_vline(xintercept = median(data_clean$log_price)) +
xlab(TeX("$\\log_{10}(Price)$")) +
ylab("Frequency") +
ggtitle("Log-distribution of prices")
```
**Age**
```{r age overview, echo = FALSE}
max_current_version <- max(data_clean$current_version, na.rm = TRUE)
med_current_version <- median(data_clean$current_version, na.rm = TRUE)
skew_current_version <- skewness(data_clean$current_version, na.rm = TRUE)
max_last_updated <- max(data_clean$last_updated, na.rm = TRUE)
med_last_updated <- median(data_clean$last_updated, na.rm = TRUE)
skew_last_updated <- skewness(data_clean$last_updated, na.rm = TRUE)
```
\par
The age of an application is measured using the proxy features `current_version` and `last_updated`. We notice that `current_version` is skewed right ($\tilde{\mu}_3 = `r signif(skew_current_version, 3)`$), with a maximum value of `r signif(max_current_version, 3)` and a median of `r signif(med_current_version, 3)`. Similarly, `last_updated` is skewed right ($\tilde{\mu}_3 = `r signif(skew_last_updated, 3)`$), with a maximum value of `r signif(max_last_updated, 3)` and a median of `r signif(med_last_updated, 3)`. Due to their strong positive skews, we apply logarithmic transformations to both features before the modeling phase.
``` {r distribution plots (age), fig.show = "hold", out.width = "50%", echo = FALSE}
ggplot(data = data_clean, aes(x = current_version)) +
geom_histogram() +
geom_density(aes(y=0.5*..count..)) +
geom_vline(xintercept = median(data_clean$current_version)) +
xlab("Version") +
ylab("Frequency") +
ggtitle("Raw distribution of current versions")
ggplot(data = data_clean, aes(x = log_current_version)) +
geom_histogram() +
geom_density(aes(y=0.5*..count..)) +
geom_vline(xintercept = median(data_clean$log_current_version)) +
xlab(TeX("$\\log_{10}(Version)$")) +
ylab("Frequency") +
ggtitle("Log-distribution of current versions")
ggplot(data = data_clean, aes(x = last_updated)) +
geom_histogram() +
geom_density(aes(y=0.5*..count..)) +
geom_vline(xintercept = median(data_clean$last_updated)) +
xlab("Years") +
ylab("Frequency") +
ggtitle("Raw distribution of years since the last update")
ggplot(data = data_clean, aes(x = log_last_updated)) +
geom_histogram() +
geom_density(aes(y=0.5*..count..)) +
geom_vline(xintercept = median(data_clean$log_last_updated)) +
xlab(TeX("$\\log_{10}(Years)$")) +
ylab("Frequency") +
ggtitle("Log-distribution of years since the last update")
```
\subsubsection{Correlations}
Aside from `log_reviews`, none of the other numeric features have strong correlations with `log_installs`. The high correlation coefficient between `log_reviews` and `log_installs` supports our hypothesis that there is a causal path from `installs` to `reviews`; indeed, consumers will generally only review applications once they have downloaded them. Since one of our goals is to build an efficient model, we would ideally like to see stronger correlations between the independent variables and the dependent variable. There may be latent predictive power in the interactions between variables, however. Notably, we observe that the correlation coefficient between `rating` and `log_installs` increases by a factor of two when we exclude applications with fewer than 100 reviews. This agrees with our assumption that consumer rating bears causal influence on application success, and motivates our choice to remove applications with low review counts. Lastly, the correlation plots do not indicate high levels of collinearity among the numeric variables.
``` {r correlation plots, fig.show = "hold", out.width = "50%", echo = FALSE}
numeric_cols <- c(
"log_installs",
"log_reviews",
"rating",
"log_size",
"is_free",
"log_current_version",
"log_last_updated"
)
corrplot(cor(data_clean[,numeric_cols], use = "complete.obs"),
method = "number",
mar = c(0,0,1,0), # http://stackoverflow.com/a/14754408/54964
title = "All applications")
corrplot(cor(data_clean[data_clean$reviews >= 100, numeric_cols], use = "complete.obs"),
method = "number",
mar = c(0,0,1,0), # http://stackoverflow.com/a/14754408/54964
title = "Applications with at least 100 reviews (25th PCTL)")
```
\subsection{Categorical Variables}
For the categorical features, we aggregate the frequency and mean of `log_installs` by feature sublabel. We use the mean as a measure of central tendency rather than the median given the approximately normal distribution of `log_installs`. To identify which categorical features have the largest dispersion across sublabels, we build a quantile table using the sublabel averages of `log_installs` as the data points for each categorical feature. We exclude sublabels with fewer than 100 applications to reduce noise.
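Conceptually, the per-sublabel aggregation performed by the sourced `eda_calculate_stats_by_group()` helper resembles the following sketch (not evaluated; the real helper may differ):
```r
# Conceptual sketch of the per-sublabel aggregation.
library(dplyr)
data_clean %>%
  group_by(category) %>%
  summarise(frequency = n(),
            mean_log_installs = mean(log_installs, na.rm = TRUE)) %>%
  filter(frequency >= 100) %>%  # exclude sparse sublabels to reduce noise
  arrange(desc(mean_log_installs))
```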
``` {r plots, echo = FALSE, results = "asis"}
categorical_cols <- c(
"category",
"content_rating",
"current_version",
"android_version"
)
# Apply function to all categorical columns
table_long_cat <-
rbindlist(lapply(
categorical_cols,
eda_calculate_stats_by_group,
dt = as.data.table(data_clean)
))
table_quantile_cat <-
rbindlist(
lapply(
categorical_cols,
eda_calculate_stats_by_group,
dt = as.data.table(data_clean),
quantile_table = TRUE
)
)
kable(table_quantile_cat, caption = "Quantile summary table", digits = 4, booktabs = T) %>%
kable_styling(latex_options = "HOLD_position", font_size = 8)
kable(table_long_cat[Variable == "category"], caption = "Application category summary table", digits = 4, booktabs = T) %>%
kable_styling(latex_options = "HOLD_position", font_size = 8)
kable(table_long_cat[Variable == "content_rating"], caption = "Content rating summary table", digits = 4, booktabs = T) %>%
kable_styling(latex_options = "HOLD_position", font_size = 8)
```
From the quantile summary table, all four categorical features show at least a 20% dispersion between their minimum and maximum sublabels. This indicates that these features might possess predictive power as inputs to regression models. Notably, the `category` column has the largest dispersion across groups, implying that it could be highly impactful. We omit the variable `android_version` from further modeling, as it is difficult to interpret.
\par
Overall, this analysis motivates using `category` and `content_rating` as explanatory variables in regression models. To ease interpretability and limit the number of variables, we will use binned and binary versions of these variables, isolating the sublabels with the highest frequencies.
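For reference, the binary flags used in the models below can be derived as in this sketch (the level names are taken from the Kaggle data set and are assumptions; the actual derivation lives in `get_clean_dataset()`):
```r
# Sketch: binary flags for the most frequent sublabels (level names assumed).
data_clean$is_family_category  <- as.integer(data_clean$category == "FAMILY")
data_clean$is_game_category    <- as.integer(data_clean$category == "GAME")
data_clean$is_tools_category   <- as.integer(data_clean$category == "TOOLS")
data_clean$is_content_everyone <- as.integer(data_clean$content_rating == "Everyone")
```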
\section{Statistical Models}
To model the effect of consumer ratings on application success (as understood by the number of downloads an application receives), we will use ordinary least squares regression. Our first model will regress `log_installs`—the logarithmic transform of `installs`—against `rating`. We will then create two more explanatory models to expand on this baseline model, including additional control variables (Section 4.2) and interaction terms. Based on the results of our exploratory data analysis, we have also decided to apply transformations to some of our variables. Our focus is to maximize the predictive power $R^2$ while maintaining models that are easily interpretable. The three models we will test are:
```{r statistical models, echo=FALSE}
model_small <- lm(log_installs ~ 1 + rating, data = data_clean)
model_medium <- lm(log_installs ~ 1 + rating + log_size + log_current_version +
log_last_updated + is_free + is_family_category +
is_game_category + is_tools_category + is_content_everyone,
data = data_clean)
model_large <- lm(log_installs ~ 1 + rating + log_size + log_current_version +
log_last_updated + is_free + is_content_everyone +
rating * is_family_category + rating * is_game_category +
rating * is_tools_category,
data = data_clean)
extract_eq(model_small, wrap = TRUE, terms_per_line = 3, intercept = "beta", operator_location = "start")
extract_eq(model_medium, wrap = TRUE, terms_per_line = 2, intercept = "beta", operator_location = "start")
extract_eq(model_large, wrap = TRUE, terms_per_line = 2, intercept = "beta", operator_location = "start")
```
```{r f-tests to compare models, echo = FALSE}
f1 <- anova(model_small, model_medium)
f2 <- anova(model_medium, model_large)
```
As justification for including additional covariates in the second and third models, we also conduct $F$-tests with the null hypothesis that the smaller model is the correct population model. Comparing the first and second models yields $F = `r signif(f1[2, 5], 3)`$ and $p = `r signif(f1[2, 6], 3)`$, allowing us to reject the null. Comparing the second and third models yields $F = `r signif(f2[2, 5], 3)`$ and $p = `r signif(f2[2, 6], 3)`$, also allowing us to reject the null. Therefore, the inclusion of these covariates is justified.
\section{Results}
In this section, we present the results of the OLS regression models described above. We discuss the statistical significance of the different coefficients, as well as their meaning in a practical sense. The dependent variable for each of these models is `log_installs`, the logarithmic transformation of application install count.
```{r regression results, results = "asis", echo=FALSE}
# Regression results
stargazer(
model_small,
model_medium,
model_large,
header = FALSE,
title = "Results of three regression models",
type = "latex",
se = list(get_robust_se(model_small), get_robust_se(model_medium),
get_robust_se(model_large)),
column.sep.width = "3pt",
font.size = "small"
)
```
A few observations stand out from Table 5. First, all model coefficients are statistically significant at the $0.1$ level except two: `is_family_category` in Model 2 and `is_game_category` in Model 3. These variables are derived from `category`, which denotes the category of an application. The coefficient estimate for `is_family_category` also stands out in that it is close to zero, indicating that the Family category is a poor standalone predictor of application success. In Model 3, the standalone coefficient for `is_game_category` acts as a category-specific intercept shift and is thus less relevant on its own.
Although the explanatory power of Model 3 ($R^2 = 0.251$) is slightly greater than that of Model 2 ($R^2 = 0.247$), we believe that the latter best represents the relationships among the different variables, as it contains fewer extraneous terms. In fact, all coefficients in Model 2 except `is_family_category` have a $p$-value below 0.01. This means that the probability of making a type I error for any of these coefficients (i.e. rejecting the null hypothesis when it is in fact true) is less than 1%, providing strong evidence that these coefficients meaningfully improve our ability to predict application success.
In all three models, we observe that `rating` has a positive influence on `log_installs`. Holding all other variables constant, we predict that a one-unit increase in consumer rating corresponds to a $14.1$% increase in the number of downloads. Notably, our estimate for the `rating` coefficient in Model 2 is lower than in Model 1, because the baseline model suffers from omitted-variable bias. As we discuss in Section 9, accounting for additional omitted variables may drive down our estimates even further.
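This percentage interpretation follows directly from the base-10 logarithmic outcome: a coefficient $\hat{\beta}_{rating}$ implies that a one-unit rating increase multiplies the predicted install count by $10^{\hat{\beta}_{rating}}$. The reported $14.1$% effect therefore corresponds to an implied coefficient of $\log_{10}(1.141) \approx 0.057$, since $10^{0.057} - 1 \approx 0.141$.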
As predicted, free applications are more successful than paid applications, with free applications receiving 141% more downloads on average. We also notice that age influences application success; increasing the current version by 1% increases the number of downloads by 0.769%, and increasing the number of years since the last update by 1% decreases the number of downloads by 1.08%.
Although the models generally agree with our hypotheses, some of our predictions were incorrect. For instance, we expected size to negatively affect application success, with larger applications being installed less frequently. Surprisingly, the opposite is true; increasing download size by 1% leads to a 0.701% increase in the number of downloads. This could be because size is an indicator of production quality. Additionally, Model 3 shows that the interaction between `rating` and `category` may take precedence over the standalone variables. For example, we observe that the relationship between rating and success is stronger for games than for family applications. Whereas a one-unit increase in rating corresponds to a 40.2% increase in downloads for games, the same change corresponds to a 26.0% decrease for family applications.
It is worthwhile to note that none of these models have $R^2$ values that would traditionally be accepted in hard science fields such as physics or molecular biology. However, since our aim is to support the existence of a causal relationship in a social science field, low $R^2$ values may not be problematic. Given the extreme variability between types of applications, the lack of precision in the dependent variable, and other possible external factors, we still find that these models offer a useful description of how consumer rating affects application success.
\section{Model Limitations}
\subsection{Statistical Limitations}
In the following section, we assess the five assumptions of the classical linear model: independent and identically distributed (I.I.D.) data, no perfect collinearity, linear conditional expectations, homoskedastic errors, and normally distributed errors.
\subsubsection{I.I.D.}
According to the Kaggle authors, this data set was collected by randomly scraping the Google Play Store. Since no clusters of applications were specifically targeted, we can reasonably use the entire set of applications on the Google Play Store as our reference population. We recognize that applications likely have some degree of interdependence, especially within categories. For example, the success of one application likely has a negative impact on other applications of the same type. Due to the large size of this data set, however, we expect any dependencies to be negligible. We also have reason to believe that the data are identically distributed, as they are drawn from the same population of applications. One could argue that since the Google Play Store changes over time, the distribution also shifts in response. Because the authors make no specific mention of the time frame across which the data was collected, we will assume that they originate from a cross-sectional snapshot of the Google Play Store and that no shifts in the underlying distribution occurred during the sampling process.
\subsubsection{No Perfect Collinearity}
We can immediately conclude that the variables included in our models are not perfectly collinear, as otherwise the regressions above would have failed. We can also assess near-perfect collinearity for these variables by observing the robust standard errors returned by the regression model. In general, highly collinear features will have large standard errors. Since the standard errors of the coefficients are small relative to their magnitudes, we can reasonably conclude that the predictors are not nearly collinear.
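As a complementary check not performed in the original analysis, variance inflation factors quantify how much each coefficient's variance is inflated by collinearity with the other predictors (a sketch, not evaluated here; assumes the `car` package is available):
```r
# Sketch: variance inflation factors for Model 2; values well below 5
# suggest limited collinearity among the predictors.
library(car)
vif(model_medium)
```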
\subsubsection{Linear Conditional Expectations}
To verify the assumption of linear conditional expectations, we seek to show that there is no relationship between the model residuals and the predictors. That is, the model does not systematically underpredict or overpredict in certain regions of the input space. In the figures below, we plot the relationship between the model residuals and metric-scale predictors. The residuals are generally well-centered around zero, although the model seems to underpredict when `rating` and `log_current_version` are high. The last plot shows the model residuals as a function of the model predictions. Here, the model seems to slightly underpredict in the right region and overpredict in the left region. Despite some inconsistencies, we do not find enough evidence to reject the assumption of linear conditional expectations.
``` {r linear conditional expectations, fig.show = "hold", fig.align = "center", out.width = "30%", echo = FALSE}
# Rating
ggplot(data = data_clean, mapping = aes(x = rating, y = resid(model_medium))) +
geom_point() + stat_smooth()
# Size
ggplot(data = data_clean, mapping = aes(x = log_size, y = resid(model_medium))) +
geom_point() + stat_smooth()
# Current version
ggplot(data = data_clean, mapping = aes(x = log_current_version, y = resid(model_medium))) +
geom_point() + stat_smooth()
# Last updated
ggplot(data = data_clean, mapping = aes(x = log_last_updated, y = resid(model_medium))) +
geom_point() + stat_smooth()
# Model predictions
ggplot(data = data_clean, mapping = aes(x = predict(model_medium), y = resid(model_medium))) +
geom_point() + stat_smooth()
```
\subsubsection{Homoskedastic Errors}
When assessing homoskedastic errors, we seek to determine whether the variance of the model residuals depends on the predictors. If the homoskedasticity assumption is satisfied, we should observe no relationship (i.e. constant error variance across the input space); conversely, if the data are heteroskedastic, the conditional variance will depend on the predictors. The plot below is an eyeball test of homoskedasticity, showing the model residuals as a function of the model predictions. We notice that the spread of the residuals is mostly consistent throughout the input space, although the left-hand side is somewhat narrower. As a more concrete assessment, we also perform a Breusch-Pagan test with the null hypothesis of homoskedastic errors. Since the $p$-value falls below a significance threshold of 0.001, we find enough evidence to reject the null hypothesis. In response to this failed assumption, we report robust standard errors (adjusted for heteroskedasticity) instead of non-adjusted errors.
```{r homoskedastic errors, fig.show = "hold", fig.align = "center", out.width = "50%", echo = FALSE}
plot(model_medium, which = 1)
bptest(model_medium)  # Breusch-Pagan test referenced in the text above
```
\subsubsection{Normally Distributed Errors}
Here, we seek to determine whether the model residuals are normally distributed. Below, we show a histogram and a Q-Q plot of the model residuals. We notice that the left tail is slightly fatter than expected, while the right tail is slightly thinner. Because the residuals otherwise closely follow a normal distribution, we can reasonably accept this assumption and trust the validity of our inference.
```{r error distribution plot, fig.show = "hold", out.width = "50%", echo = FALSE}
hist(resid(model_medium), main = "Distribution of model residuals", breaks = 20)
plot(model_medium, which = 2)
```
\subsection{Structural Limitations}
The true causal diagram is undoubtedly more complex than the one we have outlined in Section 3. We have identified a few omitted variables that could affect our statistical models, shown in the second causal diagram below. We discuss the relationships these omitted variables have with our existing variables and the ways in which they could bias our results.
\subsubsection{Brand Awareness}
Brand awareness measures how memorable and recognizable a brand is to its target audience. In the context of our research question, we understand this variable as the percentage of consumers that can download an application from the Google Play Store and are familiar with the application's brand. We believe brand awareness to be positively correlated with both application success and consumer rating. Greater brand awareness widens the acquisition funnel for applications, leading to more downloads. Organizations that publish highly rated applications likely have more disposable income for marketing and hence better awareness among consumers. Omitting this variable from our models creates a positive bias that pushes the `rating` coefficient away from zero, causing us to overestimate the relationship between consumer rating and application success.
\subsubsection{Application Rankings}
An internal factor that could affect application success is the business logic defined by Google, specifically how the company ranks applications and lets users explore products. We believe this variable can be operationalized through a numeric value representing an application's rank within its respective category. Application rankings should positively influence success, as we expect consumers to discover highly ranked applications more easily. Assuming Google uses rating as a metric for defining an application's rank, ranking and rating will also be positively correlated. As a result, the omitted-variable bias is positive, pushing the coefficient for `rating` further from zero.
\subsubsection{Total Addressable Market}
The total addressable market relates to the innate opportunity contained in an application (measured in revenue or in the number of available consumers), which varies according to its functionality. In general, we believe this variable is positively correlated with application success, as larger markets tend to have greater potential to drive downloads. We do not anticipate a strong association with consumer rating, though we do expect a positive one; the bigger the market, the more crucial it is to build a differentiated product that offers quality features. Omitting this variable also creates a positive bias that inflates the coefficient for `rating` and leads us to overestimate its effect on application success.
![A revised version of the causal diagram including omitted variables.](./pictures/complex_model.PNG){#image2 .class width=50%}
\section{Conclusion}
The goal of this analysis was to investigate the causal factors of application success. Specifically, we sought to assess our hypothesis that consumer rating has a positive influence on application success, as understood by the number of downloads an application receives. Our causal model included a total of five independent variables, though we identified additional omitted variables that could have impacted our results. We used a cross-sectional data set from Kaggle, containing records of 10,000 applications from the Google Play Store in 2019. After exploring, cleaning, and transforming the data, we created three regression models and interpreted their coefficients.
The models we produced confirm that applications with higher consumer ratings are more successful than those with poor ratings. It is important to note, however, that we have not proven the existence of a causal link, but merely provided evidence in favor of one. Whatever the case, it is clear that other factors besides rating also contribute to application success. For instance, we also noticed that free applications, applications with many versions, and large applications perform better than their counterparts. Unfortunately, our models have limited predictive power (adjusted $R^2 = 0.247$), showing that our analysis has room for improvement. Further studies may seek to bring in additional variables in an effort to capture new aspects of the problem. New data could also help mitigate the issues we faced, such as omitted-variable bias, irregularities in the data, and imprecision in the outcome variable. We believe that this study is valuable to application developers, as it identifies key variables that may influence success. Our findings may also be useful to Google and other moderators of mobile applications, who share an interest in understanding the dynamics of consumer environments.
\section{Appendix}
1. Lab 2 [Repository](https://github.com/orenscarmeli/w203-project2-oren-romain-oleg-sam)