---
title: "Measuring Causal Impact with GA Data - Evaluating the Effects of COVID19 on Hospital Appointments"
output: html_notebook
---
```{r setup, message=FALSE,warning=FALSE,echo=FALSE}
library(CausalImpact)
library(googleAnalyticsR)
library(googleAuthR)
library(tidyverse)
library(zoo)
library(bsts)
library(gt)
gar_auth(email = Sys.getenv("OAUTH_EMAIL")) # Actual email hidden to protect the account
view_id <- Sys.getenv("CAUSAL_INF_VIEW_ID") # Actual GA view ID hidden to protect the account
### References
# BSTS http://www.unofficialgoogledatascience.com/2017/07/fitting-bayesian-structural-time-series.html
# CausalImpact https://google.github.io/CausalImpact/CausalImpact.html
```
# Overview
The [CausalImpact](https://google.github.io/CausalImpact/CausalImpact.html) library measures the effect of an event on a response variable when establishing a traditional control group through a randomized trial is not a viable option. It does this by constructing a 'synthetic control' that serves as a baseline against which the actual data are compared.
In this tutorial, we'll look at the effect that the coronavirus outbreak had on the number of "Make an Appointment" forms completed on a hospital website. To begin, we must establish a "pre-period" before the event occurred and a "post-period" after it. The pre-period is used to train a Bayesian Structural Time Series (BSTS) model. In the post-period, the model is used to predict our synthetic control, which indicates how the outcome might have behaved had the event not occurred.
Our pre-period will be 10/1/2019 to 3/15/2020 and our post-period will be 3/16/2020 to 5/4/2020. Our predictor variables will be the number of sessions from organic, social, and referral sources. An important assumption made by the CausalImpact library is that our predictors are *not* affected by the event.
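For reference, those date ranges could be written down as `Date` vectors up front (a minimal sketch; the variable names below are illustrative, and the analysis that follows locates the periods by row index instead):
```{r eval=FALSE}
# Illustrative only: the same pre- and post-periods expressed as Date vectors
pre_period_dates  <- as.Date(c("2019-10-01", "2020-03-15"))
post_period_dates <- as.Date(c("2020-03-16", "2020-05-04"))
```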
# Gathering Data from Google Analytics
First, we must gather the data necessary for our analysis. Our response variable, as established earlier, will be "Make an Appointment" form completions, which corresponds to the goal1Completions metric in GA. Our predictor variables will come from the channelGrouping dimension in GA.
We know that the hospital suspended paid media around the time of the outbreak, so we'll remove traffic from paid sources using the following filter:
```{r eval=FALSE}
channel_filter <- dim_filter(dimension = "channelGrouping", operator = "REGEXP", expressions = "Paid Search|Display", not = TRUE)
```
We call the Google Analytics Reporting API twice. First, we gather the goal completion data:
```{r eval=FALSE}
# Gather goal data
df_goals <- google_analytics(viewId = view_id,
                             date_range = date_range,
                             metrics = "goal1Completions",
                             dimensions = c("date"),
                             dim_filters = my_filter_clause,
                             max = -1)
```
Then we gather the channel session data:
```{r eval=FALSE}
df_sessions <- google_analytics(viewId = view_id,
                                date_range = date_range,
                                metrics = c("sessions"),
                                dimensions = c("date", "channelGrouping"),
                                max = -1,
                                dim_filters = my_filter_clause)
```
Querying the goals separately means we don't have to re-aggregate them after pivoting the session data; pivoting turns our single channelGrouping column into one column of sessions per channel. Putting this all together is shown below.
```{r message=FALSE, warning=FALSE}
date_range <- c("2019-10-01", "2020-05-04")
# Remove paid traffic
channel_filter <- dim_filter(dimension = "channelGrouping", operator = "REGEXP", expressions = "Paid Search|Display", not = TRUE)
my_filter_clause <- filter_clause_ga4(list(channel_filter))
# Gather goal data
df_goals <- google_analytics(viewId = view_id,
                             date_range = date_range,
                             metrics = "goal1Completions",
                             dimensions = c("date"),
                             dim_filters = my_filter_clause,
                             max = -1)
# Gather session data
df_sessions <- google_analytics(viewId = view_id,
                                date_range = date_range,
                                metrics = c("sessions"),
                                dimensions = c("date", "channelGrouping"),
                                max = -1,
                                dim_filters = my_filter_clause) %>%
  pivot_wider(id_cols = date, names_from = channelGrouping, values_from = sessions) %>%
  mutate_at(vars(-date), ~ if_else(is.na(.), 0, .))
# Merge the goal completion data into the sessions data
df <- df_sessions %>% mutate(y = df_goals$goal1Completions)
```
```{r}
head(df) %>% gt()
```
# Create BSTS Model
The following code creates a Bayesian Structural Time Series model that will be used by the CausalImpact library to generate our synthetic control. It's here that we input our pre-period and post-period as well as our predictor and response variables.
The BSTS package has several options for specifying the model. Here, we add a "local level" component, which captures the underlying level of the response variable, and a weekly (7-day) seasonal component using `AddSeasonal()`.
```{r warning=FALSE,message=FALSE}
df2 <- df # Copy our data frame so we can remove the response data from the prediction period
# Assign pre- and post-periods
pre.period <- c(1, which(df$date == "2020-03-15"))
post.period <- c(which(df$date == "2020-03-15") + 1, length(df$date))
post.period.response <- df$y[post.period[1]:post.period[2]]
# Remove outcomes from the post-period. The BSTS model should be ignorant of the values we intend to predict
df2$y[post.period[1]:post.period[2]] <- NA
# Create a zoo object, which adds dates to the plot output
df_zoo <- read.zoo(df2, format = "%Y-%m-%d")
# Add local level and seasonal components
ss <- AddLocalLevel(list(), df_zoo$y)
ss <- AddSeasonal(ss, df_zoo$y, nseasons = 7) # weekly seasonality
bsts.model <- bsts(y ~ ., ss, niter = 1000, data = df_zoo, family = "gaussian", ping = 0)
plot(bsts.model)
```
The blue dots are the actual data points and the black line underneath is the model's fit (the posterior mean of the predicted values). We can see that the model does a reasonable job of predicting form completions, though there are some outliers in late February that are not well predicted. This increases the uncertainty in our predictions and thus widens our confidence interval (the shading around the black line).
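If you want to dig further into what the model learned, the `bsts` plot method can also show the individual state components and the one-step-ahead prediction errors. This is an optional diagnostic, not part of the original analysis:
```{r eval=FALSE}
# Optional diagnostics on the fitted BSTS model:
# "components" plots the local level and seasonal contributions separately;
# "residuals" shows the one-step-ahead prediction errors
plot(bsts.model, "components")
plot(bsts.model, "residuals")
```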
# Generate CausalImpact Analysis
Now that we have our model, we can compare our prediction to what actually happened and measure the impact of the event.
```{r}
impact <- CausalImpact(bsts.model = bsts.model,
                       post.period.response = post.period.response)
plot(impact)
```
The top plot shows the actual data in black and the predicted distribution of the response variable in blue, with the median value as a dashed blue line. The second plot subtracts the predicted data from the actual data to show the difference between the two. If the event had no impact, we would expect these pointwise estimates to hover around 0. The last plot shows the cumulative impact of the event over time. Notice how our confidence interval (shown in blue) widens as time goes on.
Our causal impact model confirms a decrease in the number of form completions; however, the 95% confidence interval quickly comes to include 0, which means we cannot say with certainty that the impact extends into April.
While we weren't able to find conclusive results, being able to quantify our uncertainty is a major benefit of Bayesian models such as this one.
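Those estimates can also be pulled out programmatically. As a quick sketch, the fitted object's summary table holds the average and cumulative effect estimates along with their intervals:
```{r eval=FALSE}
# One row for the average effect over the post-period, one for the cumulative effect
impact$summary
# For example, the estimated relative effect and its interval bounds
impact$summary[, c("RelEffect", "RelEffect.lower", "RelEffect.upper")]
```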
# Causal Impact Report
One nice feature of the CausalImpact library is that it provides a human-friendly read-out of the results, shown below.
```{r}
summary(impact, "report")
```
# Validating Our Synthetic Control
One method of validating the model is to run the same analysis on a placebo period that ends *before* the event occurred. If our model is well behaved, we should see little difference between the predicted and actual response data.
```{r}
# Filter to include only pre-event data. Also reorder columns to place y after the date
df_compare <- df %>% filter(date < "2020-02-15") %>% select(date,last_col(),2:length(df))
df_zoo <- read.zoo(df_compare, format = "%Y-%m-%d")
pre.period <- c(index(df_zoo)[1],index(df_zoo)[which(df_compare$date == "2020-01-15")])
post.period <- c(index(df_zoo)[which(df_compare$date == "2020-01-15")+1],index(df_zoo)[length(df_compare$date)])
impact <- CausalImpact(df_zoo, pre.period, post.period)
plot(impact)
```
Above we see that the model doesn't do a great job of predicting the upper spikes in form completions, which likely explains the wide confidence interval seen earlier.
# Comparison to the Naive Approach
Deploying advanced modeling techniques is only useful if they offer advantages over much simpler ones. The naive method would be to use our pre-intervention data to establish an average and carry that average forward into the post-period as the synthetic control.
Before the event, we had about 19 form fills a day. After, we had 8.5 a day. That's a decrease of about 52%. CausalImpact estimated a decrease of 44% with a 95% confidence interval of 29%-63%. Were these numbers substantially different, and assuming we had confidence in our model, we would prefer the figures generated by CausalImpact.
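As a rough sketch of that naive calculation (reusing the `df` data frame built earlier; the cut-off date mirrors the pre/post split above):
```{r eval=FALSE}
# Naive comparison: average daily form completions before and after the event
pre_mean  <- mean(df$y[df$date <= as.Date("2020-03-15")])
post_mean <- mean(df$y[df$date >  as.Date("2020-03-15")])
naive_change <- (post_mean - pre_mean) / pre_mean  # roughly -52% for this data
c(pre_mean = pre_mean, post_mean = post_mean, naive_change = naive_change)
```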
There are some clear cases when modeling will outperform the naive approach described above:
1. If there is a trend in the response variable, then averaging the pre-period will not capture the continuation of that trend (see the sketch after this list).
2. If evaluating the degree of confidence is important, the CausalImpact model is preferable due to its ability to measure uncertainty.
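For the first case, one option (not used in this tutorial) is to replace the local level with a local linear trend component in the BSTS model, which explicitly models a slope. A minimal sketch, assuming a zoo series like the `df_zoo` object created in the modeling section:
```{r eval=FALSE}
# Hypothetical alternative state specification: a local linear trend captures
# an ongoing drift that a flat pre-period average would miss
ss_trend <- AddLocalLinearTrend(list(), df_zoo$y)
ss_trend <- AddSeasonal(ss_trend, df_zoo$y, nseasons = 7)
```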