-
Notifications
You must be signed in to change notification settings - Fork 493
/
91-appendixA.Rmd
executable file
·270 lines (192 loc) · 13 KB
/
91-appendixA.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
\cleardoublepage
```{r echo=FALSE, results='asis'}
if(!is_latex_output()){
cat("# (APPENDIX) Appendix {-}")
} else {
cat("\\appendix")
}
```
# Statistical Background {#appendixA}
```{r setup_appA, include=FALSE}
chap <- "A"
lc <- 0
rq <- 0
# **`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`**
# **`r paste0("(RQ", chap, ".", (rq <- rq + 1), ")")`**
opts_chunk$set(
tidy = FALSE,
out.width = '\\textwidth',
fig.height = 4,
warning = FALSE
)
options(scipen = 99, digits = 3)
# Set random number generator see value for replicable pseudorandomness.
set.seed(76)
```
```{r, message=FALSE, warning=FALSE, echo=FALSE}
# Packages needed internally, but not in text.
library(scales)
library(tidyverse)
library(ggrepel)
```
## Basic statistical terms {#appendix-stat-terms}
Note that all the following statistical terms apply only to *numerical* variables, except the *distribution* which can exist for both numerical and categorical variables.
### Mean
The *mean* is the most commonly reported measure of center. It is commonly called the *average* though this term can be a little ambiguous. The mean is the sum of all of the data elements divided by how many elements there are. If we have $n$ data points, the mean is given by:
$$Mean = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
### Median
The median is calculated by first sorting a variable's data from smallest to largest. After sorting the data, the middle element in the list is the *median*. If the middle falls between two values, then the median is the mean of those two middle values.
### Standard deviation and variance {#appendix-sd-variance}
We will next discuss the *standard deviation* ($sd$) of a variable. The formula can be a little intimidating at first but it is important to remember that it is essentially a measure of how far we expect a given data value will be from its mean:
$$sd = \sqrt{\frac{(x_1 - Mean)^2 + (x_2 - Mean)^2 + \cdots + (x_n - Mean)^2}{n - 1}}$$
The *variance* of a variable is merely the standard deviation squared.
$$variance = sd^2 = \frac{(x_1 - Mean)^2 + (x_2 - Mean)^2 + \cdots + (x_n - Mean)^2}{n - 1}$$
### Five-number summary
The *five-number summary* consists of five summary statistics: the minimum, the first quantile AKA 25th percentile, the second quantile AKA median or 50th percentile, the third quantile AKA 75th, and the maximum. The five-number summary of a variable is used when constructing boxplots, as seen in Section \@ref(boxplots).
The quantiles are calculated as
- first quantile ($Q_1$): the median of the first half of the sorted data
- third quantile ($Q_3$): the median of the second half of the sorted data
The *interquartile range (IQR)*\index{interquartile range (IQR)} is defined as $Q_3 - Q_1$ and is a measure of how spread out the middle 50% of values are. The IQR corresponds to the length of the box in a boxplot.
The median and the IQR are not influenced by the presence of outliers in the ways that the mean and standard deviation are. They are, thus, recommended for skewed datasets. We say in this case that the median and IQR are more *robust to outliers*.
### Distribution
The *distribution* of a variable shows how frequently different values of a variable occur. Looking at the visualization of a distribution can show where the values are centered, show how the values vary, and give some information about where a typical value might fall. It can also alert you to the presence of outliers.
Recall from Chapter \@ref(viz) that we can visualize the distribution of a numerical variable using binning in a histogram and that we can visualize the distribution of a categorical variable using a barplot.
### Outliers
*Outliers* correspond to values in the dataset that fall far outside the range of "ordinary" values. In the context of a boxplot, by default they correspond to values below $Q_1 - (1.5 \cdot IQR)$ or above $Q_3 + (1.5 \cdot IQR)$.
## Normal distribution {#appendix-normal-curve}
Let's next discuss one particular kind of distribution: \index{distribution!normal} *normal distributions*. Such bell-shaped distributions are defined by two values: (1) the *mean* $\mu$ ("mu") which locates the center of the distribution and (2) the *standard deviation* $\sigma$ ("sigma") which determines the variation of the distribution. In Figure \@ref(fig:normal-curves), we plot three normal distributions where:
1. The solid normal curve has mean $\mu = 5$ \& standard deviation $\sigma = 2$.
1. The dotted normal curve has mean $\mu = 5$ \& standard deviation $\sigma = 5$.
1. The dashed normal curve has mean $\mu = 15$ \& standard deviation $\sigma = 2$.
```{r normal-curves, echo=FALSE, fig.cap="Three normal distributions.", purl=FALSE, out.width="90%"}
all_points <- tibble(
domain = seq(from = -10, to = 25, by = 0.01),
`mu = 5, sigma = 2` = dnorm(x = domain, mean = 5, sd = 2),
`mu = 5, sigma = 5` = dnorm(x = domain, mean = 5, sd = 5),
`mu = 15, sigma = 2` = dnorm(x = domain, mean = 15, sd = 2)
) %>%
gather(key = "Distribution", value = "value", - domain) %>%
mutate(
Distribution = factor(
Distribution,
levels = c("mu = 5, sigma = 2",
"mu = 5, sigma = 5",
"mu = 15, sigma = 2")
)
)
for_labels <- all_points %>%
filter(between(domain, 3.795, 3.805) & Distribution == "mu = 5, sigma = 2" |
between(domain, 0.005, 0.0105) & Distribution == "mu = 5, sigma = 5" |
between(domain, 16.005, 16.015) & Distribution == "mu = 15, sigma = 2")
all_points %>%
ggplot(aes(x = domain, y = value, linetype = Distribution)) +
geom_line() +
geom_label_repel(data = for_labels, aes(label = Distribution),
nudge_x = c(-1, -2.1, 1)) +
theme_light() +
scale_linetype_manual(values=c("solid", "dotted", "longdash")) +
theme(
axis.title.y = element_blank(),
axis.title.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
legend.position = "none"
)
```
Notice how the solid and dotted line normal curves have the same center due to their common mean $\mu$ = 5. However, the dotted line normal curve is wider due to its larger standard deviation of $\sigma$ = 5. On the other hand, the solid and dashed line normal curves have the same variation due to their common standard deviation $\sigma$ = 2. However, they are centered at different locations.
When the mean $\mu$ = 0 and the standard deviation $\sigma$ = 1, the normal distribution has a special name. It's called the *standard normal distribution* or the *$z$-curve*\index{distribution!standard normal}.
Furthermore, if a variable follows a normal curve, there are *three rules of thumb* we can use:
1. 68% of values will lie within $\pm$ 1 standard deviation of the mean.
1. 95% of values will lie within $\pm$ 1.96 $\approx$ 2 standard deviations of the mean.
1. 99.7% of values will lie within $\pm$ 3 standard deviations of the mean.
Let's illustrate this on a standard normal curve in Figure \@ref(fig:normal-rule-of-thumb). The dashed lines are at -3, -1.96, -1, 0, 1, 1.96, and 3. These 7 lines cut up the x-axis into 8 segments. The areas under the normal curve for each of the 8 segments are marked and add up to 100%. For example:
1. The middle two segments represent the interval -1 to 1. The shaded area above this interval represents 34% + 34% = 68% of the area under the curve. In other words, 68% of values.
1. The middle four segments represent the interval -1.96 to 1.96. The shaded area above this interval represents 13.5% + 34% + 34% + 13.5% = 95% of the area under the curve. In other words, 95% of values.
1. The middle six segments represent the interval -3 to 3. The shaded area above this interval represents 2.35% + 13.5% + 34% + 34% + 13.5% + 2.35% = 99.7% of the area under the curve. In other words, 99.7% of values.
```{r normal-rule-of-thumb, echo=FALSE, fig.cap="Rules of thumb about areas under normal curves.", purl=FALSE, out.width = "80%"}
shade_3_sd <- function(x) {
y <- dnorm(x, mean = 0, sd = 1)
y[x <= -3 | x >= 3] <- NA
return(y)
}
shade_2_sd <- function(x) {
y <- dnorm(x, mean = 0, sd = 1)
y[x <= -1.96 | x >= 1.96] <- NA
return(y)
}
shade_1_sd <- function(x) {
y <- dnorm(x, mean = 0, sd = 1)
y[x <= -1 | x >= 1] <- NA
return(y)
}
labels <- tibble(
x = c(-3.5, -2.5, -1.5, -0.5, 0.5, 1.5, 2.5, 3.5),
label = c("0.15%", "2.35%", "13.5%", "34%", "34%", "13.5%", "2.35%", "0.15%")
) %>%
mutate(y = rep(0.3, times = n()))
ggplot(data = tibble(x = c(-4, 4)), aes(x)) +
geom_text(data = labels, aes(y=y, label = label)) +
# Trace normal curve
stat_function(fun = dnorm, args = list(mean = 0, sd = 1), n = 1000) +
# Shade and delineate +/- 3 SD
stat_function(fun = shade_3_sd, geom = "area", fill = "black", alpha = 0.25, n = 1000) +
# annotate(geom = "segment", x = c(3, -3), xend = c(3, -3), y = 0, yend = dnorm(3, mean = 0, sd = 1)) +
# Shade and delineate +/- 2 SD
stat_function(fun = shade_2_sd, geom = "area", fill = "black", alpha = 0.25, n = 1000) +
# annotate(geom = "segment", x = c(1.96, -1.96), xend = c(1.96, -1.96), y = 0, yend = dnorm(1.96, mean = 0, sd = 1)) +
# Shade and delineate +/- 1 SD
stat_function(fun = shade_1_sd, geom = "area", fill = "black", alpha = 0.25, n = 1000) +
# annotate(geom = "segment", x = c(1, -1), xend = c(1, -1), y = 0, yend = dnorm(1, mean = 0, sd = 1)) +
geom_vline(xintercept = c(-3, -1.96, -1, 0, 1, 1.96, 3), linetype = "dashed", alpha = 0.5) +
# Axes
scale_x_continuous(breaks = seq(from = -3, to = 3, by = 1)) +
labs(x = "z", y = "") +
theme(axis.title.y = element_blank(), axis.text.y = element_blank(), axis.ticks.y = element_blank())
```
\vspace{0.1in}
```{block, type='learncheck', purl=FALSE}
\vspace{-0.15in}
**_Learning check_**
\vspace{-0.1in}
```
<!--
v2 TODO: Consider LC using this later on: <https://gallery.shinyapps.io/dist_calc/>
-->
Say you have a normal distribution with mean $\mu = 6$ and standard deviation $\sigma = 3$.
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What proportion of the area under the normal curve is less than 3? Greater than 12? Between 0 and 12?
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What is the 2.5th percentile of the area under the normal curve? The 97.5th percentile? The 100th percentile?
```{block, type='learncheck', purl=FALSE}
\vspace{-0.25in}
\vspace{-0.25in}
```
\vspace{0.1in}
## log10 transformations {#appendix-log10-transformations}
At its simplest, log10 transformations return base 10 *logarithms*. For example, since $1000 = 10^3$, running `log10(1000)` returns `3` in R. To undo a log10 transformation, we raise 10 to this value. For example, to undo the previous log10 transformation and return the original value of 1000, we raise 10 to the power of 3 by running `10^(3) = 1000` in R. \index{log transformations}
Log transformations allow us to focus on changes in *orders of magnitude*. In other words, they allow us to focus on *multiplicative changes* instead of *additive ones*. Let's illustrate this idea in Table \@ref(tab:logten) with examples of prices of consumer goods in 2019 US dollars.
<!--
v2 TODO: Add text like:
We can also frame such changes as being relative percentage increases/decreases
instead of absolute increases/decreases.
-->
```{r logten, echo=FALSE}
tibble(Price = c(1,10,100,1000,10000,100000,1000000)) %>%
mutate(
`log10(Price)` = log10(Price),
Price = dollar(Price),
`Order of magnitude` = c("Singles", "Tens", "Hundreds", "Thousands", "Tens of thousands", "Hundreds of thousands", "Millions"),
`Examples` = c("Cups of coffee", "Books", "Mobile phones", "High definition TVs", "Cars", "Luxury cars and houses", "Luxury houses")
) %>%
kable(
digits = 3,
caption = "log10 transformed prices, orders of magnitude, and examples",
booktabs = TRUE,
linesep = ""
) %>%
kable_styling(font_size = ifelse(is_latex_output(), 10, 16),
latex_options = c("hold_position"))
```
Let's make some remarks about log10 transformations based on Table \@ref(tab:logten):
1. When purchasing a cup of coffee, we tend to think of prices ranging in single dollars, such as \$2 or \$3. However, when purchasing a mobile phone, we don't tend to think of their prices in units of single dollars such as \$313 or \$727. Instead, we tend to think of their prices in units of hundreds of dollars like \$300 or \$700. Thus, cups of coffee and mobile phones are of different *orders of magnitude* in price.
1. Let's say we want to know the log10 transformed value of \$76. This would be hard to compute exactly without a calculator. However, since \$76 is between \$10 and \$100 and since log10(10) = 1 and log10(100) = 2, we know log10(76) will be between 1 and 2. In fact, log10(76) is 1.880814.
1. log10 transformations are *monotonic*, meaning they preserve orders. So if Price A is lower than Price B, then log10(Price A) will also be lower than log10(Price B).
1. Most importantly, increments of one in log10-scale correspond to *relative multiplicative changes* in the original scale and not *absolute additive changes*. For example, increasing a log10(Price) from 3 to 4 corresponds to a multiplicative increase by a factor of 10: \$100 to \$1000.