forked from rdpeng/rprogdatascience
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathsimulation.Rmd
235 lines (159 loc) · 7.59 KB
/
simulation.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
# Simulation
```{r,echo=FALSE}
knitr::opts_chunk$set(comment = NA, prompt = TRUE, collapse = TRUE, fig.path = "images/")
set.seed(10)
```
## Generating Random Numbers
[Watch a video of this section](https://youtu.be/tzz4flrajr0)
Simulation is an important (and big) topic for both statistics and for a variety of other areas where there is a need to introduce randomness. Sometimes you want to implement a statistical procedure that requires random number generation or sampling (i.e. Markov chain Monte Carlo, the bootstrap, random forests, bagging) and sometimes you want to simulate a system and random number generators can be used to model random inputs.
R comes with a set of pseuodo-random number generators that allow you to simulate from well-known probability distributions like the Normal, Poisson, and binomial. Some example functions for probability distributions in R
- `rnorm`: generate random Normal variates with a given mean and standard deviation
- `dnorm`: evaluate the Normal probability density (with a given mean/SD) at a point (or vector of points)
- `pnorm`: evaluate the cumulative distribution function for a Normal distribution
- `rpois`: generate random Poisson variates with a given rate
For each probability distribution there are typically four functions available that start with a "r", "d", "p", and "q". The "r" function is the one that actually simulates randon numbers from that distribution. The other functions are prefixed with a
- `d` for density
- `r` for random number generation
- `p` for cumulative distribution
- `q` for quantile function (inverse cumulative distribution)
If you're only interested in simulating random numbers, then you will likely only need the "r" functions and not the others. However, if you intend to simulate from arbitrary probability distributions using something like rejection sampling, then you will need the other functions too.
Probably the most common probability distribution to work with the is the Normal distribution (also known as the Gaussian). Working with the Normal distributions requires using these four functions
```r
dnorm(x, mean = 0, sd = 1, log = FALSE)
pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
rnorm(n, mean = 0, sd = 1)
```
Here we simulate standard Normal random numbers with mean 0 and standard deviation 1.
```{r}
## Simulate standard Normal random numbers
x <- rnorm(10)
x
```
We can modify the default parameters to simulate numbers with mean 20 and standard deviation 2.
```{r}
x <- rnorm(10, 20, 2)
x
summary(x)
```
If you wanted to know what was the probability of a random Normal variable of being less than, say, 2, you could use the `pnorm()` function to do that calculation.
```{r}
pnorm(2)
```
You never know when that calculation will come in handy.
## Setting the random number seed
When simulating any random numbers it is essential to set the *random number seed*. Setting the random number seed with `set.seed()` ensures reproducibility of the sequence of random numbers.
For example, I can generate 5 Normal random numbers with `rnorm()`.
```{r}
set.seed(1)
rnorm(5)
```
Note that if I call `rnorm()` again I will of course get a different set of 5 random numbers.
```{r}
rnorm(5)
```
If I want to reproduce the original set of random numbers, I can just reset the seed with `set.seed()`.
```{r}
set.seed(1)
rnorm(5) ## Same as before
```
In general, you should **always set the random number seed when conducting a simulation!** Otherwise, you will not be able to reconstruct the exact numbers that you produced in an analysis.
It is possible to generate random numbers from other probability distributions like the Poisson. The Poisson distribution is commonly used to model data that come in the form of counts.
```{r}
rpois(10, 1) ## Counts with a mean of 1
rpois(10, 2) ## Counts with a mean of 2
rpois(10, 20) ## Counts with a mean of 20
```
## Simulating a Linear Model
[Watch a video of this section](https://youtu.be/p7kSSSsv4ms)
Simulating random numbers is useful but sometimes we want to simulate values that come from a specific *model*. For that we need to specify the model and then simulate from it using the functions described above.
Suppose we want to simulate from the following linear model
\[
y = \beta_0 + \beta_1 x + \varepsilon
\]
where $\varepsilon\sim\mathcal{N}(0,2^2)$. Assume $x\sim\mathcal{N}(0,1^2)$, $\beta_0=0.5$ and $\beta_1=2$. The variable `x` might represent an important predictor of the outcome `y`. Here's how we could do that in R.
```{r}
## Always set your seed!
set.seed(20)
## Simulate predictor variable
x <- rnorm(100)
## Simulate the error term
e <- rnorm(100, 0, 2)
## Compute the outcome via the model
y <- 0.5 + 2 * x + e
summary(y)
```
We can plot the results of the model simulation.
```{r Linear Model}
plot(x, y)
```
What if we wanted to simulate a predictor variable `x` that is binary instead of having a Normal distribution. We can use the `rbinom()` function to simulate binary random variables.
```{r}
set.seed(10)
x <- rbinom(100, 1, 0.5)
str(x) ## 'x' is now 0s and 1s
```
Then we can procede with the rest of the model as before.
```{r Linear Model Binary}
e <- rnorm(100, 0, 2)
y <- 0.5 + 2 * x + e
plot(x, y)
```
We can also simulate from *generalized linear model* where the errors are no longer from a Normal distribution but come from some other distribution. For examples, suppose we want to simulate from a Poisson log-linear model where
\[
Y \sim Poisson(\mu)
\]
\[
\log \mu = \beta_0 + \beta_1 x
\]
and $\beta_0=0.5$ and $\beta_1=0.3$. We need to use the `rpois()` function for this
```{r}
set.seed(1)
## Simulate the predictor variable as before
x <- rnorm(100)
```
Now we need to compute the log mean of the model and then exponentiate it to get the mean to pass to `rpois()`.
```{r Poisson Log-Linear Model}
log.mu <- 0.5 + 0.3 * x
y <- rpois(100, exp(log.mu))
summary(y)
plot(x, y)
```
You can build arbitrarily complex models like this by simulating more predictors or making transformations of those predictors (e.g. squaring, log transformations, etc.).
## Random Sampling
[Watch a video of this section](https://youtu.be/-7GA10KWDJg)
The `sample()` function draws randomly from a specified set of (scalar) objects allowing you to sample from arbitrary distributions of numbers.
```{r}
set.seed(1)
sample(1:10, 4)
sample(1:10, 4)
## Doesn't have to be numbers
sample(letters, 5)
## Do a random permutation
sample(1:10)
sample(1:10)
## Sample w/replacement
sample(1:10, replace = TRUE)
```
To sample more complicated things, such as rows from a data frame or a list, you can sample the indices into an object rather than the elements of the object itself.
Here's how you can sample rows from a data frame.
```{r}
library(datasets)
data(airquality)
head(airquality)
```
Now we just need to create the index vector indexing the rows of the data frame and sample directly from that index vector.
```{r}
set.seed(20)
## Create index vector
idx <- seq_len(nrow(airquality))
## Sample from the index vector
samp <- sample(idx, 6)
airquality[samp, ]
```
Other more complex objects can be sampled in this way, as long as there's a way to index the sub-elements of the object.
## Summary
- Drawing samples from specific probability distributions can be done with "r" functions
- Standard distributions are built in: Normal, Poisson, Binomial, Exponential, Gamma, etc.
- The `sample()` function can be used to draw random samples from arbitrary vectors
- Setting the random number generator seed via `set.seed()` is critical for reproducibility