---
title: "Model selection and model averaging"
author: "James Santangelo"
date: '2018-10-03'
output: html_document
---
## Lesson preamble
> ### Learning objectives
>
> - Learn the benefits of model selection and how it differs from traditional inferential statistics
> - Understand the use of AIC and AIC~c~
> - Use AIC~c~ to perform model selection on the RIKZ data
> - Use AIC~c~ to perform model selection and model averaging on a more complicated ecological dataset
>
> ### Lesson outline
>
> Total lesson time: 2 hours
>
> - Brief intro to model selection (10 min)
> - Understanding AIC and AIC~c~ (20 mins)
> - Model selection of RIKZ dataset (30 mins)
> - Model selection and model averaging of more complicated ecological data (60 mins)
>
> ### Setup
>
> - `install.packages('dplyr')` (or `tidyverse`)
> - `install.packages('ggplot2')` (or `tidyverse`)
> - `install.packages("lme4")`
> - `install.packages("lmerTest")`
> - `install.packages("MuMIn")`
```{r include = FALSE}
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
```
## Introduction to model selection
Up to now, when faced with a biological question, we have formulated a null
hypothesis, generated a model to test that hypothesis, summarized the model to
get the value of a test statistic (e.g. _t_-statistic, _F_-value, etc.), and
rejected the null hypothesis when a test statistic at least that extreme would
occur with some arbitrarily low probability under the null distribution (e.g.
_P_ < 0.05). This low probability allows us to reject the null hypothesis in
favour of the more biologically interesting alternative hypothesis. This is an
**inferential** or **frequentist** approach to statistics.
An alternative approach is to simultaneously test multiple competing hypotheses,
with each hypothesis being represented in a separate model. This is what model
selection allows, and it is increasingly being used in ecology and evolutionary
biology. It has a number of advantages:
1. It does not rely on a single model.
2. Models can be ranked and weighted according to their fit to the observed data.
3. The best-supported models can be averaged to obtain parameter estimates.
The most challenging part of model selection is coming up with a series of
hypothesis-driven models that adequately capture the processes and patterns
you are interested in representing. In an ideal world, this would be based on
detailed knowledge of the system you are working in and on prior work or
literature reviews. However, detailed knowledge is often unavailable for many
ecological systems and alternative approaches exist. For example, we can compare
models with all possible combinations of the predictors of interest (AKA the
**all-subset** approach) rather than constructing models with only particular
combinations of those predictors. Note this approach has been criticized as
"data-dredging" or "fishing" (Burnham and Anderson 2002, but see Symonds and
Moussalli 2011), a criticism nicely captured by this often-quoted line from
Burnham and Anderson (2002):
_"“Let the computer find out” is a poor strategy and usually reflects the fact that the researcher did not bother to think clearly about the problem of interest and its scientific setting (Burnham and Anderson, 2002)."_
The next step is to decide how we select the "best" model or set of best models.
One approach would be to use a measure of model fit or explanatory power. For
example, we could use the model R^2^, which we covered in an earlier lecture and
represents the amount of variation in our response variable that is explained by
the predictor variables in the model. However, this is not a parsimonious
solution since it inevitably favours more complex models (i.e. models with more
predictors). Thus, what we really need is an approach that examines model fit
while simultaneously penalizing model complexity.
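To see why raw R^2^ is a poor selection criterion, consider this minimal sketch
with made-up data: adding a predictor that is pure noise can never decrease
R^2^, so R^2^ alone will always favour the more complex model.
```{r}
# Toy illustration: R^2 never decreases when we add a predictor,
# even one that is pure noise.
set.seed(1)
toy <- data.frame(y = rnorm(50), x1 = rnorm(50), junk = rnorm(50))
summary(lm(y ~ x1, data = toy))$r.squared
summary(lm(y ~ x1 + junk, data = toy))$r.squared  # slightly higher by chance alone
```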
## Information Theory and model selection criteria
There are numerous model selection criteria based on mathematical information
theory that we can use to select models from among a set of candidate models.
They additionally allow the relative weights of different models to be compared
and allow multiple models to be used for inferences. The most commonly used
information criteria in ecology and evolution are: Akaike's Information
Criterion (AIC), the corrected AIC~c~ (corrected for small sample sizes), and
the Bayesian Information Criterion (BIC, also known as the Schwarz Criterion)
(Johnson and Omland, 2004). Here we will focus on AIC and AIC~c~. AIC is
calculated as follows:
$$ AIC = -2Log\mathcal{L} \ + \ 2p $$
The lower the AIC value, the better the model. $-2Log\mathcal{L}$ is twice the
**negative log-likelihood** of the model and measures the model's fit (or lack
thereof) to the observed data: **lower** values indicate a better fit of the
model to the observed data. $2p$ is a _bias-correcting factor_ that penalizes
the model's AIC based on the number of parameters ($p$) in the model.
Similar to AIC is AIC~c~, which corrects for small sample sizes. It is
recommended to use AIC~c~ when $n/k$ is less than 40, with $n$ being the sample
size (i.e. total number of observations) and $k$ being the **total** number of
parameters in the most saturated model (i.e. the model with the most
parameters), including both fixed and random effects, as well as intercepts
(Symonds and Moussalli 2011). AIC~c~ is calculated as follows:
$$ AIC_c = AIC+\frac{2k(k+1)}{n-k-1} $$
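As a quick check of these formulas, here is a minimal sketch (fitting a toy
model to R's built-in `mtcars` data, purely for illustration) that computes AIC
and AIC~c~ by hand from a model's log-likelihood:
```{r}
# Compute AIC and AICc by hand for a simple linear model
m <- lm(mpg ~ wt + hp, data = mtcars)
logL <- as.numeric(logLik(m))   # log-likelihood of the model
k <- attr(logLik(m), "df")      # number of estimated parameters (incl. residual variance)
n <- nobs(m)                    # sample size
aic_by_hand <- -2 * logL + 2 * k
aicc_by_hand <- aic_by_hand + (2 * k * (k + 1)) / (n - k - 1)
c(AIC = aic_by_hand, AICc = aicc_by_hand)  # matches AIC(m) and MuMIn::AICc(m)
```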
## Example of model selection
In lecture 8, we used mixed-effect models to determine whether species richness
was influenced by NAP (i.e. the height of a sampling location relative to mean
tidal level) across 9 beaches. We generated 3 models: (1) A random intercept
model with NAP as a fixed effect and the random effect allowing the intercept
(i.e species richness) to vary by beach; (2) A random intercept and slope model
with NAP as a fixed effect and a random effect allowing both the intercept (i.e.
richness) and slope (i.e. response of richness to NAP) to vary across beaches;
and (3) An intercept only model with no fixed effects but allowing for variation
in richness across beaches. In this lecture, we'll create similar models with
random intercepts and random intercepts + slopes, but this time we'll include
Exposure and the NAP by Exposure interaction as additional fixed effects. Let's
load in the RIKZ data.
```{r}
# Load in packages used throughout lesson
library(tidyverse)
library(lme4)
library(lmerTest)
library(MuMIn)
```
```{r, eval=FALSE}
# Load in data
rikz_data <- "https://uoftcoders.github.io/rcourse/data/rikz_data.txt"
download.file(rikz_data, "rikz_data.txt")
rikz_data <- read_delim("rikz_data.txt",
col_names = TRUE,
delim = "\t")
rikz_data$Beach <- as.factor(rikz_data$Beach)
```
```{r, echo=FALSE}
rikz_data <- read_delim("data/rikz_data.txt",
col_names = TRUE,
delim = "\t")
```
```{r}
rikz_data$Beach <- as.factor(rikz_data$Beach)
```
Given these 3 models, we may be interested in knowing which one best fits our
observed data so that we can interpret this model to draw inferences about our
population. To do this, we can follow the guidelines laid out in Zuur _et al._
(2009):
1. Create a saturated model that includes all fixed effects (and their
interactions) and random effects. If you can't include all fixed effects, you
should select those that you think are most likely to be important based on your
knowledge of the system at hand.
2. Using the saturated model, optimize the random-effect structure of the model.
Compare models with saturated fixed effects structure with models of differing
random effect structure. Models should be fit using Restricted Maximum
Likelihood (i.e. `REML = TRUE`). The optimal random effect structure is the one
that provides the lowest AIC. Note that some common sense is needed here: you
should not remove random effects if they are included to specifically account
for non-independence in your data (i.e. nestedness, see lecture 8)
3. Optimize the fixed-effect structure by fitting the model with optimized
random effects to models with differing fixed effect structures. These models
should be fit with Maximum Likelihood (i.e. `REML = FALSE`) to prevent biased
fixed-effect parameter estimates. Models can be selected on the basis of AIC
(lowest is best) or by comparing nested models using Likelihood Ratio Tests
(LRTs). **Important**: You cannot include models that contain interactions if
the main effects involved in the interaction are not present in the model.
4. Run the final model with optimized fixed and random effects using REML.
Note that this approach to model selection can also be applied to models that
lack random effects (e.g. simple linear regression). In such cases, you don't
need to worry about random effects and can go ahead and just optimize the fixed
effects. You also don't need to worry about ML vs. REML.
Let's try this with some real data. We'll create 2 models, this time including
Exposure and the interaction between NAP and Exposure as additional fixed
effects.
```{r}
# Define 2 models. Fit both with REML.
mixed_model_IntOnly <- lmer(Richness ~ NAP*Exposure + (1|Beach), REML = TRUE,
data = rikz_data)
mixed_model_IntSlope <- lmer(Richness ~ NAP*Exposure + (1 + NAP|Beach), REML = TRUE,
data = rikz_data)
```
#### Step 1: Create saturated model
This is already done. The saturated model is `mixed_model_IntSlope` created in
the code chunk above.
#### Step 2: Optimize random-effect structure
To optimize the random effects, we compare the `mixed_model_IntSlope` with the
`mixed_model_IntOnly`. This will determine whether including a random slope for
each beach improves the fit of the model to the observed data. **Note:** We are
not testing the `mixed_model_IntOnly` model against one in which there is no
random effect since including a random intercept for each beach is required to
account for the non-independence in our data. Let's get the AIC~c~ for these two
models below. We will use AIC~c~ rather than AIC because $n/k$ is well below 40
here: the RIKZ data contain only 45 observations, while the saturated model
estimates several fixed-effect and variance-covariance parameters.
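If you'd like to verify this ratio yourself, here is a quick sketch; the `df`
attribute of `logLik()` gives the number of parameters estimated by the
saturated model:
```{r}
# Rule-of-thumb check: n/k < 40 favours AICc over AIC
n <- nrow(rikz_data)                          # number of observations
k <- attr(logLik(mixed_model_IntSlope), "df") # parameters in the saturated model
n / k
```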
```{r}
AICc(mixed_model_IntOnly, mixed_model_IntSlope)
```
Based on the output above, it seems the model with only a random intercept is a better
fit to the data (i.e. lower AIC~c~). The optimal random-effect structure is thus
one that includes only a random intercept for each beach but does **not**
include a random slope.
#### Step 3: Optimize the fixed effect structure
We now need to refit the model with the optimal random-effect structure using ML
and compare different fixed effect structures. Let's fit these models below and
check their AIC~c~s.
```{r}
# Full model with both fixed effects and their interaction
mixed_model_IntOnly_Full <- lmer(Richness ~ NAP*Exposure + (1|Beach), REML = FALSE,
data = rikz_data)
# No interaction
mixed_model_IntOnly_NoInter <- lmer(Richness ~ NAP + Exposure + (1|Beach),
REML = FALSE,
data = rikz_data)
# No interaction or main effect of exposure
mixed_model_IntOnly_NAP <- lmer(Richness ~ NAP + (1|Beach),
REML = FALSE,
data = rikz_data)
# No interaction or main effect of NAP
mixed_model_IntOnly_Exp <- lmer(Richness ~ Exposure + (1|Beach),
REML = FALSE,
data = rikz_data)
# No fixed effects
mixed_model_IntOnly_NoFix <- lmer(Richness ~ 1 + (1|Beach),
REML = FALSE,
data = rikz_data)
AICc(mixed_model_IntOnly_Full, mixed_model_IntOnly_NoInter,
mixed_model_IntOnly_NAP, mixed_model_IntOnly_Exp,
mixed_model_IntOnly_NoFix)
```
Based on the output above, it looks like the model that includes NAP, Exposure,
and their interaction provides the best fit to the data.
#### Step 4: Interpret model output
Summarizing the output below, we see that increasing both NAP and Exposure
results in a decrease in species richness (_P_ < 0.05). There is also a nearly
significant interaction between NAP and Exposure (I wouldn't interpret this
since _P_ > 0.05). Finally, while Beach is included in our model as a random
effect, notice how little variation is attributed to differences between beaches
relative to the model we ran in lecture 8, even though the only difference is
that our current model also includes Exposure (and its interaction with NAP) as
fixed effects. This suggests that much of the variation between beaches in
lecture 8 was likely attributable to differences in exposure, which is now being
captured by the fixed effects.
```{r}
# Summarize best-fit model
summary(update(mixed_model_IntOnly_Full, REML = TRUE))
```
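If you only want the variance components rather than the full summary,
`VarCorr()` extracts them directly, which makes it easy to see how little
variation is left among beaches once Exposure is in the model:
```{r}
# Variance components of the final model: among-beach vs. residual variance
print(VarCorr(update(mixed_model_IntOnly_Full, REML = TRUE)), comp = "Variance")
```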
## A more realistic example
In Assignment 5, you used data from Santangelo _et al._ (2019) who were
interested in understanding how insect herbivores and plant defenses influence
the expression of plant floral traits. While that was one component of the
study, the main question was whether herbivores, pollinators, and plant defenses
alter the shape and strength of _natural selection_ on plant floral traits. In
other words, which of these 3 agents of selection (plant defenses, herbivores,
or pollinators) are most important in driving the evolution of floral traits in
plants?
The motivation for that experiment actually came a few years earlier, in 2016,
when Thompson and Johnson (2016) published an experiment examining how plant
defenses alter natural selection on plant floral traits. They found some
interesting patterns, but it was unclear whether these were driven by the
plant's interactions with herbivores, pollinators, or both. This was because
they didn't directly manipulate these agents: pollination was not quantified in
their experiment, and herbivore damage was measured observationally, so their
results were correlative. However, their data provide a prime use case for model
selection in ecology and evolution.
The data consists of 140 observations (rows). Each row in the dataset
corresponds to the mean trait value of one plant genotype (they had replicates
for each genotype but took the average across these replicates) grown in a
common garden. They measured 8 traits and quantified the total mass of seeds
produced by plants as a measure of absolute fitness. Genotypes were either
"cyanogenic" (i.e. containing plant defenses) or were "acyanogenic" (i.e.
lacking plant defenses). In addition, they quantified the amount of herbivore
damage (i.e. percent leaf area lost) on each plant twice throughout the growing
season, although here we will only focus on the effects of plant defenses and
avoid their herbivore damage measurements. We are going to conduct a **genotypic
selection analysis** to quantify natural selection acting on each trait (while
controlling for other traits) and assess whether plant defenses alter the
strength or direction of natural selection imposed on these traits. While this
may sound complicated, it turns out that a single multiple regression is all
that's required to do this: relative fitness is the response variable and
_standardized_ traits (i.e. mean of 0 and standard deviation of 1), treatments,
and their interactions are the predictors. This multiple regression approach is
a common way of measuring natural selection in nature (see Lande and Arnold
1983, Rausher 1992, Stinchcombe 2002).
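To make this setup concrete, here is a minimal sketch with made-up numbers (the
column names below are invented for illustration; the Thompson and Johnson data
we load next already contain the relative fitness and standardized trait
columns):
```{r}
# Toy example: constructing relative fitness and a standardized trait
toy <- tibble(seed_mass = c(10, 4, 7, 12),             # absolute fitness (made up)
              banner_length = c(2.1, 1.4, 1.9, 2.6))   # a floral trait (made up)
toy %>%
  mutate(rel_fitness = seed_mass / mean(seed_mass),            # relative fitness
         banner_length_s = as.numeric(scale(banner_length)))   # mean 0, sd 1
```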
Let's start by loading in the data.
```{r, eval=FALSE}
# Load in Thompson and Johnson (2016) data
Thompson_data <- "https://uoftcoders.github.io/rcourse/data/Thompson-Johnson_2016_Evol.csv"
download.file(Thompson_data, "Thompson-Johnson_2016_Evol.csv")
Thompson_data <- read_csv("Thompson-Johnson_2016_Evol.csv",
col_names = TRUE)
```
```{r, echo=FALSE}
Thompson_data <- read_csv("data/Thompson-Johnson_2016_Evol.csv",
col_names = TRUE)
glimpse(Thompson_data)
```
We will now generate the global model. Remember, this should be a saturated
model with all of the fixed effects and their interactions. We are including the
presence of hydrogen cyanide (HCN; `cyanide` in the model below), all
standardized traits, and the trait-by-HCN interactions as fixed effects in this
model. There are no random effects in this model, so we can go ahead and use
`lm()`.
```{r}
# Create saturated model
GTSelnModel.HCN <- lm(RFSeed ~ VegGrowth.S*cyanide + BnrLgth.S*cyanide +
BnrWdt.S*cyanide + FrstFlwr.S*cyanide +
InflNum.S*cyanide + FlwrCt.S*cyanide +
InflWdt.S*cyanide + InflHt.S*cyanide,
data = Thompson_data)
```
Next, we will perform our model selection based on AIC~c~ (due to low sample
sizes). We automate this process using the `dredge()` function from the `MuMIn`
package. `dredge()` offers a **ton** of flexibility in how model selection is
done. You can customize the criterion used (e.g. AIC, BIC), how the output is
reported, what's included in the output (e.g. do you want F-stats and R^2^ to be
included?), whether some terms should appear in every model, and even whether
some terms should only be included when other terms are present (AKA a
*dependency chain*); a sketch of a few of these options appears after the next
code chunk. For our purposes, we will perform an **all-subsets** model
selection, comparing models with all combinations of predictors (but not those
where main effects are absent despite the presence of an interaction!). I warned
earlier that this approach has been criticized. However, in this case it's
reasonable: we know from work in other systems that all of these traits could
conceivably experience selection, and that selection on them could vary due to
plant defenses. In other words, all terms in this model represent biologically
real hypotheses. Let's go ahead and dredge.
```{r}
options(na.action = "na.fail") # Required for dredge to run
GTmodel_dredge <- dredge(GTSelnModel.HCN, beta = F, evaluate = T, rank = AICc)
options(na.action = "na.omit") # set back to default
```
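For reference, here is a sketch (not run) of a few of the `dredge()` options
mentioned above; the particular settings shown are illustrative rather than
choices we are making for this analysis:
```{r, eval=FALSE}
# Not run: illustrating a few dredge() options
dredge(GTSelnModel.HCN,
       rank = AICc,          # rank models by AICc
       fixed = ~ cyanide,    # force cyanide to appear in every candidate model
       m.lim = c(0, 10),     # limit models to at most 10 terms
       extra = "R^2")        # also report R^2 for each model
```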
Let's have a look at the first few lines returned by `dredge()`. Let's also
print out how many models were compared.
```{r}
head(GTmodel_dredge)
nrow(GTmodel_dredge)
```
The output tells us the original model and then provides a rather large table
with many rows and columns. The rows in this case are different models with
different combinations of predictors (n = 6,817 models). The columns are the
different terms from our model, which `dredge()` has abbreviated. The numbers in
the cells are the estimates (i.e. beta coefficients) for each term that is
present in the model; blank cells mean that term was not included in the model.
The last 5 columns are important: they give us the degrees of freedom for the
model (a function of the number of terms in the model), the log-likelihood of
the model, the AIC~c~ score, the delta AIC~c~, and the Akaike weights. The delta
AIC~c~ is the difference between the AIC~c~ score of a model and the AIC~c~
score of the top (lowest-AIC~c~) model. The weight can be thought of as the
probability that the model is the best model in the candidate set included in
the model selection procedure.
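The weights follow directly from the delta values:
$w_i = \exp(-\Delta_i/2) / \sum_j \exp(-\Delta_j/2)$. As a quick check, we can
recompute them by hand and compare them to the `weight` column reported by
`dredge()`:
```{r}
# Akaike weights: w_i = exp(-delta_i / 2) / sum_j exp(-delta_j / 2)
delta <- GTmodel_dredge$delta
weights_by_hand <- exp(-delta / 2) / sum(exp(-delta / 2))
head(cbind(delta = delta, weight_by_hand = weights_by_hand))
```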
Given this output, we may be interested in retrieving the top model and
interpreting it. Let's go ahead and do this. We can retrieve the top model using
the `get.models()` function, specifying that we want the top model using the
`subset` argument. We need to further subset this output since it returns a
list.
```{r}
top_model <- get.models(GTmodel_dredge, subset = 1)[[1]]
top_model
```
This output above shows us the top model from our dredging. What if we want to
interpret this model? No problem!
```{r}
# Summarize top model
summary(top_model)
```
But how much evidence do we actually have that this is the **best** model? We
have over 6,000 models so it's unlikely that only one model explains the data.
From the `dredge` output we can see there is little difference in the AIC~c~ and
weights of the first few models. Is there really much of a difference between
two models whose AIC~c~ scores differ by only 0.14? How do we decide which
model(s) to interpret? Statisticians have thought about this problem, and the
general guideline is that models within 2 delta AIC~c~ (or whatever criterion is
used) of the top model are considered to be just as good as the top model, so we
shouldn't simply discount them. Alternatively, we could use the weights: if a
model has a weight greater than or equal to 0.95, it can be treated as the
single best model. Otherwise, we can generate a "credibility" set consisting of
the models whose cumulative Akaike weights sum to 0.95. In any case, the point
is that we have no good reason to exclude models other than the top one when the
models just below it are likely to be just as good. To get around this, we can
perform what's called **model averaging** (AKA multi-model inference), which
averages the parameter estimates across multiple models and thereby accounts for
model uncertainty. Let's do this below by averaging all models with a delta
AIC~c~ <= 2.
```{r}
summary(model.avg(GTmodel_dredge, subset = delta <= 2))
```
```
The first part of the output breaks down which terms are part of which models
and gives some nice descriptive statistics for these models. The next part is
the important bit: the actual parameter estimates from the model averaging. The
estimates are those that were averaged across all models with a delta AIC <= 2.
Note there are two sets of estimates: the "full" coefficients set a term's
estimate to 0 in any model that does not include it before averaging, whereas
the "conditional" coefficients average only over the models in which the term
actually appears. The "full" coefficients are thus more conservative, and it is
best practice to interpret these. Finally, the last part of the output tells us
how many of the averaged models included each term.
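If you prefer the 95% "credibility" set described above to the delta AIC~c~ <= 2
cutoff, `model.avg()` can also take a cumulative-weight subset; a sketch (not
run here):
```{r, eval=FALSE}
# Average over the smallest set of models whose Akaike weights sum to 0.95
summary(model.avg(GTmodel_dredge, subset = cumsum(weight) <= 0.95))
```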
### Caveats to model selection
1. The results depend on the models included in the candidate set. You can't
identify a model as being the "best" fit to the data if you didn't include that
model to begin with!
2. The parameter estimates and predictions arising from the "best" model or set
of best models should be biologically meaningful.
3. You need to decide whether to use model selection or traditional inferential
statistics (e.g. based on _P_-values). Approaches that combine the two are
possible (e.g. backward variable selection, or an all-subsets search followed by
averaging of the top models, as in the example provided above).
### Formal test between two models
Throughout the early parts of the lecture, we made qualitative decisions about
which model was best based on the models' AIC~c~ scores. However, it is also
possible to statistically test whether one model fits the data better than
another using a _likelihood ratio test_. This test compares the goodness of fit
of two models by comparing twice the difference in their log-likelihoods (i.e.
-2 times the log of their likelihood ratio) against a Chi-squared distribution.
Importantly, this approach requires that the two models be nested (i.e., one
model must be a subset of the other). This can be implemented in R using the
`anova(model1, model2)` syntax, as in the example below.
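For example, we can test whether adding the NAP by Exposure interaction improves
the fit of the RIKZ model by comparing the two nested models we fit earlier with
ML:
```{r}
# Likelihood ratio test of nested models (both fit with ML above)
anova(mixed_model_IntOnly_NoInter, mixed_model_IntOnly_Full)
```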
## Additional reading
1. Johnson, J. and Omland, K. 2004. Model selection in ecology and evolution.
_Trends in Ecology and Evolution_ **19**: 101-108.
2. Burnham K.P., Anderson D.R. 2002. Model selection and multimodel inference,
2nd edn. *Springer*, New York.
3. Symonds, M. and Moussalli, A. 2011. A brief guide to model selection,
multimodel inference and model averaging in behavioural ecology using Akaike's
information criterion. _Behavioural Ecology and Sociobiology_ **65**: 13-21.
4. Zuur, A. *et al.* 2009. Mixed effects models and extensions in ecology with
R. *Springer*.
5. Thompson, K.A. and Johnson, M.T.J. 2016. Antiherbivore defenses alter natural
selection on plant reproductive traits. _Evolution_ **70**: 796-810.
6. Lande, R. and Arnold, S. 1983. The measurement of selection on correlated
characters. _Evolution_ **37**: 1210-1226.
7. Rausher, M. 1992. The measurement of selection on quantitative traits: biases
due to environmental covariances between traits and fitness. _Evolution_ **46**:
616-626.
8. Stinchcombe, J.R. _et al._ 2002. Testing for environmentally induced bias in
phenotypic estimates of natural selection: theory and practice. _Am. Nat._
**160**: 511-523.