-
Notifications
You must be signed in to change notification settings - Fork 33
/
assignment-05.Rmd
168 lines (132 loc) · 7.56 KB
/
assignment-05.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
---
title: 'Assignment 5: Model selection and multivariate statistics (9 marks)'
output:
html_document:
toc: false
---
*To submit this assignment, upload the full document on blackboard,
including the original questions, your code, and the output. Submit
your assignment as a knitted `.pdf` (preferred) or `.html` file.*
1. In this exercise, we will once again use the data of Santangelo _et al._
(2019) that you used in assignment 4. Let's go ahead and load in the data. See
assignment 4 if you need a refresher on the details of the experiment and the
dataset. (5 marks total)
```{r message=FALSE, warning=FALSE}
library(tidyverse)
library(lme4)
library(lmerTest)
Santangelo_data <- "https://uoftcoders.github.io/rcourse/data/Santangelo_JEB_2018.csv"
download.file(Santangelo_data, "Santangelo_JEB_2018.csv")
Santangelo_data <- read_csv("Santangelo_JEB_2018.csv",
col_names = TRUE)
glimpse(Santangelo_data)
head(Santangelo_data)
```
a. Model selection works best when there are no missing values in your
dataset. We will be identifying the best model that predict variation in
flowering time (`Flower.date`) across plants. Create a dataset that excludes
rows where there is missing flowering date data. (0.25 marks)
b. We want to know how HCN, herbivores, and pollinators influence flowering
date. We also think that the effect of herbivores and pollinators on
flowering data might depend on whether the plant is producing HCN. Create a
model that includes fixed-effects that test these predictions. Be sure to
account for variation due to `Genotype`, `Block` and the `Genotype` by
`Herbivory` interactions by including these terms as random effects. This
will be our saturated model. You can ignore `boundary (singular) fit`
warnings that may arise. (1 marks)
c. We will generate a reduced model from the saturated model in (a). Should
we use AIC or AIC~c~. Why? Show your calculation. (0.5 marks)
d. Using the approach described in lecture 11, optimize the random effect
structure of this model. Show the AIC/AIC~c~ output for each model of
varying random effect strucure. Provide a one sentence justification for
each random effect the model, justifying whether it is fixed (i.e., in every
model) or whether some models will drop this effect. Describe in one
sentence what the optimal random effect structure of the model is and why.
(0.5 marks)
e. Using the model with the optimal random-effect structure identified in
(c), find the optimal fixed-effect structure. Be sure to show all the models
and their AIC/AIC~c~ scores. (1.5 mark)
f. Based on the AIC/AIC~c~ output from (d), generate your final model with
both optimized fixed and random effects. Summarize the model and interpret
its output. Is there a significant effect of any treatment? If so, which
one(s) and in which direction. Make a statement about the significant
treatments' effects on flowering date. Use the model's output to support
your answer. You only need to interpret significant main effects here (i.e.
not interactions). (0.75 marks)
g. Do you think we were justified in interpreting a single model? Why or why
not? What alternative approach could we have used? (0.25 marks).
h. Use `dplyr` and `ggplot2` to plot the flowering date of plants by the
_main_ effect that showed a significant effect in the optimized model above.
The figure should show the mean plus and minus a single standard error of
the mean. Suggest one biological interpretation of the pattern you see in
the figure and in the model (i.e. why do you think this would happen). If
there are no significant effects in the model, simply write "There are no
significant effects!". (0.25 marks)
2. During the multivariate statistics lecture, we made use of vector community
and malaria survey data collected by Mbogo _et al._ (2003) to disentangle the
effects of vector abundance, species richness, and composition, on malaria
prevalence (see path diagram in lecture 10 for a reminder of these
relationships). In this exercise, we will complete the analysis of the
strucutral equation model we began building in class. (1.5 marks total)
Here are some relevant snippets of code taken from the lecture notes to get you
started on this exercise.
```{r eval=FALSE}
library(lavaan)
kenya.wide <- read.csv("kenya.wide.csv", header=TRUE, sep=",")
kenya.pca <- kenya.wide %>%
dplyr::select(arabiensis, gambiae, funestus, merus) %>% #choose relevant columns
mutate_all(sqrt) %>% #this is the Hellinger transformation
prcomp(.) #pipe directly into baseR function for PCA
axes <- data.frame(kenya.pca$x)
kenya.wide <- bind_cols(kenya.wide, axes)
kenya.wide$s.abun <- log(kenya.wide$total.abundance)
kenya.wide$s.sr <- log(kenya.wide$SR)
kenya.wide$s.pfpr <- log(kenya.wide$PfPR)
sem02 <- '
# regressions
l.pfpr ~ a*l.sr + b*l.abun + c*PC2
# correlations
l.sr ~~ d*l.abun
PC2 ~~ e*l.sr
# defined parameters
indirect.abun := (a*d) #indirect effect of abundance via SR
indirect.abun2 := (d*e*c) #indirect effect of abundance via PC2
total.abun := b + (a*d) + (d*e*c)
'
```
a. Complete the structural equation model by adding in calculations for the
indirect and total effects of species richness (SR) and composition (PC2) on
malaria prevalence (PfPR). (0.5 marks)
b. Evaluate the model, bootstrapping confidence intervals for path
coefficients with seed #778. Which predictor had the largest _direct_ effect
on malaria prevalence? How about _total_ effect? Briefly explain these
effects in plain english (1 sentence each). (1 mark)
3. In this exercise, we will be investigating the relationship between vector
community structure and another commonly used metric of disease risk --
entomological inoculation rate (EIR). EIR is a measure of the number of bites by
infectious mosquitoes per person per unit time. We will be making use of the
same data from Mbogo _et al._ (2003) as before, only this time, we will start
with the long form data. (2.5 marks total)
```{r eval=FALSE}
kenya.long <- read.csv("kenya.long.csv", header=TRUE, sep=",")
```
This dataset consists of the same information as kenya.wide, with the addition
of one new columns for "EIR".
a. Convert this dataset to the wide format. Fill cells in the wide dataset
with the **relative abundance** of each species, and include the columns
"total abundance" and "EIR" in the final product. (Hint: use xxxx_join to
add the desired columns to the wide dataset after you spread it) (Hint2:
pivot_wider() may be easier to use than spread()) (1 marks)
b. Construct a series of linear models to investigate the relationship
between EIR and i) total mosquito abundance and ii) the abundance of each
species. Interpret the results of these models. (Hint: is EIR a simple
function of total mosquito abundance, or is there a particular species that
is contributing disproportionately to it?) (0.5 mark)
c. Investigate the influence of total abundance and community structure (use
the first two PC axes) on EIR with a strcutural equation model. Include only
direct effects only in this model, and pretend we have reason to believe
total abundance is associated with community composition.
i. Briefly explain the correlation structure you have chosen for your
predictors, total abundance, PC1, and PC2. (0.5 marks)
ii. Evaluate the model. Are these results congruent with your findings
from part (a)? (0.5 marks)