forked from UofTCoders/rcourse
assignment-04.Rmd
---
title: 'Assignment 4: Exploration, linear and mixed-effects models'
output:
  html_document:
    toc: false
---
```{r setup, echo=FALSE}
knitr::opts_chunk$set(eval = FALSE)
```
*To submit this assignment, upload the full document on Blackboard,
including the original questions, your code, and the output. Submit
your assignment as a knitted `.pdf` (preferred) or `.html` file.*
1. Visualization (3 marks)
Import the tidyverse library. We will be using the same beaver1 dataset that
we used in last week's assignment.
```{r message=FALSE, warning=FALSE}
library(tidyverse)
```
a. Create a histogram to visualize the distribution of the beavers' body
temperatures, separating the temp data based on the beaver's activity level
(after transforming it into a categorical value the way you did for your
last assignment). Describe the properties of the distributions. When
creating this plot for the purpose of evaluating temp distribution, what
argument did you adjust and why? (1 mark)
```{r eval=FALSE}
# This part is the same as the last assignment
beaverActive <- beaver1 %>%
  mutate(factorActive = factor(activ))
ggplot(beaverActive, aes(x = temp, fill = factorActive)) +
  geom_histogram(binwidth = 0.02) # binwidth is the only new part here
# Mention average, range, and skew (or kurtosis - less emphasized) for both
# activity states (0.5)
# Just need to mention that testing only one binwidth can affect your perception
# of the distribution's properties (0.5)
```
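The marking notes above ask for the average, range, and skew of each activity state. A minimal base-R sketch of those summaries follows. The `skewness()` helper and the `temp_summary` name are illustrative assumptions rather than part of the original answer key, and the moment-based formula is only one of several common definitions of skewness.

```r
# beaver1 ships with base R (datasets package)
beaver_active <- transform(beaver1, factorActive = factor(activ))

# Moment-based sample skewness (hypothetical helper, not from the assignment)
skewness <- function(x) {
  centred <- x - mean(x)
  mean(centred^3) / mean(centred^2)^1.5
}

# One row of summaries per activity state (row names are the factor levels)
temp_summary <- do.call(rbind, lapply(
  split(beaver_active$temp, beaver_active$factorActive),
  function(temp) {
    data.frame(n = length(temp), mean = mean(temp),
               min = min(temp), max = max(temp),
               skew = skewness(temp))
  }
))
temp_summary
```

Comparing these numbers against histograms drawn with a few different `binwidth` values is one way to check whether an apparent skew is real or an artifact of the binning.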
b. What type of variables are temperature and time of day? With this in
mind, create a visualization that will help you get a better understanding
of the relationship between temperature and time. (0.5 mark)
```{r eval=FALSE}
# Answer-y bit
ggplot(beaverActive, aes(x = time, y = temp)) +
  geom_point()
# Continuous (and/or time is independent & temp dependent). Should be a
# scatterplot with time on the x and temp on the y
```
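One caveat when plotting `time` directly: `?beaver1` documents it in hhmm form (for example, 0330 for 3:30am), so the raw values are not evenly spaced and restart after midnight. The sketch below converts them to hours elapsed since midnight of the first day, the same conversion used in the example on the `beavers` help page. The `hours` name and the base-graphics plot are illustrative choices, not part of the answer key.

```r
# Convert (day, hhmm time) into hours since midnight of the first recorded day
hours <- with(beaver1,
              24 * (day - day[1]) + time %/% 100 + (time %% 100) / 60)

# Scatterplot of temperature against a truly continuous time axis
plot(hours, beaver1$temp,
     xlab = "Hours since midnight on day 1",
     ylab = "Body temperature (degrees C)")
```

The same `hours` column could be added with `mutate()` and passed to `ggplot()` in place of `time`.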
c. Create a single box plot that visualizes all the variables in your data
(temperature, day, and activity). (0.5 mark)
```{r eval=FALSE}
# Answer-y bit
# Just looking for the ability to look at "wholesome" data - tease apart
# factors using x/y, colour, faceting
ggplot(beaverActive, aes(x = factor(day), y = temp, colour = factorActive)) +
  geom_violin()
ggplot(beaverActive, aes(x = factor(day), y = temp)) +
  geom_boxplot() +
  facet_wrap(~factorActive)
```
d. What is one prediction you might make about the relationships among your
variables (based on the patterns you observed)? Create a visualization that
illustrates your prediction, improving on your other plots in at least one
way. State why this plot is an improvement. (1 mark)
```{r eval=FALSE}
# Answer-y bit
# Anything reasonable, ex: body temperature is correlated with time. (0.5)
# Improvement ideas (0.5)
# Boxplot - add scatterplot in front (+ jitter position)
# Boxplot - facet_wrap could add the "free_x" scaling for day (NOT free_y)
# Scatterplot - add geom_smooth
# Violin - only if you don't lose a point
ggplot(beaverActive, aes(x = factor(day), y = temp, colour = factorActive)) +
  geom_boxplot() +
  geom_point(position = "jitter", alpha = 0.6)
ggplot(beaverActive, aes(x = factorActive, y = temp, colour = factor(day))) +
  geom_boxplot() +
  facet_wrap(~ factor(day))
```
2. Unusual Values (1.5 marks)
Looking at your beaver1 data, consider the prediction you made in 1d.
a. There are some particularly high/low body temperature measurements. Give
an example of a systematic or random error (state which) that could have
influenced these values. (0.5 mark)
```{r eval=FALSE}
# Answer-y bit
# Random: beaver was briefly afraid/stressed for an unrelated reason (which
# perhaps could have also led to oversleeping, explaining low values?)
# Systematic: if a few different temperature transmitters were used & one
# became damaged, affecting the accuracy of its readings.
```
b. Consider whether these values would affect your ability to test the
prediction you made for question 1d. Using that plot as a template,
illustrate the effects of including/excluding these points. (Hint: you may
want to either create a second data set or get creative with colour.) State
whether you would remove the points and why. (1 mark)
```{r eval=FALSE}
# Answer-y bit
# This question is a bit trickier - have to combine the two conditions, made
# with appropriate cutoff values. Using "data=" at the start of the geoms for
# using separated filtered & unfiltered data is also a bit trickier. (0.5)
noWeirdos <- beaverActive %>%
  filter(temp < 37.5, temp > 36.4)
ggplot() +
  geom_violin(data = noWeirdos, aes(x = factorActive, y = temp), width = 0.5,
              colour = "purple", position = position_nudge(x = 0.26)) +
  geom_violin(data = beaverActive, aes(x = factorActive, y = temp), width = 0.5,
              colour = "green", position = position_nudge(x = -0.26))
ggplot(beaverActive,
       aes(x = factorActive, y = temp, colour = (temp >= 37.5 | temp <= 36.4))) +
  geom_boxplot()
# Should conclude that removing the points isn't necessary. You expect to see
# some variation, the values don't seem abnormal for a mammal, none of them are
# even particularly far from the median when other factors are considered (day,
# activity). (0.5)
```
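A numeric check can back up the visual comparison. The sketch below is a base-R illustration, assuming the same cutoffs as the answer key (36.4 and 37.5). It flags the unusual readings and compares the median temperature per activity state with and without them. The names `beaver_flagged`, `unusual`, and `kept` are hypothetical.

```r
# Cutoffs taken from the answer key's filter() call
cutoff_low <- 36.4
cutoff_high <- 37.5

# Flag readings at or beyond either cutoff instead of dropping them outright
beaver_flagged <- transform(
  beaver1,
  factorActive = factor(activ),
  unusual = temp <= cutoff_low | temp >= cutoff_high
)

# Median temperature per activity state, with all points and with flagged points removed
median_all <- tapply(beaver_flagged$temp, beaver_flagged$factorActive, median)
kept <- subset(beaver_flagged, !unusual)
median_kept <- tapply(kept$temp, kept$factorActive, median)
rbind(all = median_all, kept = median_kept)
```

If the two rows barely differ, that supports the conclusion that removing the points is unnecessary.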
3. Generalized Linear Models (3 marks)
```{r}
# as_tibble() replaces the deprecated as_data_frame()
co2_df <- as_tibble(as.matrix(CO2)) %>%
  mutate(conc = as.integer(conc),
         uptake = as.numeric(uptake))
```
a. Look through the help documentation (`?CO2`) to understand what each
variable means. Which variable(s) do you think would be the $y$ in the GLM?
Which variable(s) would be the $x$? Briefly defend these choices. (1
mark)
b. How much does `uptake` change if `conc` goes up by 10 mL/L? (*Note:* it
is intentional that there is no mention of the other variables in the
model.) Write out the interpretation as a simple statement of this
contribution of `conc` on `uptake`, when the other variables are also in the
model. (2 marks)
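
Part (b) comes down to rescaling a fitted coefficient, so a short sketch may help. The model formula below (`uptake` regressed on `conc`, `Type`, and `Treatment`, using `glm()`'s default Gaussian family) is an assumption made for illustration, since the assignment text does not spell out which model to fit. It uses the built-in `CO2` data frame directly, and the object names are hypothetical.

```r
# Assumed model: the predictors chosen here are illustrative, not prescribed by the assignment
co2_fit <- glm(uptake ~ conc + Type + Treatment, data = CO2)

# The conc coefficient is the expected change in uptake per 1 mL/L,
# holding the other variables in the model constant
conc_slope <- coef(co2_fit)[["conc"]]

# A 10 mL/L increase therefore corresponds to ten times that slope
change_per_10 <- 10 * conc_slope
change_per_10
```

The interpretation would then read along the lines of: "for every 10 mL/L increase in `conc`, `uptake` changes by `change_per_10` units, with the other variables held constant."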
c. Run the following code if you need to download our survey data.
```{r message=FALSE, warning=FALSE}
download.file("https://ndownloader.figshare.com/files/2292169",
"survey.csv") #if you need to re-download the survey
survey <- read_csv("survey.csv")
```
Use logistic regression to see if weight significantly predicts sex.
Make a concluding statement indicating whether the model is
significant and create a plot to visualize the fitted model. Run the
following code to ensure sex is treated as a factor variable:
```{r}
survey$sex <- as.factor(survey$sex)
```
Hint: you need to make sure there are only two levels to this variable:
`"F"` and `"M"`. (0.5 marks)
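
The hint above can be sketched as follows. To keep the example runnable without the download, it uses a small simulated data frame (`toy`) as a stand-in for `survey`. The values are made up purely for illustration, so the fitted numbers say nothing about the real data. Only the steps carry over: restrict `sex` to two levels, fit a binomial GLM, and extract predicted probabilities for plotting.

```r
set.seed(1)

# Simulated stand-in for the survey data (hypothetical values, illustration only)
toy <- data.frame(
  sex = sample(c("F", "M", ""), size = 300, replace = TRUE,
               prob = c(0.45, 0.45, 0.10)),
  weight = round(rnorm(300, mean = 40, sd = 10))
)

# Keep only rows with a recorded sex, then rebuild the factor so it has exactly two levels
toy <- subset(toy, sex %in% c("F", "M"))
toy$sex <- factor(toy$sex)

# With a two-level factor response, glm() models the probability of the second level ("M")
sex_fit <- glm(sex ~ weight, data = toy, family = binomial)
summary(sex_fit)$coefficients  # the p-value for weight is in the "Pr(>|z|)" column

# Predicted probability of "M" for each observation, ready to plot against weight
toy$p_male <- predict(sex_fit, type = "response")
```

With the real data, the same calls would be applied to `survey`, and `p_male` could be drawn as a curve over the observed weights.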
4. Linear mixed-effects models (4 marks).
Santangelo _et al._ (2018) were interested in understanding how plant
defenses, herbivores, and pollinators influence the expression of plant
floral traits (e.g. flower size). Their experiment had 3 treatments, each
with 2 levels: Plant defense (2 levels: defended vs. undefended), herbivory
(2 levels: reduced vs. ambient) and pollination (2 levels: open vs.
supplemental). These treatments were fully crossed for a total of 8
treatment combinations. In each treatment combination, they grew 4
individuals from each of 25 plant genotypes for a total of 800 plants (8
treatment combinations x 25 genotypes x 4 individuals per genotype). Plants
were grown in a common garden at the Koffler Scientific Reserve (UofT's field
research station) and 6 floral traits were measured on all plants throughout
the summer. We will analyze how the treatments influenced one of these
traits in this exercise. Run the code chunk below to download the data,
which includes only a subset of the columns from the full dataset:
```{r}
library(tidyverse)
plant_data <- "https://uoftcoders.github.io/rcourse/data/Santangelo_JEB_2018.csv"
download.file(plant_data, "Santangelo_JEB_2018.csv")
plant_data <- read_csv("Santangelo_JEB_2018.csv",
col_names = TRUE)
glimpse(plant_data)
head(plant_data)
```
You can see that the data contain 792 observations (i.e. plants, 8 died
during the experiment). There are 50 genotypes across 3 treatments:
Herbivory, Pollination, and HCN (i.e. hydrogen cyanide, a plant defense).
There are 6 plant floral traits: Number of days to first flower, banner
petal length, banner petal width, plant biomass, number of flowers, and
number of inflorescences. Finally, since plants that are closer in space in
the common garden may have similar trait expression due to more similar
environments, the authors included 6 spatial "blocks" to account for this
environmental variation (i.e. plants from block A "share" an environment and
those from block B "share" an environment, etc.). Also keep in mind that
each treatment combination contains 4 individuals of each genotype, which
are likely to have similar trait expression due simply to shared genetics.
a. Use the `lme4` and `lmerTest` R packages to run a linear mixed-effects
model examining how herbivores (`Herbivory`), pollinators (`Pollination`),
plant defenses (`HCN`) _and all interactions_ influence the width of
banner petals (`Avg.Bnr.Wdth`) produced by plants while accounting for
variation due to spatial block and plant genotype. Also allow the intercept
for `Genotype` to vary across the levels of the herbivory treatment. (1
mark: 0.5 for correct fixed effects specification and 0.5 for correct random
effects structure). You only need to specify the model for this part of the
question.
```{r}
library(lme4)
library(lmerTest)
model <- lmer(Avg.Bnr.Wdth ~ HCN * Herbivory * Pollination +
                (1 | Block) + (1 | Genotype) + (1 | Genotype:Herbivory),
              data = plant_data)
```
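Written out, the model in this call corresponds roughly to the equation below. The notation is an illustrative assumption and does not come from the original assignment. Here $\mathbf{x}_i^\top \boldsymbol{\beta}$ stands for the intercept plus all main effects and interactions of `HCN`, `Herbivory`, and `Pollination` for plant $i$, and each remaining term before the residual is a random intercept.

$$ y_i = \mathbf{x}_i^\top \boldsymbol{\beta} + b_{\mathrm{Block}[i]} + g_{\mathrm{Genotype}[i]} + h_{\mathrm{Genotype}[i] \times \mathrm{Herbivory}[i]} + \varepsilon_i $$

$$ b \sim N(0, \sigma^2_{B}), \quad g \sim N(0, \sigma^2_{G}), \quad h \sim N(0, \sigma^2_{GH}), \quad \varepsilon_i \sim N(0, \sigma^2) $$

The term $h$ is what `(1 | Genotype:Herbivory)` adds: a separate intercept shift for each genotype within each herbivory level, which is how the genotype intercept is allowed to vary across herbivory treatments.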
b. Summarize (i.e. get the output) the model that you ran in part (a). Did
any of the treatments have a significant effect on banner petal width? If
so, which ones? Based on your examination of the model output, how can you
tell which level of the significant treatments resulted in larger or smaller
mean banner petal widths? Make a statement for each significant **main**
effect in the model (i.e. not interactions) (0.5 marks).
```{r}
summary(model)
# Answer: Supplemental pollination resulted in a reduction in banner petal width.
```
c. Using `dplyr` and `ggplot2`, plot the mean banner width for one of the
significant interactions in the model above (whichever you choose). The idea
is to show how both treatments interact to influence the mean width of
banner petals using a combination of different colours, linetypes, shapes,
etc. Feel free to use whatever kind of plot is appropriate for this kind
of data. Also include 1 standard error around the mean. As a reminder, I
have included the formula to calculate the standard error of the mean below.
(1.5 marks). **Bonus**: Avoid overlap in the points in the figure (0.25
marks).
```{r}
plant_data %>%
  group_by(Herbivory, HCN) %>%
  summarise(mean = mean(Avg.Bnr.Wdth, na.rm = TRUE),
            sd = sd(Avg.Bnr.Wdth, na.rm = TRUE),
            n = sum(!is.na(Avg.Bnr.Wdth)), # Tough!! Half marks for using n()
            se = sd / sqrt(n)) %>%
  ggplot(., aes(x = HCN, y = mean, shape = Herbivory, color = Herbivory)) +
  geom_errorbar(aes(ymax = mean + se, ymin = mean - se), width = 0.15,
                position = position_dodge(width = 0.15)) +
  geom_point(position = position_dodge(width = 0.15)) +
  theme_classic()
```
$$ SE = \frac{sd}{\sqrt{n}} $$
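
The marking note about `n()` matters because the $n$ in the formula should count only the values that were actually measured. A minimal sketch with made-up numbers (the vector `x` is hypothetical) shows how counting missing values shrinks the standard error:

```r
# Hypothetical measurements with two missing values
x <- c(2.1, 2.4, NA, 2.0, NA, 2.5)

n_all <- length(x)       # what n() would count, including the NAs
n_obs <- sum(!is.na(x))  # number of values actually measured

# sd() drops the NAs in both cases, so only the denominator differs
se_with_nas <- sd(x, na.rm = TRUE) / sqrt(n_all)
se_correct <- sd(x, na.rm = TRUE) / sqrt(n_obs)
c(se_with_nas = se_with_nas, se_correct = se_correct)
```

Dividing by the larger count understates the uncertainty, which is why the answer key uses `sum(!is.na(...))`.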
d. After accounting for the fixed effects, how much of the variation in
banner petal width was explained by each of the random effects in the
model? Show your work (0.5 marks).
```{r}
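# Variance components copied by hand from the random-effects table of summary(model);
# the fourth term is assumed to be the residual variance
total_var <- 0.003088 + 0.067091 + 0.003231 + 0.044998
# Share of the variance left after the fixed effects, for each random effect
Genotype <- 0.067091 / total_var
Genotype_herb <- 0.003088 / total_var
Block <- 0.003231 / total_var
Genotype
Genotype_herb
Block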
```
e. Describe the pattern you see in the figure generated in part (c). Why do
you think the interaction you plotted was significant in the model? Suggest
one plausible ecological explanation for the observed pattern. (0.5 marks)
# Question Ideas
These could be put in a question bank or something so that future, more creative
people can take them and make them into useful questions?
## Visualization
Think of any one change you could make to the last plot that would improve your
ability to understand the relationship between time and temperature. Make the
adjustment and plot again below. (0.5 mark)
```{r eval=FALSE}
# Just factor in any of the other variables from the data set (day, activity),
# likely either by colouring the values that way or perhaps by faceting. Could
# also add a geom_smooth to the scatterplot.
ggplot(beaverActive, aes(x = time, y = temp, colour = factor(day))) +
  geom_point()
ggplot(beaverActive, aes(x = time, y = temp)) +
  geom_point() +
  facet_wrap(~ factor(day), scales = "free_x")
ggplot(beaverActive, aes(x = time, y = temp, colour = factorActive)) +
  geom_point()
ggplot(beaverActive, aes(x = time, y = temp, colour = factorActive)) +
  geom_point() +
  geom_smooth()
```
Create a box plot that will help you understand whether patterns in your data
might offer some support for this prediction: "activity is a better predictor of
body temperature than day" (0.5 mark)
```{r eval=FALSE}
# Answer-y bit
# Just looking for the ability to tease apart two factors (activity, day) using
# x & colour (or faceting)
ggplot(beaverActive, aes(x = factor(day), y = temp, colour = factorActive)) +
  geom_violin() +
  geom_point(position = "jitter", alpha = 0.6)
ggplot(beaverActive, aes(x = factor(day), y = temp)) +
  geom_boxplot() +
  facet_wrap(~factorActive)
```