assignment-04.Rmd

---
title: 'Assignment 4: Exploration, linear and mixed-effects models'
output:
    html_document:
        toc: false
---

```{r setup, echo=FALSE}
knitr::opts_chunk$set(eval = FALSE)
```

*To submit this assignment, upload the full document on blackboard,
including the original questions, your code, and the output. Submit
you assignment as a knitted `.pdf` (prefered) or `.html` file.*

1. Visualization (3 marks)

    Import the tidyverse library. We will be using the same beaver1 dataset that
    we used in last week's assignment.

    ```{r message=FALSE, warning=FALSE}
    library(tidyverse)
    ```

    a. Create a histogram to visualize the distribution of the beavers' body
    temperatures, separating the temp data based on the beaver's activity level
    (after transforming it into a categorical value the way you did for your
    last assignment). Describe the properties of the distributions. When
    creating this plot for the purpose of evaluating temp distribution, what
    argument did you adjust and wny? (1 mark)
```{r eval=FALSE}
# This part is the same as the last assignment
beaverActive <- beaver1 %>%
  mutate(factorActive = factor(activ))

ggplot(beaverActive, aes(x=temp,fill=factorActive)) +
  geom_histogram(binwidth=0.02) # binwidth is the only new part here

# Mention average, range, and skew (or kurtosis - less emphasized) for both 
# activity states (0.5)
# Just need to mention that testing only one binwidth can affect your perception
# of the distribution's properties (0.5)
```

    b. What type of variables are temperature and time of day? With this in
    mind, create a visualization that will help you get a better understanding
    of the relationship between temperature and time. (0.5 mark)
```{r eval=FALSE}
# Answer-y bit
ggplot(beaverActive, aes(x=time,y=temp)) +
  geom_point()

# Continuous (and/or time is independent & temp dependent). Should be a 
# scatterplot with time on the x and temp on the y
```

    c. Create a single box plot that visualizes all the variables in your data
    (includes temperature, activity, day, and activity). (0.5 mark)
```{r eval=FALSE}
# Answer-y bit
# Just looking for the ability to look at "wholesome" data - tease apart
# factors using x/y, colour, faceting
ggplot(beaverActive, aes(x= factor(day) ,y=temp, colour=factorActive)) +
  geom_violin() 

ggplot(beaverActive, aes(x=factor(day) ,y=temp)) +
  geom_boxplot() +
  facet_wrap(~factorActive)
```
    
    d. What is one prediction you might make about the relationships among your
    variables (based on the patterns you observed)? Create a visualization that
    illustrates your prediction, improving on your other plots in at least one
    way. State why this plot is an improvement. (1 mark)
```{r eval=FALSE}
# Answer-y bit
# Anything reasonable, ex: body temperature is correlated with time. (0.5)
# Improvement ideas (0.5)
# Boxplot - add scatterplot in front (+ jitter position)
# Boxplot - facet_wrap could add the "free_x" scaling for day (NOT free_y)
# Scatterplot - add geom_smooth 
# Violin - only if you don't lose a point

ggplot(beaverActive, aes(x= factor(day) ,y=temp, colour=factorActive)) +
  geom_boxplot() +
  geom_point(position="jitter", alpha=0.6)

ggplot(beaverActive, aes(x=factorActive,y=temp, colour=factor(day))) +
  geom_boxplot() +
  facet_wrap(~ factor(day) )
```


2. Unusual Values (1.5 marks)

    Looking at your beaver1 data, consider the prediction you made in 1d.

    a. There are some particularly high/low body temperature measurements. Give
    an example of a systematic or random error (state which) that could have
    influenced these values. (0.5 mark)
```{r eval=FALSE}
# Answer-y bit
# Random: beaver was briefly afraid/stressed for an unrelated reason (which 
# perhaps could have also led to oversleeping, explaining low values?)
# Systematic: if a few different temperature transmitters were used & one 
# became damaged, affecting the precision of its readings.
```

    b. Consider whether these values would affect your ability to test the
    prediction you made for question 1d. Using that plot as a template,
    illustrate the effects of including/excluding these points. (Hint: you may
    want to either create a second data set or get creative with colour.) State 
    whether you would remove the points and why. (1 mark)
```{r eval=FALSE}
# Answer-y bit
# This question is a bit trickier - have to combine the two conditions, made 
# with appropriate cutoff values. Using "data=" at the start of the geoms for
# using separated filtered & unfiltered data is also a bit trickier. (0.5)
noWeirdos <- beaverActive %>%
  filter(temp<37.5,temp>36.4)
ggplot() +
  geom_violin(data=noWeirdos, aes(x=factorActive,y=temp), width=0.5,
               colour="purple", position=position_nudge(x=0.26)) + 
  geom_violin(data=beaverActive, aes(x=factorActive,y=temp), width=0.5,
               colour="green", position=position_nudge(x=-0.26))

ggplot(beaverActive, 
       aes(x=factorActive ,y=temp, colour=(temp>=37.5 | temp<=36.4)) ) +
  geom_boxplot() 

# Should conclude that removing the points isn't necessary. You expect to see 
# some variation, the values don't seem abnormal for a mammal, none of them are
# even particularly far from the median when other factors are considered (day,
# activity). (0.5)
```

3. Generalized Linear Models (3 marks)

    ```{r}
    co2_df <- as_data_frame(as.matrix(CO2)) %>% 
        mutate(conc = as.integer(conc),
               uptake = as.numeric(uptake))
    ```

    a. Look through the help documentation (?CO2) to understand what each
    variable means. Which variable(s) do you think would be the $y$ in the GLM
    model? Which variable(s) would be the $x$? Briefly defend these choices. (1
    mark)

    b. How much does `uptake` change if `conc` goes up by 10 mL/L? (*Note:* it
    is intentional that there is no mention of the other variables in the
    model.) Write out the interpretation as a simple statement of this
    contribution of `conc` on `uptake`, when the other variables are also in the
    model. (2 marks)

    c. Run the following code if you need to download our survey data.
    
        ```{r message=FALSE, warning=FALSE}
        download.file("https://ndownloader.figshare.com/files/2292169", 
            "survey.csv") #if you need to re-download the survey
        survey <- read_csv("survey.csv")
        ```

        Use logistic regression to see if weight significantly predicts sex.
        Make a concluding statement as to indicate whether the model is
        significant and create a plot to visualize the linear model. Run the
        following code to ensure sex is treated as a factor variable:

        ```{r}
        survey$sex <- as.factor(survey$sex)
        ```

        Hint: you need to make sure there are only two levels to this variable:
        `"F"` and `"M"`. (0.5 marks)
        
4. Linear mixed-effects models (4 marks).

    Santangelo _et al._ (2018) were interested in understanding how plant
    defenses, herbivores, and pollinators influence the expression of plant
    floral traits (e.g. flower size). Their experiment had 3 treatments, each
    with 2 levels: Plant defense (2 levels: defended vs. undefended), herbivory
    (2 levels: reduced vs. ambient) and pollination (2 levels: open vs.
    supplemental). These treatments were fully crossed for a total of 8
    treatment combinations. In each treatment combination, they grew 4
    individuals from each of 25 plant genotypes for a total of 800 plants (8
    treatment combinations x 25 genotypes x 4 individuals per genotype). Plants
    were grown in a common garden at the Koffler Scientific Reserve (UofTs field
    research station) and 6 floral traits were measured on all plants throughout
    the summer. We will analyze how the treatments influenced one of these
    traits in this exercise. Run the code chunk below to download the data,
    which includes only a subset of the columns from the full dataset:
    
    ```{r}
    library(tidyverse)
    
    plant_data <- "https://uoftcoders.github.io/rcourse/data/Santangelo_JEB_2018.csv"
    download.file(plant_data, "Santangelo_JEB_2018.csv")
    plant_data <- read_csv("Santangelo_JEB_2018.csv", 
                           col_names = TRUE)
    glimpse(plant_data)
    head(plant_data)
    ```

    You can see that the data contain 792 observations (i.e. plants, 8 died
    during the experiment). There are 50 genotypes across 3 treatments:
    Herbivory, Pollination, and HCN (i.e. hydrogen cyanide, a plant defense).
    There are 6 plant floral traits: Number of days to first flower, banner
    petal length, banner petal width, plant biomass, number of flowers, and
    number of inflorescences. Finally, since plants that are closer in space in
    the common garden may have similar trait expression due to more similar
    environments, the authors included 6 spatial "blocks" to account for this
    environmental variation (i.e. Plant from block A "share" an environment and
    those from block B "share" an environment, etc.). Also keep in mind that
    each treatment combination contains 4 individuals of each genotype, which
    are likely to have similar trait expression due simply to shared genetics.
    
    a. Use the `lme4` and `lmerTest` R packages to run a linear mixed-effects
    model examining how herbivores (`Herbivory`), Pollinators (`Pollination`),
    plant defenses (`HCN`) _and all interactions_ influences the length of
    banner petals (`Avg.Bnr.Wdth`) produced by plants while accounting for
    variation due to spatial block and plant genotype. Also allow the intercept
    for `Genotype` to vary across the levels of the herbivory treatment. (1
    mark: 0.5 for correct fixed effects specification and 0.5 for correct random
    effects structure). You only need to specify the model for this part of the
    question.
    
```{r}
library(lme4)
library(lmerTest)

model <- lmer(Avg.Bnr.Wdth ~ HCN*Herbivory*Pollination + 
                  (1|Block) + (1|Genotype) + (1|Genotype:Herbivory),
              data = plant_data)
```
    

    b. Summarize (i.e. get the output) the model that you ran in part (a). Did
    any of the treatments have a significant effect on banner petal length? If
    so, which ones? Based on your examination of the model output, how can you
    tell which level of the significant treatments resulted in longer or shorter
    mean banner petal widths? Make a statement for each significant **main**
    effects in the model (i.e. not interactions) (0.5 marks).
    
```{r}
summary(model)

# Answer: Supplemental pollination resulted in a reduction in banner petal width.
```


    c. Using `dplyr` and `gglot2`, plot the mean banner width for one of the
    significant interactions in the model above (whichever you choose). The idea
    is to show how both treatments interact to influence the mean length of
    banner petals using a combination of different colours, linetypes, shapes,
    etc. Feel free to use whatever kind of plot that is appropriate to this kind
    of data. Also include 1 standard error around the mean. As a reminder, I
    have included the formula to calculate the standard error of the mean below.
    (1.5 marks). **Bonus**: Avoid overlap in the points in the figure (0.25
    marks).
    
```{r}
plant_data %>% 
    group_by(Herbivory, HCN) %>% 
    summarise(mean = mean(Avg.Bnr.Wdth, na.rm = TRUE),
              sd = sd(Avg.Bnr.Wdth, na.rm = TRUE),
              n = sum(!is.na(Avg.Bnr.Wdth)), # Tough!! Half marks for using n()
              se = sd / sqrt(n)) %>% 
    ggplot(., aes(x = HCN, y = mean, shape = Herbivory, color = Herbivory)) +
    geom_errorbar(aes(ymax = mean + se, ymin = mean - se), width = 0.15,
                  position = position_dodge(width = 0.15)) +
    geom_point(position = position_dodge(width = 0.15)) +
    theme_classic()
```
    

    $$ SE = \frac{sd}{\sqrt{n}}  $$

    d. After accounting for the fixed effects, how much of the variation in
    banner petal length was explained by each of the random effects in the
    model? Show your work (0.5 marks).
    
```{r}
total_var = 0.003088 + 0.067091 + 0.003231 + 0.044998
Genotype = 0.067091 / total_var
Genotype_herb = 0.003088 / total_var
Block = 0.003231 / total_var

Genotype
Genotype_herb
Block
```
    

    e. Descibe the pattern you see in the figure generated in part (c). Why do
    you think the interaction you plotted was significant in the model? Suggest
    one plausible ecological explanation for the observed pattern. (0.5 marks)


# Question Ideas
These could be put in a question bank or something so that future, more creative
people can take them and make them into useful questions?
## Visualization
Think of any one change you could make to the last plot that would improve your
ability to understand the relationship between time and temperature. Make the
adjustment and plot again below. (0.5 mark)
```{r eval=FALSE}
# Just factor in any of the other variables from the data set (day, activity), 
# likely either by colouring the values that way or perhaps by faceting. Could
# also add a geom_smooth to the scatterplot.
ggplot(beaverActive, aes(x=time,y=temp,colour= factor(day) )) +
  geom_point()

ggplot(beaverActive, aes(x=time,y=temp)) +
  geom_point() +
  facet_wrap(~ factor(day), scales="free_x" )

ggplot(beaverActive, aes(x=time,y=temp, colour=factorActive)) +
  geom_point()

ggplot(beaverActive, aes(x=time,y=temp, colour=factorActive)) +
  geom_point() + 
  geom_smooth()
```
Create a box plot that will help you understand whether patterns in your data
might offer some support this prediction: "activity is a better predictor of
body temperature than day" (0.5 mark)
```{r eval=FALSE}
# Answer-y bit
# Just looking for the ability to tease apart two factors (activity, day) using
# x & colour (or faceting)
ggplot(beaverActive, aes(x= factor(day) ,y=temp, colour=factorActive)) +
  geom_violin() +
  geom_point(position="jitter", alpha=0.6)

ggplot(beaverActive, aes(x= factor(day) ,y=temp)) +
  geom_boxplot() +
  facet_wrap(~factorActive)
```