ch_linear-models.Rmd

# Linear Models

```{r setup, include = FALSE}

library(tidyverse)
library(emmeans)
library(broom)
library(knitr)

knitr::opts_chunk$set(echo = TRUE, cache = TRUE)

df_trial <- 
  read_csv(
    file = "data/linear/drug_trial.csv",
    col_types = cols(drug = col_factor(), 
                     pre  = col_double(), 
                     post = col_double(), 
                     sex  = col_factor()))

df_disease <- 
  read_csv(
    file = "data/linear/disease.csv",
    col_types = cols(drug    = col_factor(),
                     disease = col_factor(),
                     y       = col_double()))
```

## Topic 1 (BV)

### Getting Started {-}

## Sums of Squares Types

### Getting Started {-}

To demonstrate the various types of sums of squares, we'll create a data frame called `df_disease` taken from the SAS documentation (__reference__). The summary of the data is shown.

```{r, echo=FALSE}
df_disease %>% head(3)
summary(df_disease)
```

### The Model {-}

For this example, we're testing for a significant difference in `stem_length` using ANOVA.  In R, we're using `lm()` to run the ANOVA, and then using `broom::glance()` and `broom::tidy()` to view the results in a table format.

```{r}
lm_model <- lm(y ~ drug + disease + drug*disease, df_disease)
```

The `glance` function gives us a summary of the model diagnostic values.

```{r}
lm_model %>% glance()
```

The `tidy` function gives a summary of the model results.

```{r}
lm_model %>% tidy()
```


### The Results {-}

You'll see that R print the individual results for each level of the drug and disease interaction.  We can get the combined F table in R using the `anova()` function on the model object.

```{r}
lm_model %>% 
  anova() %>% 
  tidy() %>% 
  kable()
```

And with some extra work, we can get a `Total` row to match the F table output by SAS.

```{r}
lm_model %>%
  anova() %>%
  tidy() %>%
  add_row(term = "Total", df = sum(.$df), sumsq = sum(.$sumsq)) %>% 
  kable()
```

Comparing this to the output in SAS, we see the following.

```{r, eval=FALSE}
proc glm;
   class drug disease;
   model y=drug disease drug*disease / ss1 ss2 ss3 ss4;
run;
```

```{r echo=FALSE, fig.align='center', out.width="90%"}
knitr::include_graphics("images/linear/sas-f-table.png")
```

### Sums of Squares Tables  {-}

In SAS, it is easy to find the tables for the variouse types of sums of squares calculations. Unfortunately, it is not easy to match this output in using functions from base R.  However, the `rstatix` package offers a solution to produce these various sums of squares tables.  Note that there does not appear to be a `Type IV SS` equivalent in R.  

#### Type I

In R,

```{r message=FALSE, warning=FALSE}
df_disease %>% 
  rstatix::anova_test(
    y ~ drug + disease + drug*disease, 
    type = 1, 
    detailed = TRUE) %>% 
  rstatix::get_anova_table() %>% 
  kable()
```

And in SAS, 

```{r, echo=FALSE, fig.align='center', out.width="75%"}
knitr::include_graphics("images/linear/sas-ss-type-1.png")
```


#### Type II {-}

In R,

```{r message=FALSE, warning=FALSE}
df_disease %>% 
  rstatix::anova_test(
    y ~ drug + disease + drug*disease, 
    type = 2, 
    detailed = TRUE) %>% 
  rstatix::get_anova_table() %>% 
  kable()
```

And in SAS, 

```{r, echo=FALSE, fig.align='center', out.width="75%"}
knitr::include_graphics("images/linear/sas-ss-type-2.png")
```


#### Type III {-}

In R,

```{r message=FALSE, warning=FALSE}
df_disease %>% 
  rstatix::anova_test(
    y ~ drug + disease + drug*disease, 
    type = 3, 
    detailed = TRUE) %>% 
  rstatix::get_anova_table() %>% 
  kable()
```

And in SAS, 

```{r, echo=FALSE, fig.align='center', out.width="75%"}
knitr::include_graphics("images/linear/sas-ss-type-3.png")
```

#### Type IV {-}

In SAS,

```{r, echo=FALSE, fig.align='center', out.width="75%"}
knitr::include_graphics("images/linear/sas-ss-type-4.png")
```

In R there is no equivalent operation to the `Type IV` sums of squares calculation in SAS.

## Contrasts

### Getting Started {-}

To demonstrate contrasts, we'll create a data frame called `df_trial`. We see that the `drug` variable has three levels, _A_, _C_ and _E_.

```{r, echo=FALSE}
levels_drug <- levels(df_trial$drug)
summary(df_trial)
glimpse(df_trial)
```

In order to work with these levels as contrasts, we can use one of the pre-existing R functions to create the identity matrix for this factor variable. The `contr.treatment` definition is the default one that R uses, while the `contr.SAS` is the definition that SAS uses by default.

```{r}
contr.treatment(levels_drug)
contr.SAS(levels_drug)
```

R also has the following definitions ready for use: `contr.sum`, `contr.poly`, and `contr.helmert`.


### Using Contrasts in R {-}

There are many ways to work with contrasts in R, but here we're only going to focus on defining contrasts in the modeling function.  In this example, we're creating a linear model to predict post values from the pre and drug values. The only difference between the two is in the contrasts argument.  In one case, we're using the R default, and in the other we're using the SAS default.

```{r}
model_trt <- 
  lm(post ~ pre + drug, data = df_trial,
     contrasts = list(drug = contr.treatment))

model_sas <- 
  lm(post ~ pre + drug, data = df_trial,
     contrasts = list(drug = contr.SAS))
```

Comparing the output for these two models, we see that differ in which drug level is being used as the reference baseline.

```{r}
tidy(model_trt)
tidy(model_sas)
```

We can define a custom contrast as well.  In this case, we are comparing _A_ to _E_, where _E_ is the baseline group. As is typical of contrasts, the values must sum to 0, where negative numbers represent the baseline groups.

```{r, eval=FALSE}
lm(post ~ pre + drug, data = df_trial,
   constrasts = list(drug = c("A" = 1, "C" = "0", "E" = -1)))
```

It is possible to combine contrasts using `cbind`.  Notice here that contrast definitions don't require level names to be assigned, as long as the order of the levels is maintained. 

```{r, eval=FALSE}
contrast_1 <- c(1,  0, -1)
contrast_2 <- c(1, -2,  1)

contrast_c <- cbind(contrast_1, contrast_2)

lm(post ~ pre + drug, data = df_trial,
   contrasts = contrast_c)
```

### Easy Contrasts in R with `emmeans` {-}

The `emmeans` package makes working with contrasts much easier.  In fact, this is the method that we recommend when trying to match contrast output between SAS and R.

We begin by defining some models.

```{r}
model_lm <-  lm(post ~ pre + drug + sex, data = df_trial)
model_av <- aov(post ~ pre + drug + sex, data = df_trial)
model_gm <- glm(post ~ pre + drug + sex, data = df_trial)
```

Then we convert these models to `emmeans` models.  In the `emmeans` function we can specify which variables we want to display estimated marginal means for.

```{r, eval=FALSE}
model_lm %>% emmeans(specs = "drug")
model_av %>% emmeans(specs = "drug", by = "sex")
model_gm %>% emmeans(specs = ~ drug | sex)
```

Or, we can define some common contrasts with the `contrast` function.

```{r, eval=FALSE}
# All Pairwise Comparisons
model_lm %>% 
  emmeans("drug") %>% 
  contrast(method = "pairwise")     

# Treatment v Control Comparison
model_gm %>% 
  emmeans("drug") %>% 
  contrast(method = "trt.vs.ctrl")  
```

We can also control the reference group, or reverse the contrast order using the arguments in the `contrast` function.

```{r, eval=FALSE}
model_lm %>% 
  emmeans("drug") %>% 
  contrast(method = "trt.vs.ctrl", ref = 2)     

model_lm %>% 
  emmeans("drug") %>% 
  contrast(method = "trt.vs.ctrl", rev = T)     
```

Custom contrasts can be defined as well.

```{r, eval=FALSE}
model_lm %>% 
  emmeans("drug") %>% 
  contrast(method = list(
    "A v E"  = c("A" = 1, "C" = 0, "E" = -1),
    "AE v C" = c(1, -2, 1),
    "A"      = c(1, 0, 0)
  ))
```

### Matching Contrasts: R and SAS {-}

It is recommended to use the `emmeans` package when attempting to match contrasts between R and SAS.  In SAS, all contrasts must be manually defined, whereas in R, we have many ways to use pre-existing contrast definitions.   The `emmeans` package makes simplifies this process, and provides syntax that is similar to the syntax of SAS.

This is how we would define a contrast in SAS.

```{r, eval=FALSE}
# In SAS
proc glm data=work.mycsv;
   class drug;
   model post = drug pre / solution;
   estimate 'C vs A'  drug -1  1 0;
   estimate 'E vs CA' drug -1 -1 2;
run;
```

And this is how we would define the same contrast in R, using the `emmeans` package.

```{r, eval=FALSE}
lm(formula = post ~ pre + drug, data = df_trial) %>% 
  emmeans("drug") %>% 
  contrast(method = list(
    "C vs A"  = c(-1,  1, 0),
    "E vs CA" = c(-1, -1, 2)
  ))
```

Note, however, that there are some cases where the scale of the parameter estimates between SAS and R is off, though the test statistics and p-values are identical.  In these cases, we can adjust the SAS code to include a divisor.  As far as we can tell, this difference only occurs when using the predefined Base R contrast methods like `contr.helmert`.

```{r, eval=FALSE}
proc glm data=work.mycsv;
   class drug;
   model post = drug pre / solution;
   estimate 'C vs A'  drug -1  1 0 / divisor = 2;
   estimate 'E vs CA' drug -1 -1 2 / divisor = 6;
run;
```