Add session 4 ggplot

DTU-advR · Nov 1, 2018 · 940b4ac
1 parent f3a9f72
commit 940b4ac
Showing 4 changed files with 724 additions and 0 deletions.
diff --git a/04_ggplot/exercises/ggplot_exercises.Rmd b/04_ggplot/exercises/ggplot_exercises.Rmd
@@ -0,0 +1,76 @@
+---
+title: "Exercises: Visualising data (ggplot)"
+output: github_document
+---
+
+```{r setup, include=FALSE}
+knitr::opts_chunk$set(echo = TRUE)
+```
+
+Ok, so now it is your turn!
+
+_Remember, for the following exercises, inspiration for the code is available in the slides for this session_
+
+Since we now know how to load data and are about to work on how we manipulate (wrangle) data, now is a good time for you to make your first script (a small program).
+
+- In the upper left corner of RStudio, there is a small icon that looks like a piece of paper with a green plus on it. Click it and choose the first option, "R Script" (you can also use the shortcut shown there; on a Mac it is command+shift+n). This opens a new, empty text file in the RStudio editor, in which we can put code, save it and run it
+
+- A recipe for a script could look like the following. Copy/paste the code below into your new empty script file (Note: everything after a '#' on a line is ignored, so we use it for commenting our code)
+
+```{r, message=FALSE}
+# Clear workspace
+# ------------------------------------------------------------------------------
+rm(list=ls())
+
+# Load libraries
+# ------------------------------------------------------------------------------
+library('tidyverse')
+
+# Load data (session 2 - readr)
+# ------------------------------------------------------------------------------
+
+# Wrangle data (session 3 - dplyr)
+# ------------------------------------------------------------------------------
+
+# Visualise data (session 4 - ggplot)
+# ------------------------------------------------------------------------------
+
+# Write data (session 2 - readr)
+# ------------------------------------------------------------------------------
+
+```
+
+- Click the save icon and save your new script as "my_script.R"
+
+- In RStudio, in the icon line just above your script file, there is another icon that looks like a piece of paper, with a blue arrow and the word "Source". Click "Source" to run each line in the script (ignoring lines beginning with '#'). For now, this simply clears the workspace of any variables and loads the Tidyverse library every time you "Source" the script
+
+### Question 1
+
+1. In the slides for this dplyr session, find where we load the kommune data directly from the web and write the command in the Console (Hint: All readr read-a-file-functions, start with 'read_')
+2. In the slides for the readr session, find where we write a data set as a tab-separated-values file and save the kommune data to disk as 'kommune_data.tsv', write the command in the Console (Hint: All readr write-a-file-functions start with 'write_')
+3. In your script under 'Load data', write the command for loading your data file 'kommune_data.tsv' from disk and save it in a variable called `my_data` (Hint: This is completely analogous to reading a file from the web)
+4. Source the script and in the Console, simply write `my_data` and hit return
+
+__Q1__ How many rows and columns are in the kommune data set?
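+
+If the write/read round trip is unclear, the following is a minimal sketch on a small made-up tibble. The names `toy_data`, `toy_file` and `toy_back` and the temporary file are assumptions for illustration only; this is not the kommune data.
+
+```{r, message=FALSE}
+library('tidyverse')
+
+# Hypothetical toy data, NOT the kommune data
+toy_data = tibble(name = c('a', 'b', 'c'), value = c(1.5, 2.5, 3.5))
+
+# Write the toy data to a temporary tab-separated-values file
+toy_file = tempfile(fileext = '.tsv')
+write_tsv(toy_data, toy_file)
+
+# Read the file back from disk and check the number of rows and columns
+toy_back = read_tsv(toy_file)
+dim(toy_back)
+```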
+
+### Question 2
+
+1. Under 'Wrangle data' in your script, using the `mutate()` function write the command for calculating a new variable `inc_exp_ratio`, which is the ratio between `Indt_indkskat` and `Folkeskudg_elev`, i.e. `Indt_indkskat` divided by `Folkeskudg_elev` (Remember to save the result to your `my_data` variable)
+2. In the Console, write `my_data` and hit return
+
+__Q2__ What is the value of this new variable for 'Koebenhavn'?
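+
+As a hint, here is a minimal sketch of how `mutate()` adds a ratio column. The tibble `toy_data` and its columns `income` and `cost` are made up for illustration and are not columns of the kommune data.
+
+```{r, message=FALSE}
+library('tidyverse')
+
+# Hypothetical toy data with two numeric columns
+toy_data = tibble(id = c('x', 'y', 'z'),
+                  income = c(10, 20, 30),
+                  cost = c(5, 40, 10))
+
+# Add a new column holding income divided by cost and
+# save the result back to the same variable
+toy_data = toy_data %>% mutate(ratio = income / cost)
+toy_data
+```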
+
+### Question 3
+
+1. Under 'Wrangle data' in your script, using the `filter()` function, write the command for identifying all the municipalities, where the value of your new variable is larger than 1
+2. Under 'Wrangle data' in your script, using the `filter()` function, write the command for identifying all the municipalities, where more than half have a long education and less than 1 in 5 pupils attend private school
+
+__Q3A__ How many municipalities have a ratio larger than 1?
+
+__Q3B__ In how many municipalities do more than half have a long education and less than 1 in 5 pupils attend private school?
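+
+A minimal sketch of `filter()` with one and with two conditions, again on made-up data (`toy_data` and its columns are assumptions, not the kommune data). Conditions separated by a comma must all be true for a row to be kept.
+
+```{r, message=FALSE}
+library('tidyverse')
+
+# Hypothetical toy data
+toy_data = tibble(id = c('x', 'y', 'z'),
+                  income = c(10, 20, 30),
+                  cost = c(5, 40, 10))
+
+# One condition: keep rows where income is larger than 15
+one_cond = toy_data %>% filter(income > 15)
+
+# Two conditions: keep rows where income is larger than 15 AND cost is below 20
+two_cond = toy_data %>% filter(income > 15, cost < 20)
+
+# Counting the remaining rows answers 'how many' questions
+nrow(one_cond)
+nrow(two_cond)
+```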
+
+### Question 4
+
+1. Under 'Wrangle data' in your script, using the `group_by()`, `summarise()` and `arrange()` functions, write the command for calculating the average % of students attending private school stratified on `Region` and sort the result by descending values (largest first, smallest last)
+
+__Q4__ What is the order of Regions?
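+
+The group, summarise and sort pattern can be sketched as follows. The tibble `toy_data` and its columns `group` and `pct` are made up for illustration; `desc()` is what gives the largest-first ordering.
+
+```{r, message=FALSE}
+library('tidyverse')
+
+# Hypothetical toy data: a grouping column and a percentage column
+toy_data = tibble(group = c('a', 'a', 'b', 'b', 'c'),
+                  pct = c(10, 20, 30, 50, 5))
+
+# Average pct within each group, sorted with the largest average first
+toy_summary = toy_data %>%
+  group_by(group) %>%
+  summarise(mean_pct = mean(pct)) %>%
+  arrange(desc(mean_pct))
+toy_summary
+```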
diff --git a/04_ggplot/lecture/figures/pie_chart.png b/04_ggplot/lecture/figures/pie_chart.png
diff --git a/04_ggplot/lecture/ggplot_presentation.Rmd b/04_ggplot/lecture/ggplot_presentation.Rmd
@@ -0,0 +1,236 @@
+---
+title: "Visualising data (ggplot)"
+subtitle: "CBio Thursday, session 4"
+author: "Leon Eyrich Jessen & Johannes Eichler Waage"
+date: "November 1st 2018"
+output:
+  ioslides_presentation:
+    widescreen: true
+    smaller: true
+---
+
+```{r setup, include=FALSE}
+knitr::opts_chunk$set(echo = FALSE)
+set.seed(978552)
+library('tidyverse')
+library('readxl')
+```
+
+# Before we move into ggplot
+
+## Wide vs long data format
+
+```{r, echo=TRUE}
+wide_data = tibble(samples = str_c('smpl_', 1:5), var_1 = round(rnorm(5),2),
+                   var_2 = round(rnorm(5),2), var_3 = round(rnorm(5),2))
+wide_data
+```
+
+Recall, I told you that each row is an observation and each column is a variable? Well...
+
+## There is another Skywalker (Sorry)
+
+We can convert the wide format to what is known as long format like so:
+
+```{r, echo=TRUE}
+long_data = wide_data %>% gather(key = var, value = value, -samples)
+long_data
+```
+
+## Long format
+
+This is particularly useful if you have longitudinal data, as it allows you to flatten the data from a cube with time slices to a matrix. Example:
+
+```{r, echo=TRUE}
+wide_data_t1 = tibble(samples = str_c('smpl_', 1:3), var_1 = round(rnorm(3),2),
+                      var_2 = round(rnorm(3),2))
+wide_data_t2 = tibble(samples = str_c('smpl_', 1:3), var_1 = round(rnorm(3),2),
+                      var_2 = round(rnorm(3),2))
+wide_data_t3 = tibble(samples = str_c('smpl_', 1:3), var_1 = round(rnorm(3),2),
+                      var_2 = round(rnorm(3),2))
+```
+
+Add a `time_point` variable
+
+```{r, echo=TRUE}
+wide_data_t1 = wide_data_t1 %>% mutate(time_point = 1)
+wide_data_t2 = wide_data_t2 %>% mutate(time_point = 2)
+wide_data_t3 = wide_data_t3 %>% mutate(time_point = 3)
+```
+
+Bind the three data frames and convert to long format
+
+```{r, echo=TRUE}
+long_data_t = bind_rows(wide_data_t1, wide_data_t2, wide_data_t3) %>%
+  gather(key = var, value = value, -samples, -time_point)
+```
+
+## Long format
+
+```{r, echo=TRUE}
+long_data_t
+```
+
+## Long format
+
+Depending on what you want to visualise, you may have to use these long-to-wide or wide-to-long conversions.
+
+So, let us just convert the `long_data_t` back to wide format, so you can see how that is done:
+
+```{r, echo=TRUE}
+long_data_t %>% spread(key = var, value = value)
+```
+
+## Summary - Long and wide data
+
+- Convert from wide to long: `gather()`
+- Convert from long to wide: `spread()`
+
+See previous slides for examples.
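+
+A side note, assuming tidyr version 1.0.0 or later is installed: that release added `pivot_longer()` and `pivot_wider()` as newer alternatives to `gather()` and `spread()`. A minimal sketch of the same two conversions on a made-up tibble (`wide_toy` and its values are for illustration only):
+
+```{r, echo=TRUE}
+library('tidyverse')
+
+# Hypothetical wide data: one row per sample, one column per variable
+wide_toy = tibble(samples = str_c('smpl_', 1:3),
+                  var_1 = c(0.1, 0.2, 0.3),
+                  var_2 = c(1.1, 1.2, 1.3))
+
+# Wide to long: every column except 'samples' becomes var/value pairs
+long_toy = wide_toy %>%
+  pivot_longer(cols = -samples, names_to = 'var', values_to = 'value')
+
+# Long to wide: spread the var/value pairs back into columns
+wide_again = long_toy %>%
+  pivot_wider(names_from = var, values_from = value)
+wide_again
+```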
+
+Moving on to data visualisation...
+
+# Data Visualisation using ggplot
+
+## Cliché: A picture says more than 1,000 numbers
+
+Except
+
+## Cliché: A picture says more than 1,000 numbers
+
+Except pie charts... This, this is the only valid use of a pie chart I have seen:
+
+```{r, out.width = "600px", fig.align="center"}
+knitr::include_graphics("figures/pie_chart.png")
+```
+
+## What is `ggplot`?
+
+- The 'gg' in `ggplot` stands for Grammar-of-Graphics.
+
+- "A grammar of graphics is a tool that enables us to concisely describe the components of a graphic" [Hadley Wickham. A layered grammar of graphics. Journal of Computational and Graphical Statistics, vol. 19, no. 1, pp. 3–28, 2010.](http://vita.had.co.nz/papers/layered-grammar.html)
+
+- So, a structured framework for building graphical representations of data
+
+- Let us start with some examples
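+
+To make the idea of components concrete, here is a minimal sketch that builds a plot piece by piece from the built-in `ToothGrowth` data. The variable names `p_base` and `p_points` are chosen for illustration only.
+
+```{r, echo=TRUE}
+library('tidyverse')
+
+# Component 1 and 2: the data and the aesthetic mapping (which variable
+# goes on which axis). On its own this draws an empty coordinate system
+p_base = ggplot(data = ToothGrowth, mapping = aes(x = dose, y = len))
+
+# Component 3: a geometry layer that decides how the data are drawn
+p_points = p_base + geom_point()
+p_points
+```
+
+Keeping the base plot in a variable means that other geometry layers can be tried on the same data and mapping without repeating them.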
+
+## Example data for data visualisation 
+
+We will use the `ToothGrowth` data set "The Effect of Vitamin C on Tooth Growth in Guinea Pigs", 60 observations of 3 variables:
+
+- `len`, numeric, Tooth length
+- `supp`, factor, Supplement type (VC = ascorbic acid or OJ = Orange Juice)
+- `dose`, numeric, Dose in milligrams/day
+
+```{r, echo=TRUE}
+ToothGrowth %>% head(5)
+```
+
+## Example data for data visualisation
+
+We can use the `count()` function to investigate how the 60 guinea pigs are distributed in the `supp` and `dose` groups:
+
+```{r, echo=TRUE}
+ToothGrowth %>% count(supp, dose)
+```
+
+## A basic barchart
+
+We can visualise the counts using a simple barchart:
+
+```{r, echo=TRUE, out.width = "600px", fig.align="center"}
+ToothGrowth %>% count(dose, supp) %>%
+  ggplot(aes(x = dose, y = n, fill = supp)) +
+  geom_col(position = 'dodge')
+```
+
+## A basic scatterplot
+
+```{r, echo=TRUE, out.width = "600px", fig.align="center"}
+ToothGrowth %>%
+  ggplot(mapping = aes(x = dose, y = len)) +
+  geom_point()
+```
+
+It seems that dose has an effect (note how layers are added using `+`). Let us look at it in another way
+
+## A basic boxplot
+
+```{r, echo=TRUE, out.width = "600px", fig.align="center"}
+ToothGrowth %>%
+  ggplot(mapping = aes(x = dose, y = len, group = dose)) +
+  geom_boxplot()
+```
+
+Looks better, but it is difficult to make out the underlying distributions
+
+## A basic histogram
+
+```{r, echo=TRUE, out.width = "600px", fig.align="center"}
+ToothGrowth %>%
+  ggplot(mapping = aes(x = len, fill = factor(dose))) +
+  geom_histogram(alpha = 0.5, binwidth = 3)
+```
+
+Not very informative, so let us look at a density plot instead
+
+## A basic density plot
+
+```{r, echo=TRUE, out.width = "600px", fig.align="center"}
+ToothGrowth %>%
+  ggplot(mapping = aes(x = len, fill = factor(dose))) +
+  geom_density(alpha = 0.5)
+```
+
+It is a bit messy that the densities cover each other, so let us try a violin plot
+
+## A basic violinplot
+
+```{r, echo=TRUE, out.width = "600px", fig.align="center"}
+ToothGrowth %>%
+  ggplot(mapping = aes(x = dose, y = len, group = dose)) +
+  geom_violin()
+```
+
+But wait, we have different types of supplements
+
+## A grouped violinplot
+
+```{r, echo=TRUE, out.width = "600px", fig.align="center"}
+ToothGrowth %>%
+  ggplot(mapping = aes(x = factor(dose), y = len, fill = supp)) +
+  geom_violin()
+```
+
+It seems that not only the dose has an effect, but also the supplement type
+
+## A grouped boxplot
+
+```{r, echo=TRUE, out.width = "600px", fig.align="center"}
+ToothGrowth %>%
+  ggplot(mapping = aes(x = factor(dose), y = len, fill = supp)) +
+  geom_boxplot()
+```
+
+## A scatter plot with groups and models
+
+```{r, echo=TRUE, out.width = "600px", fig.align="center"}
+ToothGrowth %>%
+  ggplot(mapping = aes(x = dose, y = len, colour = supp)) +
+  geom_point() +
+  geom_smooth(method = "lm")
+```
+
+## A scatter plot with facets, models and densities
+
+```{r, echo=TRUE, out.width = "600px", fig.align="center"}
+ToothGrowth %>%
+  ggplot(mapping = aes(x = dose, y = len)) +
+  geom_violin(mapping = aes(x = dose, y = len, group = dose)) +
+  geom_point() +
+  geom_smooth(method = "lm") +
+  facet_wrap(~supp, nrow = 1)
+```
+
+## In summary
+
+We hope it is evident just how easy ggplot makes publication-ready data visualisations and how readable the code is!
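+
+As a last sketch, a plot usually needs readable labels and a clean theme before it goes into a report. The label texts, the choice of `theme_bw()`, the temporary file and the figure size below are all assumptions for illustration, not requirements.
+
+```{r, echo=TRUE}
+library('tidyverse')
+
+# The grouped boxplot, now with axis and legend labels and a plain theme
+p = ToothGrowth %>%
+  ggplot(mapping = aes(x = factor(dose), y = len, fill = supp)) +
+  geom_boxplot() +
+  labs(x = 'Dose (mg/day)', y = 'Tooth length', fill = 'Supplement') +
+  theme_bw()
+
+# Save the plot to a temporary png file; width and height are in inches
+out_file = tempfile(fileext = '.png')
+ggsave(filename = out_file, plot = p, width = 6, height = 4)
+```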
+
+Now, it is time for exercises!
diff --git a/04_ggplot/lecture/ggplot_presentation.html b/04_ggplot/lecture/ggplot_presentation.html