Add session 4 ggplot
leonjessen committed Nov 1, 2018
1 parent f3a9f72 commit 940b4ac
Showing 4 changed files with 724 additions and 0 deletions.
76 changes: 76 additions & 0 deletions 04_ggplot/exercises/ggplot_exercises.Rmd
@@ -0,0 +1,76 @@
---
title: "Exercises: Visualising data (ggplot)"
output: github_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

Ok, so now it is your turn!

_Remember, for the following exercises, inspiration for the code is available in the slides for this session_

Since we now know how to load data and are working on how to manipulate (wrangle) it, now is a good time for you to make your first script (a small program).

- In the upper left corner of RStudio, there is a small icon looking like a piece of paper with a green plus on it. Click it and choose the first option, "R Script" (you can also use the shortcut shown next to it; on a Mac it is command+shift+n). This opens a new empty text file in the RStudio editor, in which we can write code, save it and run it

- A recipe for a script could look like the following. Copy/paste the code below into your new empty script file (Note, the '#' means the line is ignored, so we use this for commenting our code)

```{r, message=FALSE}
# Clear workspace
# ------------------------------------------------------------------------------
rm(list=ls())
# Load libraries
# ------------------------------------------------------------------------------
library('tidyverse')
# Load data (session 2 - readr)
# ------------------------------------------------------------------------------
# Wrangle data (session 3 - dplyr)
# ------------------------------------------------------------------------------
# Visualise data (session 4 - ggplot)
# ------------------------------------------------------------------------------
# Write data (session 2 - readr)
# ------------------------------------------------------------------------------
```

- Click the save icon and save your new script as "my_script.R"

- In RStudio, in the icon line just above your script file, there is another icon looking like a piece of paper, with a blue arrow and the word "Source". Click "Source" to run each of the lines in the script (ignoring lines beginning with '#'). For now, it will simply clear the workspace of any variables and load the Tidyverse library every time you "Source" the script

### Question 1

1. In the slides for the dplyr session, find where we load the kommune data directly from the web and write the command in the Console (Hint: All readr read-a-file functions start with 'read_')
2. In the slides for the readr session, find where we write a data set as a tab-separated-values file and save the kommune data to disk as 'kommune_data.tsv', write the command in the Console (Hint: All readr write-a-file functions start with 'write_')
3. In your script under 'Load data' write the command for loading your data file 'kommune_data.tsv' from disk and save it in a variable called `my_data` (Hint: This is completely analogous to reading a file from the web)
4. Source the script and in the Console, simply write `my_data` and hit return

__Q1__ How many rows and columns are in the kommune data set?
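
As a hint for the write-then-read pattern above, here is a minimal sketch using a small made-up tibble. The names `toy_data`, `toy_file` and `toy_copy` are illustrative assumptions and have nothing to do with the kommune data; the file is written to a temporary directory so nothing on your disk is overwritten.

```{r, message=FALSE}
library('tibble')
library('readr')

# Hypothetical toy data, standing in for a real data set
toy_data = tibble(name = c('a', 'b', 'c'), value = c(1.5, 2.5, 3.5))

# Write the data as a tab-separated-values file in a temporary directory
toy_file = file.path(tempdir(), 'toy_data.tsv')
write_tsv(toy_data, toy_file)

# Read the file back from disk and store it in a new variable
toy_copy = read_tsv(toy_file)

# Check the dimensions (rows, columns) of what we read
dim(toy_copy)
```

Typing the variable name in the Console, or calling `dim()` as above, is one way to see how many rows and columns a data set has.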

### Question 2

1. Under 'Wrangle data' in your script, using the `mutate()` function write the command for calculating a new variable `inc_exp_ratio`, which is the ratio between `Indt_indkskat` and `Folkeskudg_elev`, i.e. `Indt_indkskat` divided by `Folkeskudg_elev` (Remember to save the result to your `my_data` variable)
2. In the Console, write `my_data` and hit return

__Q2__ What is the value of this new variable for 'Koebenhavn'?
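
If you are unsure how `mutate()` builds a ratio of two columns, the sketch below may help. It uses the built-in `ToothGrowth` data set rather than the kommune data, and the names `toy_growth` and `len_dose_ratio` are made up for illustration.

```{r, message=FALSE}
library('dplyr')

# Add a new column that is one existing column divided by another,
# and save the result to a variable
toy_growth = as_tibble(ToothGrowth) %>%
  mutate(len_dose_ratio = len / dose)

toy_growth
```

Note that the result is only kept because it is assigned to a variable; `mutate()` does not change the original data in place.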

### Question 3

1. Under 'Wrangle data' in your script, using the `filter()` function, write the command for identifying all the municipalities, where the value of your new variable is larger than 1
2. Under 'Wrangle data' in your script, using the `filter()` function, write the command for identifying all the municipalities where more than half have a long education and fewer than 1 in 5 pupils attend private school

__Q3A__ How many municipalities have a ratio larger than 1?

__Q3B__ In how many municipalities do more than half have a long education and fewer than 1 in 5 pupils attend private school?
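
A hedged sketch of filtering on two conditions at once, again on the built-in `ToothGrowth` data set and not on the kommune data. The thresholds and the variable name `long_oj` are arbitrary choices made for this example.

```{r, message=FALSE}
library('dplyr')

# Keep only rows where BOTH conditions hold:
# conditions separated by a comma are combined with a logical AND
long_oj = ToothGrowth %>%
  filter(len > 25, supp == 'OJ')

# The number of rows tells us how many observations passed the filter
nrow(long_oj)
```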

### Question 4

1. Under 'Wrangle data' in your script, using the `group_by()`, `summarise()` and `arrange()` functions, write the command for calculating the average % of students attending private school stratified on `Region` and sort the result by descending values (largest first, smallest last)

__Q4__ What is the order of Regions?
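
The group, summarise and sort pattern can be sketched as follows. This is only an illustration on the built-in `ToothGrowth` data set; `mean_by_supp` and `mean_len` are names invented for this example.

```{r, message=FALSE}
library('dplyr')

# Average tooth length per supplement type, largest average first
mean_by_supp = ToothGrowth %>%
  group_by(supp) %>%
  summarise(mean_len = mean(len)) %>%
  arrange(desc(mean_len))

mean_by_supp
```

Wrapping a column in `desc()` inside `arrange()` sorts from largest to smallest; without it, the sort is from smallest to largest.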
Binary file added 04_ggplot/lecture/figures/pie_chart.png
236 changes: 236 additions & 0 deletions 04_ggplot/lecture/ggplot_presentation.Rmd
@@ -0,0 +1,236 @@
---
title: "Visualising data (ggplot)"
subtitle: "CBio Thursday, session 4"
author: "Leon Eyrich Jessen & Johannes Eichler Waage"
date: "November 1st 2018"
output:
  ioslides_presentation:
    widescreen: true
    smaller: true
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
set.seed(978552)
library('tidyverse')
library('readxl')
```

# Before we move into ggplot

## Wide vs long data format

```{r, echo=TRUE}
wide_data = tibble(samples = str_c('smpl_', 1:5), var_1 = round(rnorm(5),2),
var_2 = round(rnorm(5),2), var_3 = round(rnorm(5),2))
wide_data
```

Recall, I told you that each row is an observation and each column is a variable? Well...

## There is another Skywalker (Sorry)

We can convert the wide format to what is known as long format like so:

```{r, echo=TRUE}
long_data = wide_data %>% gather(key = var, value = value, -samples)
long_data
```

## Long format

This is particularly useful if you have longitudinal data, as it allows you to flatten the data from a cube with time slices to a matrix. Example:

```{r, echo=TRUE}
wide_data_t1 = tibble(samples = str_c('smpl_', 1:3), var_1 = round(rnorm(3),2),
var_2 = round(rnorm(3),2))
wide_data_t2 = tibble(samples = str_c('smpl_', 1:3), var_1 = round(rnorm(3),2),
var_2 = round(rnorm(3),2))
wide_data_t3 = tibble(samples = str_c('smpl_', 1:3), var_1 = round(rnorm(3),2),
var_2 = round(rnorm(3),2))
```

Add a `time_point` variable

```{r, echo=TRUE}
wide_data_t1 = wide_data_t1 %>% mutate(time_point = 1)
wide_data_t2 = wide_data_t2 %>% mutate(time_point = 2)
wide_data_t3 = wide_data_t3 %>% mutate(time_point = 3)
```

Bind the three data frames and convert to long format

```{r, echo=TRUE}
long_data_t = bind_rows(wide_data_t1, wide_data_t2, wide_data_t3) %>%
gather(key = var, value = value, -samples, -time_point)
```

## Long format

```{r, echo=TRUE}
long_data_t
```

## Long format

Depending on what you want to visualise, a long-to-wide or wide-to-long conversion may be needed.

So, let us just convert the `long_data_t` back to wide format, so you can see how that is done:

```{r, echo=TRUE}
long_data_t %>% spread(key = var, value = value)
```

## Summary - Long and wide data

- Convert from wide to long: `gather()`
- Convert from long to wide: `spread()`

See previous slides for examples.
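
A side note for newer installations: from tidyr 1.0.0 onwards, `pivot_longer()` and `pivot_wider()` are the recommended successors of `gather()` and `spread()`, which still work. The sketch below shows the same round trip with the newer functions; the toy tibble and the names `toy_wide`, `toy_long` and `toy_back` are made up for illustration.

```{r, echo=TRUE}
library('tibble')
library('tidyr')
library('dplyr')

# A small made-up wide data set: one row per sample, one column per variable
toy_wide = tibble(samples = c('smpl_1', 'smpl_2'),
                  var_1 = c(0.1, 0.2),
                  var_2 = c(1.1, 1.2))

# Wide to long: all columns except 'samples' are stacked into var/value pairs
toy_long = toy_wide %>%
  pivot_longer(cols = -samples, names_to = 'var', values_to = 'value')

# Long to wide: the var/value pairs are spread back into separate columns
toy_back = toy_long %>%
  pivot_wider(names_from = var, values_from = value)

toy_long
```

One visible difference is the row order: `pivot_longer()` keeps the rows of each sample together, whereas `gather()` stacks one variable after the other.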

Moving on to data visualisation...

# Data Visualisation using ggplot

## Cliché: A picture says more than 1,000 numbers

Except

## Cliché: A picture says more than 1,000 numbers

Except pie charts... This, this is the only valid use of a pie chart I have seen:

```{r, out.width = "600px", fig.align="center"}
knitr::include_graphics("figures/pie_chart.png")
```

## What is `ggplot`?

- The 'gg' in `ggplot` stands for Grammar-of-Graphics.

- "A grammar of graphics is a tool that enables us to concisely describe the components of a graphic" [Hadley Wickham. A layered grammar of graphics. Journal of Computational and Graphical Statistics, vol. 19, no. 1, pp. 3–28, 2010.](http://vita.had.co.nz/papers/layered-grammar.html)

- So, a structured framework for building graphical representations of data

- Let us start with some examples

## Example data for data visualisation

We will use the `ToothGrowth` data set "The Effect of Vitamin C on Tooth Growth in Guinea Pigs", 60 observations of 3 variables:

- `len`, numeric, Tooth length
- `supp`, factor, Supplement type (VC = ascorbic acid or OJ = Orange Juice)
- `dose`, numeric, Dose in milligrams/day

```{r, echo=TRUE}
ToothGrowth %>% head(5)
```

## Example data for data visualisation

We can use the `count()` function to investigate how the 60 guinea pigs are distributed in the `supp` and `dose` groups:

```{r, echo=TRUE}
ToothGrowth %>% count(supp, dose)
```
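
For those who wonder what `count()` does behind the scenes: it is roughly a shortcut for grouping and then counting the rows in each group. A minimal sketch of that equivalence, where `counts_short` and `counts_long` are names made up for this illustration:

```{r, echo=TRUE, message=FALSE}
library('dplyr')

# The shortcut
counts_short = ToothGrowth %>% count(supp, dose)

# The longer, explicit version: group, count rows per group, drop the grouping
counts_long = ToothGrowth %>%
  group_by(supp, dose) %>%
  summarise(n = n()) %>%
  ungroup()

counts_long
```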

## A basic barchart

We can visualise the counts using a simple barchart:

```{r, echo=TRUE, out.width = "600px", fig.align="center"}
ToothGrowth %>% count(dose, supp) %>%
ggplot(aes(x = dose, y = n, fill = supp)) +
geom_col(position = 'dodge')
```

## A basic scatterplot

```{r, echo=TRUE, out.width = "600px", fig.align="center"}
ToothGrowth %>%
ggplot(mapping = aes(x = dose, y = len)) +
geom_point()
```

It seems that dose has an effect (note how layers are added using `+`). Let us look at it in another way
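
To make the idea of layers a bit more concrete, here is a small sketch that builds the same scatterplot in two steps. The variable names `base_plot` and `scatter_plot` are invented for this illustration.

```{r, echo=TRUE, out.width = "600px", fig.align="center"}
library('ggplot2')

# Step 1: data and aesthetic mapping only, no layers yet (an empty canvas)
base_plot = ggplot(data = ToothGrowth, mapping = aes(x = dose, y = len))

# Step 2: add a layer of points on top of the canvas using '+'
scatter_plot = base_plot + geom_point()

scatter_plot
```

Because the plot is an ordinary R object, further layers can be added to `scatter_plot` in the same way.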

## A basic boxplot

```{r, echo=TRUE, out.width = "600px", fig.align="center"}
ToothGrowth %>%
ggplot(mapping = aes(x = dose, y = len, group = dose)) +
geom_boxplot()
```
Looks better, but it is difficult to make out the underlying distributions

## A basic histogram

```{r, echo=TRUE, out.width = "600px", fig.align="center"}
ToothGrowth %>%
ggplot(mapping = aes(x = len, fill = factor(dose))) +
geom_histogram(alpha = 0.5, binwidth = 3)
```

Not very informative. Let us look at a density plot instead

## A basic density plot

```{r, echo=TRUE, out.width = "600px", fig.align="center"}
ToothGrowth %>%
  ggplot(mapping = aes(x = len, fill = factor(dose))) +
  geom_density(alpha = 0.5)
```

It is a bit messy that the densities cover each other. Let us try a violin plot

## A basic violinplot

```{r, echo=TRUE, out.width = "600px", fig.align="center"}
ToothGrowth %>%
ggplot(mapping = aes(x = dose, y = len, group = dose)) +
geom_violin()
```

But wait, we have different types of supplements

## A grouped violinplot

```{r, echo=TRUE, out.width = "600px", fig.align="center"}
ToothGrowth %>%
ggplot(mapping = aes(x = factor(dose), y = len, fill = supp)) +
geom_violin()
```

It seems that there is not only an effect of dose, but also of the supplement type

## A grouped boxplot

```{r, echo=TRUE, out.width = "600px", fig.align="center"}
ToothGrowth %>%
ggplot(mapping = aes(x = factor(dose), y = len, fill = supp)) +
geom_boxplot()
```

## A scatter plot with groups and models

```{r, echo=TRUE, out.width = "600px", fig.align="center"}
ToothGrowth %>%
ggplot(mapping = aes(x = dose, y = len, colour = supp)) +
geom_point() +
geom_smooth(method = "lm")
```

## A scatter plot with facets, models and densities

```{r, echo=TRUE, out.width = "600px", fig.align="center"}
ToothGrowth %>%
ggplot(mapping = aes(x = dose, y = len)) +
geom_violin(mapping = aes(x = dose, y = len, group = dose)) +
geom_point() +
geom_smooth(method = "lm") +
facet_wrap(~supp, nrow = 1)
```

## In summary

We hope it is evident just how easy ggplot makes it to create publication-ready data visualisations and how readable the code is!

Now, it is time for exercises!
412 changes: 412 additions & 0 deletions 04_ggplot/lecture/ggplot_presentation.html

Large diffs are not rendered by default.
