forked from johanneswaage/TidyThursday
1 parent f3a9f72, commit 940b4ac
Showing 4 changed files with 724 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,76 @@ | ||
--- | ||
title: "Exercises: Visualising data (ggplot)" | ||
output: github_document | ||
--- | ||
|
||
```{r setup, include=FALSE} | ||
knitr::opts_chunk$set(echo = TRUE) | ||
``` | ||
|
||
Ok, so now it is your turn! | ||
|
||
_Remember, for the following exercises, inspiration for the code is available in the slides for this session_ | ||
|
||
Since we now know how to load data and are about to work on how we manipulate (wrangle) data, now would be a good time for you to make your first script (a small program). | ||
|
||
- In the upper left corner of RStudio, there is a small icon that looks like a piece of paper with a green plus on it. Click it and choose the first option, "R Script" (you can also use the keyboard shortcut; on a Mac it is command+shift+n). This opens a new, empty text file in the RStudio editor, where we can write code, save it and run it | ||
|
||
- A recipe for a script could look like the following. Copy/paste the code below into your new empty script file (Note, the '#' means the line is ignored, so we use this for commenting our code) | ||
|
||
```{r, message=FALSE} | ||
# Clear workspace | ||
# ------------------------------------------------------------------------------ | ||
rm(list=ls()) | ||
# Load libraries | ||
# ------------------------------------------------------------------------------ | ||
library('tidyverse') | ||
# Load data (session 2 - readr) | ||
# ------------------------------------------------------------------------------ | ||
# Wrangle data (session 3 - dplyr) | ||
# ------------------------------------------------------------------------------ | ||
# Visualise data (session 4 - ggplot) | ||
# ------------------------------------------------------------------------------ | ||
# Write data (session 2 - readr) | ||
# ------------------------------------------------------------------------------ | ||
``` | ||
|
||
- Click the save icon and save your new script as "my_script.R" | ||
|
||
- In RStudio, in the icon row just above your script file, there is another icon that looks like a piece of paper, with a blue arrow and the word "Source". Click "Source" to run every line in the script (lines beginning with '#' are ignored). For now, it will simply clear the workspace of any variables and load the tidyverse library every time you "Source" the script | ||
|
||
### Question 1 | ||
|
||
1. In the slides for this dplyr session, find where we load the kommune data directly from the web and write the command in the Console (Hint: All readr read-a-file functions start with 'read_') | ||
2. In the slides for the readr session, find where we write a data set as a tab-separated-values file, then save the kommune data to disk as 'kommune_data.tsv' by writing the command in the Console (Hint: All readr write-a-file functions start with 'write_') | ||
3. In your script under 'Load data', write the command for loading your data file 'kommune_data.tsv' from disk and save it in a variable called `my_data` (Hint: This is completely analogous to reading a file from the web) | ||
4. Source the script and in the Console, simply write `my_data` and hit return | ||
|
||
__Q1__ How many rows and columns are in the kommune data set? | ||
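
The write-then-read round trip in the steps above can be sketched on a small toy table. This is only an illustration: `toy_data`, its columns and the temporary file path are made-up stand-ins, not the kommune data, so the answer to Q1 still has to come from your own script.

```r
library(readr)
library(tibble)

# Hypothetical toy data standing in for the kommune data set
toy_data <- tibble(
  municipality = c("A", "B", "C"),
  income_tax   = c(120, 95, 150)
)

# Write the toy data to a tab-separated file in a temporary folder
toy_path <- file.path(tempdir(), "toy_data.tsv")
write_tsv(toy_data, toy_path)

# Read it back from disk; col_types = "cd" means character, then double
my_toy_data <- read_tsv(toy_path, col_types = "cd")

# Number of rows and columns of the data we just read
dim(my_toy_data)
```

Printing `my_toy_data` in the Console also shows the dimensions in the first line of the tibble output.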
|
||
### Question 2 | ||
|
||
1. Under 'Wrangle data' in your script, using the `mutate()` function write the command for calculating a new variable `inc_exp_ratio`, which is the ratio between `Indt_indkskat` and `Folkeskudg_elev`, i.e. `Indt_indkskat` divided by `Folkeskudg_elev` (Remember to save the result to your `my_data` variable) | ||
2. In the Console, write `my_data` and hit return | ||
|
||
__Q2__ What is the value of this new variable for 'Koebenhavn'? | ||
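
As a reminder of how `mutate()` adds a column computed from two existing ones, here is a minimal sketch on toy data. The table and the column names `income`, `expense` and `ratio` are hypothetical and only mirror the shape of the exercise.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data, not the kommune data set
toy_data <- tibble(
  municipality = c("A", "B", "C"),
  income       = c(100, 50, 30),
  expense      = c(50, 50, 60)
)

# Add a new column and save the result back to the same variable
toy_data <- toy_data %>%
  mutate(ratio = income / expense)

toy_data
```

Note that without the assignment (`toy_data <- ...`) the new column is only printed, not stored.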
|
||
### Question 3 | ||
|
||
1. Under 'Wrangle data' in your script, using the `filter()` function, write the command for identifying all the municipalities, where the value of your new variable is larger than 1 | ||
2. Under 'Wrangle data' in your script, using the `filter()` function, write the command for identifying all the municipalities where more than half have a long education and less than 1 in 5 pupils attend private school | ||
|
||
__Q3A__ How many municipalities have a ratio larger than 1? | ||
|
||
__Q3B__ In how many municipalities do more than half have a long education and less than 1 in 5 pupils attend private school? | ||
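
Combining two conditions in `filter()` can be sketched as follows. The toy table and its columns `share_long_edu` and `share_private` are assumptions made for illustration; the kommune data uses its own variable names and scales, so check those before writing your command.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data with shares given as fractions between 0 and 1
toy_data <- tibble(
  municipality   = c("A", "B", "C", "D"),
  share_long_edu = c(0.60, 0.40, 0.70, 0.55),
  share_private  = c(0.10, 0.10, 0.30, 0.15)
)

# Conditions separated by a comma must all be TRUE for a row to be kept
selected <- toy_data %>%
  filter(share_long_edu > 0.5, share_private < 0.2)

# Count the rows that passed the filter
nrow(selected)
```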
|
||
### Question 4 | ||
|
||
1. Under 'Wrangle data' in your script, using the `group_by()`, `summarise()` and `arrange()` functions, write the command for calculating the average % of students attending private school stratified on `Region` and sort the results by descending values (largest first, smallest last) | ||
|
||
__Q4__ What is the order of Regions? |
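
The group, summarise and sort pattern can be sketched on toy data. The names `region`, `pct_private` and `mean_pct_private` are hypothetical and chosen only for this example.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data, not the kommune data set
toy_data <- tibble(
  region      = c("North", "North", "South", "South", "East"),
  pct_private = c(10, 20, 30, 40, 5)
)

# One row per region with the mean, sorted with the largest value first
region_means <- toy_data %>%
  group_by(region) %>%
  summarise(mean_pct_private = mean(pct_private)) %>%
  arrange(desc(mean_pct_private))

region_means
```

`desc()` reverses the default ascending order of `arrange()`.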
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,236 @@ | ||
--- | ||
title: "Visualising data (ggplot)" | ||
subtitle: "CBio Thursday, session 4" | ||
author: "Leon Eyrich Jessen & Johannes Eichler Waage" | ||
date: "November 1st 2018" | ||
output: | ||
ioslides_presentation: | ||
widescreen: true | ||
smaller: true | ||
--- | ||
|
||
```{r setup, include=FALSE} | ||
knitr::opts_chunk$set(echo = FALSE) | ||
set.seed(978552) | ||
library('tidyverse') | ||
library('readxl') | ||
``` | ||
|
||
# Before we move into ggplot | ||
|
||
## Wide vs long data format | ||
|
||
```{r, echo=TRUE} | ||
wide_data = tibble(samples = str_c('smpl_', 1:5), var_1 = round(rnorm(5),2), | ||
var_2 = round(rnorm(5),2), var_3 = round(rnorm(5),2)) | ||
wide_data | ||
``` | ||
|
||
Recall, I told you that each row is an observation and each column is a variable? Well... | ||
|
||
## There is another Skywalker (Sorry) | ||
|
||
We can convert the wide format to what is known as long format like so: | ||
|
||
```{r, echo=TRUE} | ||
long_data = wide_data %>% gather(key = var, value = value, -samples) | ||
long_data | ||
``` | ||
|
||
## Long format | ||
|
||
This is particularly useful if you have longitudinal data, as it allows you to flatten the data from a cube with time slices to a matrix. Example: | ||
|
||
```{r, echo=TRUE} | ||
wide_data_t1 = tibble(samples = str_c('smpl_', 1:3), var_1 = round(rnorm(3),2), | ||
var_2 = round(rnorm(3),2)) | ||
wide_data_t2 = tibble(samples = str_c('smpl_', 1:3), var_1 = round(rnorm(3),2), | ||
var_2 = round(rnorm(3),2)) | ||
wide_data_t3 = tibble(samples = str_c('smpl_', 1:3), var_1 = round(rnorm(3),2), | ||
var_2 = round(rnorm(3),2)) | ||
``` | ||
|
||
Add a `time_point` variable | ||
|
||
```{r, echo=TRUE} | ||
wide_data_t1 = wide_data_t1 %>% mutate(time_point = 1) | ||
wide_data_t2 = wide_data_t2 %>% mutate(time_point = 2) | ||
wide_data_t3 = wide_data_t3 %>% mutate(time_point = 3) | ||
``` | ||
|
||
Bind the three data frames and convert to long format | ||
|
||
```{r, echo=TRUE} | ||
long_data_t = bind_rows(wide_data_t1, wide_data_t2, wide_data_t3) %>% | ||
gather(key = var, value = value, -samples, -time_point) | ||
``` | ||
|
||
## Long format | ||
|
||
```{r, echo=TRUE} | ||
long_data_t | ||
``` | ||
|
||
## Long format | ||
|
||
Depending on what you want to visualise, you may need these long-to-wide or wide-to-long conversions. | ||
|
||
So, let us just convert the `long_data_t` back to wide format, so you can see how that is done: | ||
|
||
```{r, echo=TRUE} | ||
long_data_t %>% spread(key = var, value = value) | ||
``` | ||
|
||
## Summary - Long and wide data | ||
|
||
- Convert from wide to long: `gather()` | ||
- Convert from long to wide: `spread()` | ||
|
||
See previous slides for examples. | ||
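
A note for newer installations: from tidyr 1.0.0 onwards, `gather()` and `spread()` are superseded by `pivot_longer()` and `pivot_wider()`. The older functions still work, so the slides run as written. A minimal sketch of the same two conversions with the newer functions, using a made-up table `wide_toy`:

```r
library(tidyr)
library(tibble)

# Hypothetical toy data in wide format
wide_toy <- tibble(
  samples = c("smpl_1", "smpl_2"),
  var_1   = c(0.1, 0.2),
  var_2   = c(1.1, 1.2)
)

# Wide to long: every column except 'samples' becomes a var/value pair
long_toy <- pivot_longer(wide_toy, cols = -samples,
                         names_to = "var", values_to = "value")

# Long to wide: spread the var/value pairs back into columns
back_to_wide <- pivot_wider(long_toy, names_from = var, values_from = value)

long_toy
back_to_wide
```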
|
||
Moving on to data visualisation... | ||
|
||
# Data Visualisation using ggplot | ||
|
||
## Cliché: A picture says more than 1,000 numbers | ||
|
||
Except | ||
|
||
## Cliché: A picture says more than 1,000 numbers | ||
|
||
Except pie charts... This, this is the only valid use of a pie chart I have seen: | ||
|
||
```{r, out.width = "600px", fig.align="center"} | ||
knitr::include_graphics("figures/pie_chart.png") | ||
``` | ||
|
||
## What is `ggplot`? | ||
|
||
- The 'gg' in `ggplot` stands for Grammar-of-Graphics. | ||
|
||
- "A grammar of graphics is a tool that enables us to concisely describe the components of a graphic" [Hadley Wickham. A layered grammar of graphics. Journal of Computational and Graphical Statistics, vol. 19, no. 1, pp. 3–28, 2010.](http://vita.had.co.nz/papers/layered-grammar.html) | ||
|
||
- So, a structured framework for building graphical representations of data | ||
|
||
- Let us start with some examples | ||
|
||
## Example data for data visualisation | ||
|
||
We will use the `ToothGrowth` data set "The Effect of Vitamin C on Tooth Growth in Guinea Pigs", 60 observations of 3 variables: | ||
|
||
- `len`, numeric, Tooth length | ||
- `supp`, factor, Supplement type (VC = ascorbic acid or OJ = Orange Juice) | ||
- `dose`, numeric, Dose in milligrams/day | ||
|
||
```{r, echo=TRUE} | ||
ToothGrowth %>% head(5) | ||
``` | ||
|
||
## Example data for data visualisation | ||
|
||
We can use the `count()` function to investigate how the 60 guinea pigs are distributed in the `supp` and `dose` groups: | ||
|
||
```{r, echo=TRUE} | ||
ToothGrowth %>% count(supp, dose) | ||
``` | ||
|
||
## A basic barchart | ||
|
||
We can visualise the counts using a simple barchart: | ||
|
||
```{r, echo=TRUE, out.width = "600px", fig.align="center"} | ||
ToothGrowth %>% count(dose, supp) %>% | ||
ggplot(aes(x = dose, y = n, fill = supp)) + | ||
geom_col(position = 'dodge') | ||
``` | ||
|
||
## A basic scatterplot | ||
|
||
```{r, echo=TRUE, out.width = "600px", fig.align="center"} | ||
ToothGrowth %>% | ||
ggplot(mapping = aes(x = dose, y = len)) + | ||
geom_point() | ||
``` | ||
|
||
It seems that dose has an effect (note how layers are added using `+`). Let us look at it in another way | ||
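
To make the layering idea concrete, a plot can be stored in a variable and extended one `+` at a time. This is a small sketch using the built-in `ToothGrowth` data; the variable names `p`, `p_points` and `p_labelled` are made up for illustration.

```r
library(ggplot2)

# Start with the data and the aesthetic mapping only: no layers yet
p <- ggplot(data = ToothGrowth, mapping = aes(x = dose, y = len))

# Adding a geom with '+' returns a new plot object with one layer
p_points <- p + geom_point()

# Further components, such as axis labels, are added the same way
p_labelled <- p_points + labs(x = "Dose (mg/day)", y = "Tooth length")

p_labelled
```

Printing `p` on its own draws empty axes, which shows that the points only arrive with the `geom_point()` layer.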
|
||
## A basic boxplot | ||
|
||
```{r, echo=TRUE, out.width = "600px", fig.align="center"} | ||
ToothGrowth %>% | ||
ggplot(mapping = aes(x = dose, y = len, group = dose)) + | ||
geom_boxplot() | ||
``` | ||
Looks better, but it is difficult to make out the underlying distributions | ||
|
||
## A basic histogram | ||
|
||
```{r, echo=TRUE, out.width = "600px", fig.align="center"} | ||
ToothGrowth %>% | ||
ggplot(mapping = aes(x = len, fill = factor(dose))) + | ||
geom_histogram(alpha = 0.5, binwidth = 3) | ||
``` | ||
|
||
Not very informative, let us look at a density plot instead | ||
|
||
## A basic density plot | ||
|
||
```{r, echo=TRUE, out.width = "600px", fig.align="center"} | ||
ToothGrowth %>% | ||
ggplot(mapping = aes(x = len, fill = factor(dose))) + | ||
geom_density(alpha = 0.5) | ||
``` | ||
|
||
It is a bit messy that the densities cover each other, so let us try a violin plot | ||
|
||
## A basic violinplot | ||
|
||
```{r, echo=TRUE, out.width = "600px", fig.align="center"} | ||
ToothGrowth %>% | ||
ggplot(mapping = aes(x = dose, y = len, group = dose)) + | ||
geom_violin() | ||
``` | ||
|
||
But wait, we have different types of supplements | ||
|
||
## A grouped violinplot | ||
|
||
```{r, echo=TRUE, out.width = "600px", fig.align="center"} | ||
ToothGrowth %>% | ||
ggplot(mapping = aes(x = factor(dose), y = len, fill = supp)) + | ||
geom_violin() | ||
``` | ||
|
||
It seems that there is not only an effect of dose, but also of supplement type | ||
|
||
## A grouped boxplot | ||
|
||
```{r, echo=TRUE, out.width = "600px", fig.align="center"} | ||
ToothGrowth %>% | ||
ggplot(mapping = aes(x = factor(dose), y = len, fill = supp)) + | ||
geom_boxplot() | ||
``` | ||
|
||
## A scatter plot with groups and models | ||
|
||
```{r, echo=TRUE, out.width = "600px", fig.align="center"} | ||
ToothGrowth %>% | ||
ggplot(mapping = aes(x = dose, y = len, colour = supp)) + | ||
geom_point() + | ||
geom_smooth(method = "lm") | ||
``` | ||
|
||
## A scatter plot with facets, models and densities | ||
|
||
```{r, echo=TRUE, out.width = "600px", fig.align="center"} | ||
ToothGrowth %>% | ||
ggplot(mapping = aes(x = dose, y = len)) + | ||
geom_violin(mapping = aes(x = dose, y = len, group = dose)) + | ||
geom_point() + | ||
geom_smooth(method = "lm") + | ||
facet_wrap(~supp, nrow = 1) | ||
``` | ||
|
||
## In summary | ||
|
||
We hope it is evident just how easy ggplot makes publication-ready data visualisations and how readable the code is! | ||
|
||
Now, it is time for exercises! |