lec16-data-viz.Rmd

---
title: "More on data visualizations and the Git workflow (plus project work)"
author: Lindsay Coome & Joel Östblom
---

## Lesson preamble

> ### Learning Objectives
>
> - Making better plots for outreach and communication.
> - How to save your plots.
> - How to bin and summarise in 2d using hexagons and squares.
> - How to make your graphs accessible.
> - Review Git and Github workflow

> 
> ### Lesson outline
>
> - How to visualize data (review) (5 min)
> - Saving your graphs (5 min)
> - Review of plot choice (5 min)
> - 2D histograms, square and hexagonal bins (10 min)
> - Interactive and quick graphing (10 min)
> - Graphing and accessability (5 min)
> - Review Git and Github workflow (20 min)
>
> ### Setup
>
> - Load `ggplot2` (`library("ggplot2")`).

---

## Creating good graphics, continued

### Tufte's guidelines

* Reduce non-data ink
* Enhance the data ink

Reduces the proportion of graphic’s ink devoted to the non-redundant display of
data-information,

Avoid "chartjunk" - extraneous visual elements that detract from message.

### Colour rules

* Large background colours should be quiet, muted to let brighter colors stand
  out
* To highlight some element of a figure, using a bright colour can be effective
* However, if brightly colouring this aspect of your figure serves no purpose,
  leave it greyscale/plain
* Here is a reference sheet with all the [colours and colour names in
  R](http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf)

### Summary

* Determine what type of data you have and how your data will be used
* Determine how many variable types are in your data
* Decide on a visual treatment for each of the variables 
* Focus on the data in your visual representation

## More with ggplot2

### Setting up

Load or install the required packages:

```{r}
library(ggplot2)
```

### Saving graphs

When using ggsave, R by default saves the last graph you plotted. You can also
use the GUI to export and save as an image or PDF.

```{r, eval=FALSE}
ggsave("filename.jpg")
ggsave("filename.jpg", width = 20, height = 20, units = "cm")
```

One quirk about RStudio is that if you don't specify the size of the graph,
ggsave will save the graph as the same size as it appears in your plots window.
This may or may not be what you want.

Try saving this graph:
```{r}
ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length, color = Species)) + 
    geom_point()
```

### Oversaturated graphs and plot choice

Last week we talked about how summary plots (especially bar plots) can sometimes
be misleading, and it is often most appropriate/ideal to show every individual
observation with a dot plot or the like, perhaps combined with summary markers
where appropriate. But, as we discussed last week, what if you have a gigantic
data set with a zillion observations? In large data sets, it is often the case
that plotting each individual observation would oversaturate the chart. 

Let's actually take a look at an oversaturated chart using one of the native
datasets, `diamonds`:

```{r}
ggplot(diamonds, aes(x = carat, y = price)) + 
  geom_point()
```

Because this is a dataset with 53940 observations and we are plotting it on two
dimensions, the resulting graph is incredibly oversaturated. Oversaturated
graphs make it *far more* difficult to glean information from the visualization.
We're going to get into a few last methods of dealing with this problem.

First, let's try making a 2D hexagonal heatmap (really a fancy histogram) with
our huge diamonds dataset. 

(It might seem like we're skipping ahead in terms of graph complexity. To learn
how to make a super simple histogram with simple count data, see
[here](https://ggplot2.tidyverse.org/reference/geom_histogram.html))

```{r}
g <- ggplot(diamonds, aes(x = carat, y = price))
g + 
  geom_hex()
```

What has this changed? Now we have a handy legend to the right of our graph
indicating the density of points across 2D space in our large dataset. We've
created our first heat map.

Wikipedia's definition of a heat map:

["A heat map (or heatmap) is a graphical representation of data where the
individual values contained in a matrix are represented as
colors."](https://en.wikipedia.org/wiki/Heat_map)

We've now added additional information to our graph and solved the saturation
problem we encountered in our first graph. If you want to change the bin size
(i.e. the size of the hexagons), you can do so as such:

```{r}
g <- ggplot(diamonds, aes(x = carat, y = price))
g + 
  geom_hex(bins = 90)
```

[Here](https://ggplot2.tidyverse.org/reference/geom_hex.html) are more resources
on hexagonal 2D heatmaps. 

If we want our heat map to be square, rather than hexagonal, we can use the
following geom:

```{r}
g <- ggplot(diamonds, aes(x = carat, y = price))
g + 
  geom_bin2d()
```

[Documentation for square/rectangular heat maps of 2D bin
counts](https://ggplot2.tidyverse.org/reference/geom_bin2d.html)

## Making your graphs accessible

### Choice of colors

Colour blindness is common in the population, and red-green colour blindness in
particular affects 8% of men and 0.5% of women. Guidelines for making your
visualizations more accessible to people affected by colour blindness, will in
many cases also improve the interpretability of your graphs for people who have
standard color vision. Here are a couple of examples:

Don't use jet rainbow-coloured heatmaps. Jet colourmaps are often the default
heatmap used in many visualization packages (you've probably seen them before). 

![](./image/heatmap.png)

Colour blind viewers are going to have a difficult time distinguishing the
meaning of this heat map if some of the colours blend together.

![](./image/colourblind.png)

The jet colormap should be avoided for other reasons, including that the sharp
transitions between colors introduces visual threshold levels that do not
represent the underlying continuous data. Another issue is luminance, or
brightness. For example, your eye is drawn to the yellow and cyan regions,
because the luminance is higher. This can have the unfortunate effect of
highlighting features in your data that don't actually exist, misleading your
viewers! It also means that your graph is not going to translate well to
greyscale in publication format.

More details about jet can be found in [this blog
post](https://jakevdp.github.io/blog/2014/10/16/how-bad-is-your-colormap/) and
[this series of
posts](https://mycarta.wordpress.com/2012/05/12/the-rainbow-is-dead-long-live-the-rainbow-part-1/).
In general, when presenting continuous data, a perceptually uniform colormap is
often the most suitable choice. This type of colormap ensures that equal steps
in data are perceived as equal steps in color space. The human brain perceives
changes in lightness as changes in the data much better than, for example,
changes in hue. Therefore, colormaps which have monotonically increasing
lightness through the colormap will be better interpreted by the viewer. More
details and examples of such colormaps are available in the [matplotlib
documentation](https://matplotlib.org/users/colormaps.html), and many of the core
design principles are outlined in [this entertaining
talk](https://www.youtube.com/watch?v=xAoljeRJ3lU). Most of these colormaps are
[available in
R](http://r4ds.had.co.nz/graphics-for-communication.html#fig:brewer). 

Another approach is to use both colours and symbols. 

```{r}
ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length, color = Species)) + 
  geom_point(aes(shape = factor(Species)), size = 3)
```

#### Challenge

Create any coloured figure you want and go to 
[this](https://www.color-blindness.com/coblis-color-blindness-simulator/)
website to upload it to see how it 
looks to a colour blind person 

For more resources,
[here](https://blog.usabilla.com/how-to-design-for-color-blindness/) is a great
usability article for designing for people with colour blindness.

### More resources on ggplot2

* [ggplot2 documentation](http://had.co.nz/ggplot2/)
* [Book by Hadley
  Wickham](https://www.amazon.com/ggplot2-Elegant-Graphics-Data-Analysis/dp/0387981403)
* [ggplot2 cheat
  sheet](https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf)
* [r graph gallery for inspiration (not just limited to ggplot2
  graphs)](https://www.r-graph-gallery.com/all-graphs/)

## Git and GitHub workflow review

First, let's create a chat room for your repository. We will use gitter, since
it has good integrations with GitHub. Go to https://gitter.im/, login with your
GitHub account and click the plus sign to create a new chat room. Browse to
your repository, and add the members of your team. Since your repository is
public, this chatroom will also be public (so you might not want to share
personal information, and keep your discussions on topic, just as if this was
a real-world work project).

Discussions in the chat room are good to crystallize what tasks need to be
tackled next. Once you have identified a specific task, open an issue about it
on GitHub to track the progress of that particular tasks. You should create
issues for your planned analyses, writing, literature review, etc. Then schedule
those into your day and close them when you are done with that task. It is
similar to a powerful task manager for your project. In general, you should
create issues in the main repository, not in your own fork of the repository.
Let's create a new issue based on our discussion, assign the person we think
should work on it, and then review the workflow for addressing the issue.

### Sample workflow for addressing an issue

Let's say we want to create and upload a new file to GitHub. After we have
created the issue for this task, we go about it as follows.

1. Create your own fork of the online Git repository via the GitHub interface.
2. Click the green `Clone or download` button to copy the Git URL.
3. In RStudio, open a new Git project.
    - `File -> Version Control -> Git`. Paste the URL you copied previously and
      select where to save the directory.
4. Click `Create Project`
    - You can now see the new directory if you navigate to it in your file
      navigator.
5. In RStudio, click `History` under the small Git button menu.
    - This is similar to typing `git log --oneline --graph` in terminal.
6. Before we add something to this repository, it is good practice to create a
   new branch instead of working on the default master branch. Branches allow
   you to work on features separately and can make it clearer what is what when
   merging your code with others.
    - This is the only step that *has to* to be done from the command line.
    - Open a terminal, navigate to the new directory just created via RStudio,
      and type `git checkout -b <branchname>` to create and switch to a new
      branch. You can also switch branches in RStudio via a dropdown menu.
        - You could do the navigation in your file browser and then right click
          to open a terminal. Right click in the file browser should be enabled
          by default with GitBash on Windows. For Mac's Finder, follow [these
          instructions](https://stackoverflow.com/a/7054045/2166823) (in brief
          `System Preferences > Keyboard > Shortcuts > Services`).
7. Create a new text file in RStudio and add some text.
8. Save the file locally in the Git directory that you created via RStudio.
    - These changes are not saved into your Git history until you explicitly
      tell Git to save them.
9. Before saving (committing) into the Git repository, click `Diff <filename>`
   in the RStudio Git menu, to see the differences between your current working
   directory and what you have saved in the Git repository since previously.
    - This would be the same as typing `git diff <filename>` in the terminal.
10. The differences are what we expect, so let's click `Stage` (either the
   checkbox or the button).
    - This would be the same as typing `git add <filename>` in terminal (adding
      the file to the "staging area".
11. Now we can commit our changes, which means saving them into our local Git
   repository. Add a commit message and click `Commit`.
    - This would be the same as typing `git commit <filename> -m 'Your commit
      message here'` in terminal.
    - The reason you are staging first and then committing, instead of just
      "saving" in one step, is that this workflow gives more power and more
      flexibility. For example, it allows you to stage partial changes from a
      file or changes from several files and then commit them together as one
      commit because they are addressing the same issue.
12. You can now see your commit in the `History` tab (might need to press
   refresh). You might also notice that your forked GitHub repository online
   (`origin`) is one commit behind your local Git repo. This is because we have
   not uploaded our changes yet.
    - Remember that `origin` is just an alias (nickname) on your computer to
      the URL of your online Git storage location (GitHub in this case), so
      that you don't have to type the full URL each time. You could have called
      this anything, but `origin` is the convention.
13. To upload the changes to your fork, press `Push Branch` in the Git menu.
    - This would be the same as typing `git push` in terminal.
    - Refresh your history to see that all the branches are now on the same
      commit.
14. Finally you want to suggest that the changes from our fork are included in
   the main repository. This is where your collaborators come back into the
   picture. You want to request that whoever has control over the main
   repository reviews your changes and pulls them into the main repository if
   they look good.
    - This is usually done via the GitHub.com interface. If you go to the main
      repository or your fork, you will see a link has popped up that asks you
      to perform a pull request (PR).
15. After the code has been pulled into the main repository (merged) you can
   close the issue!
    - You could also have closed the issue via the PR directly, by adding
      `'close #<issue_number>'` to the commit message (in this case `close #1`).

Done!

One additional note to save you some potential headaches (also called merge
conflicts...). If you are not working from a Git repository that you just
cloned, you should make sure that the master branch of your local fork of the
repository, is up to date with the latest commit on the master branch of the
main repository, before you create a new branch. Put another way: if someone
else has made changes to the main repository, you want those in your fork before
you start working on new features.

If you type `git remote -v` inside your repository, you will see that it
currently only has one remote (online storage location) specified, which is your
fork with the alias `origin`. To add the main repository, go to its GitHub page
and click the green button `Clone or download`, copy the URL, and pasted into
this command `git remote add upstream <pasted_url>`. The name `upstream` is a
convention, and it will be the alias to this URL. Now you can sync your local
master branch with the upstream master branch: `git pull upstream/master
master`. These changes (and any other changes you add) will be uploaded to your
fork's GitHub location next time you push.