---
title: "More on data visualizations and the Git workflow (plus project work)"
author: Lindsay Coome & Joel Östblom
---
## Lesson preamble
> ### Learning Objectives
>
> - Make better plots for outreach and communication.
> - Save your plots.
> - Bin and summarise in 2D using hexagons and squares.
> - Make your graphs accessible.
> - Review the Git and GitHub workflow.
>
> ### Lesson outline
>
> - How to visualize data (review) (5 min)
> - Saving your graphs (5 min)
> - Review of plot choice (5 min)
> - 2D histograms, square and hexagonal bins (10 min)
> - Interactive and quick graphing (10 min)
> - Graphing and accessibility (5 min)
> - Review Git and GitHub workflow (20 min)
>
> ### Setup
>
> - Load `ggplot2` (`library("ggplot2")`).
---
## Creating good graphics, continued
### Tufte's guidelines
* Reduce non-data ink
* Enhance the data ink

Both guidelines aim to maximize the data-ink ratio: the proportion of a
graphic's ink devoted to the non-redundant display of data-information.
Avoid "chartjunk": extraneous visual elements that detract from the message.
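
As a rough illustration of reducing non-data ink in ggplot2 (a sketch, not a
prescribed recipe), the chunk below swaps the default grey theme for
`theme_classic()`, which drops the panel background and gridlines. The object
names `p_default` and `p_clean` are only illustrative.

```{r}
library(ggplot2)

# Default theme: grey panel background and white gridlines (non-data ink).
p_default <- ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length)) +
    geom_point()

# theme_classic() removes the panel background and gridlines, keeping axis lines.
p_clean <- p_default + theme_classic()

p_clean
```

Whether gridlines count as chartjunk depends on the plot: if readers need to
look up exact values, a light grid may be worth keeping.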
### Colour rules
* Large background areas should use quiet, muted colours so that brighter
colours stand out
* Using a bright colour can be an effective way to highlight one element of a
figure
* If brightly colouring an element serves no purpose, leave it greyscale/plain
* Here is a reference sheet with all the [colours and colour names in
R](http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf)
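
One possible way to apply these rules in ggplot2 is to map most groups to a
muted grey and reserve a single bright colour for the group you want to
emphasise. The choice of `virginica` as the highlighted species and the names
`highlight_colours` and `p_highlight` are assumptions made for this sketch.

```{r}
library(ggplot2)

# Muted grey for two species, one bright colour for the group to emphasise.
highlight_colours <- c(
    setosa = "grey70",
    versicolor = "grey70",
    virginica = "firebrick"
)

p_highlight <- ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length, colour = Species)) +
    geom_point() +
    scale_colour_manual(values = highlight_colours)

p_highlight
```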
### Summary
* Determine what type of data you have and how your data will be used
* Determine how many variable types are in your data
* Decide on a visual treatment for each of the variables
* Focus on the data in your visual representation
## More with ggplot2
### Setting up
Load or install the required packages:
```{r}
library(ggplot2)
```
### Saving graphs
By default, `ggsave()` saves the last graph you plotted. You can also use the
RStudio GUI to export and save a plot as an image or PDF.
```{r, eval=FALSE}
ggsave("filename.jpg")
ggsave("filename.jpg", width = 20, height = 20, units = "cm")
```
One quirk is that if you don't specify the size of the graph, `ggsave()` uses
the size of the current graphics device, which in RStudio means the graph is
saved at the size it appears in your plots window. This may or may not be what
you want.
Try saving this graph:
```{r}
ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length, color = Species)) +
    geom_point()
```
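
If you would rather not depend on "the last plot" or on the size of the plots
window, one option is to store the plot in an object and pass it to `ggsave()`
explicitly together with a size and resolution. The sketch below writes to a
temporary file only so that it is self-contained; the names `p` and `out_file`
and the chosen dimensions are illustrative.

```{r}
library(ggplot2)

# Store the plot in an object so it can be saved explicitly.
p <- ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length, color = Species)) +
    geom_point()

# tempfile() keeps the example self-contained; in practice you would
# give a path inside your project instead.
out_file <- tempfile(fileext = ".png")

# Explicit plot, size, and resolution, so the result does not depend
# on the plots window.
ggsave(out_file, plot = p, width = 15, height = 10, units = "cm", dpi = 300)
```

The file extension decides the output format, so changing `.png` to `.pdf`
would produce a PDF instead.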
### Oversaturated graphs and plot choice
Last week we talked about how summary plots (especially bar plots) can be
misleading, and that it is often best to show every individual observation with
a dot plot or the like, perhaps combined with summary markers. But, as we
discussed last week, what if you have a gigantic data set with a zillion
observations? In large data sets, plotting each individual observation often
oversaturates the chart (a problem also known as overplotting).
Let's take a look at an oversaturated chart using `diamonds`, a dataset that
ships with ggplot2:
```{r}
ggplot(diamonds, aes(x = carat, y = price)) +
    geom_point()
```
Because this dataset has 53,940 observations and we are plotting all of them in
two dimensions, the resulting graph is heavily oversaturated: points pile on
top of each other, which makes it *far more* difficult to glean information
from the visualization. We're going to get into a few last methods of dealing
with this problem.
First, let's try making a 2D hexagonal heatmap (really a fancy histogram) with
our huge diamonds dataset. Note that `geom_hex()` relies on the `hexbin`
package, so install it first if needed (`install.packages("hexbin")`).
(It might seem like we're skipping ahead in terms of graph complexity. To learn
how to make a simple histogram with count data, see
[here](https://ggplot2.tidyverse.org/reference/geom_histogram.html).)
```{r}
g <- ggplot(diamonds, aes(x = carat, y = price))
g +
    geom_hex()
```
What has this changed? Now we have a handy legend to the right of our graph
indicating the number of observations in each hexagon, so we can read the
density of points across 2D space in our large dataset. We've created our first
heat map.
Wikipedia's definition of a heat map:
["A heat map (or heatmap) is a graphical representation of data where the
individual values contained in a matrix are represented as
colors."](https://en.wikipedia.org/wiki/Heat_map)
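
To make the "values contained in a matrix" part of that definition concrete,
here is a minimal base R sketch that counts observations on a coarse 5-by-5
rectangular grid. The choice of 5 intervals per axis and the names
`carat_bin`, `price_bin`, and `bin_counts` are arbitrary, and ggplot2's own
binning (hexagonal in the case of `geom_hex()`) differs in detail.

```{r}
library(ggplot2)  # provides the diamonds dataset

# Cut each variable into 5 equal-width intervals.
carat_bin <- cut(diamonds$carat, breaks = 5)
price_bin <- cut(diamonds$price, breaks = 5)

# Count the observations that fall into each carat-by-price cell.
bin_counts <- table(carat_bin, price_bin)
bin_counts
```

A heat map colours each cell of a matrix like this one according to its count.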
We've now added additional information to our graph and solved the saturation
problem we encountered in our first graph. If you want to change the size of the
hexagons, set the number of bins in each direction (the default is 30); more
bins mean smaller hexagons:
```{r}
g <- ggplot(diamonds, aes(x = carat, y = price))
g +
    geom_hex(bins = 90)
```
[Here](https://ggplot2.tidyverse.org/reference/geom_hex.html) are more resources
on hexagonal 2D heatmaps.
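
Instead of a number of bins, `geom_hex()` also accepts a `binwidth` given in
data units, which can be easier to reason about. The widths used below (0.1
carat and 500 in price) are arbitrary values picked for illustration, and
`p_hex` is just a placeholder name.

```{r}
library(ggplot2)

# binwidth takes one width per axis, in data units:
# each hexagon spans 0.1 carat horizontally and 500 in price vertically.
p_hex <- ggplot(diamonds, aes(x = carat, y = price)) +
    geom_hex(binwidth = c(0.1, 500))

p_hex
```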
If we want our heat map to be square, rather than hexagonal, we can use the
following geom:
```{r}
g <- ggplot(diamonds, aes(x = carat, y = price))
g +
    geom_bin2d()
```
[Documentation for square/rectangular heat maps of 2D bin
counts](https://ggplot2.tidyverse.org/reference/geom_bin2d.html)
## Making your graphs accessible
### Choice of colors
Colour blindness is common in the population; red-green colour blindness in
particular affects about 8% of men and 0.5% of women. Guidelines for making your
visualizations more accessible to people affected by colour blindness will in
many cases also improve the interpretability of your graphs for people who have
standard colour vision. Here are a couple of examples:
Don't use jet (rainbow-coloured) heatmaps. Jet is often the default colourmap
in visualization packages (you've probably seen it before).
![](./image/heatmap.png)
Colour blind viewers are going to have a difficult time distinguishing the
meaning of this heat map if some of the colours blend together.
![](./image/colourblind.png)
The jet colormap should be avoided for other reasons too. Its sharp
transitions between colors introduce visual threshold levels that do not
represent the underlying continuous data. Another issue is luminance, or
brightness: your eye is drawn to the yellow and cyan regions because their
luminance is higher. This can have the unfortunate effect of highlighting
features in your data that don't actually exist, misleading your viewers! It
also means that your graph is not going to translate well to greyscale in
publication format.
More details about jet can be found in [this blog
post](https://jakevdp.github.io/blog/2014/10/16/how-bad-is-your-colormap/) and
[this series of
posts](https://mycarta.wordpress.com/2012/05/12/the-rainbow-is-dead-long-live-the-rainbow-part-1/).
In general, when presenting continuous data, a perceptually uniform colormap is
often the most suitable choice. This type of colormap ensures that equal steps
in data are perceived as equal steps in color space. The human brain perceives
changes in lightness as changes in the data much better than, for example,
changes in hue. Therefore, colormaps which have monotonically increasing
lightness through the colormap will be better interpreted by the viewer. More
details and examples of such colormaps are available in the [matplotlib
documentation](https://matplotlib.org/users/colormaps.html), and many of the core
design principles are outlined in [this entertaining
talk](https://www.youtube.com/watch?v=xAoljeRJ3lU). Most of these colormaps are
[available in
R](http://r4ds.had.co.nz/graphics-for-communication.html#fig:brewer).
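
As one example of a perceptually uniform colormap in R, recent versions of
ggplot2 (3.0.0 and later) ship the viridis scales. The sketch below applies
`scale_fill_viridis_c()` to the square-binned diamonds plot; using viridis here
is our choice for illustration and `p_viridis` is a placeholder name.

```{r}
library(ggplot2)

# scale_fill_viridis_c() applies the continuous viridis colourmap
# to the bin counts.
p_viridis <- ggplot(diamonds, aes(x = carat, y = price)) +
    geom_bin2d() +
    scale_fill_viridis_c()

p_viridis
```

Because lightness changes steadily along this colormap, the plot should also
remain readable when printed in greyscale.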
Another approach is to use both colours and symbols.
```{r}
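# Species is already a factor; mapping it to both colour and shape in the
# same aes() gives a single combined legend.
ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length,
                 color = Species, shape = Species)) +
    geom_point(size = 3)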
```
#### Challenge
Create any coloured figure you want and upload it to
[this colour blindness simulator](https://www.color-blindness.com/coblis-color-blindness-simulator/)
to see how it looks to a colour blind person.
For more resources,
[here](https://blog.usabilla.com/how-to-design-for-color-blindness/) is a great
usability article on designing for people with colour blindness.
### More resources on ggplot2
* [ggplot2 documentation](http://had.co.nz/ggplot2/)
* [Book by Hadley
Wickham](https://www.amazon.com/ggplot2-Elegant-Graphics-Data-Analysis/dp/0387981403)
* [ggplot2 cheat
sheet](https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf)
* [r graph gallery for inspiration (not just limited to ggplot2
graphs)](https://www.r-graph-gallery.com/all-graphs/)
## Git and GitHub workflow review
First, let's create a chat room for your repository. We will use Gitter, since
it integrates well with GitHub. Go to https://gitter.im/, log in with your
GitHub account and click the plus sign to create a new chat room. Browse to
your repository, and add the members of your team. Since your repository is
public, this chat room will also be public (so you might not want to share
personal information, and you should keep your discussions on topic, just as if
this was a real-world work project).
Discussions in the chat room are a good way to crystallize what tasks need to
be tackled next. Once you have identified a specific task, open an issue about
it on GitHub to track the progress of that particular task. You should create
issues for your planned analyses, writing, literature review, etc. Then schedule
those into your day and close each one when you are done with that task. Used
this way, issues work like a powerful task manager for your project. In
general, you should create issues in the main repository, not in your own fork
of the repository.
Let's create a new issue based on our discussion, assign the person we think
should work on it, and then review the workflow for addressing the issue.
### Sample workflow for addressing an issue
Let's say we want to create and upload a new file to GitHub. After we have
created the issue for this task, we go about it as follows.
1. Create your own fork of the online Git repository via the GitHub interface.
2. Click the green `Clone or download` button to copy the Git URL.
3. In RStudio, open a new Git project.
- `File -> Version Control -> Git`. Paste the URL you copied previously and
select where to save the directory.
4. Click `Create Project`
- You can now see the new directory if you navigate to it in your file
navigator.
5. In RStudio, click `History` under the small Git button menu.
- This is similar to typing `git log --oneline --graph` in terminal.
6. Before we add something to this repository, it is good practice to create a
new branch instead of working on the default master branch. Branches allow
you to work on features separately and can make it clearer what is what when
merging your code with others.
- This is the only step that *has to* be done from the command line.
- Open a terminal, navigate to the new directory just created via RStudio,
and type `git checkout -b <branchname>` to create and switch to a new
branch. You can also switch branches in RStudio via a dropdown menu.
- You could do the navigation in your file browser and then right click
to open a terminal. Right click in the file browser should be enabled
by default with GitBash on Windows. For Mac's Finder, follow [these
instructions](https://stackoverflow.com/a/7054045/2166823) (in brief
`System Preferences > Keyboard > Shortcuts > Services`).
7. Create a new text file in RStudio and add some text.
8. Save the file locally in the Git directory that you created via RStudio.
- These changes are not saved into your Git history until you explicitly
tell Git to save them.
9. Before saving (committing) into the Git repository, click `Diff <filename>`
in the RStudio Git menu to see the differences between your current working
directory and the last commit saved in the Git repository.
- This would be the same as typing `git diff <filename>` in the terminal.
10. The differences are what we expect, so let's click `Stage` (either the
checkbox or the button).
- This would be the same as typing `git add <filename>` in terminal (adding
the file to the "staging area").
11. Now we can commit our changes, which means saving them into our local Git
repository. Add a commit message and click `Commit`.
- This would be the same as typing
`git commit -m 'Your commit message here'` in terminal, which commits
everything that is currently staged.
- The reason you are staging first and then committing, instead of just
"saving" in one step, is that this workflow gives more power and more
flexibility. For example, it allows you to stage partial changes from a
file or changes from several files and then commit them together as one
commit because they are addressing the same issue.
12. You can now see your commit in the `History` tab (might need to press
refresh). You might also notice that your forked GitHub repository online
(`origin`) is one commit behind your local Git repo. This is because we have
not uploaded our changes yet.
- Remember that `origin` is just an alias (nickname) on your computer to
the URL of your online Git storage location (GitHub in this case), so
that you don't have to type the full URL each time. You could have called
this anything, but `origin` is the convention.
13. To upload the changes to your fork, press `Push Branch` in the Git menu.
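- This would be the same as typing `git push` in terminal (the first push of a
new branch needs `git push -u origin <branchname>` so that Git knows where to
upload it).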
- Refresh your history to see that all the branches are now on the same
commit.
14. Finally, you want to suggest that the changes from your fork are included in
the main repository. This is where your collaborators come back into the
picture. You want to request that whoever has control over the main
repository reviews your changes and pulls them into the main repository if
they look good.
- This is usually done via the GitHub.com interface. If you go to the main
repository or your fork, you will see a link has popped up that asks you
to perform a pull request (PR).
15. After the code has been pulled into the main repository (merged) you can
close the issue!
- You could also have closed the issue via the PR directly, by adding
`close #<issue_number>` to the PR description or a commit message (in this case
`close #1`).
Done!
One additional note to save you some potential headaches (also called merge
conflicts...). If you are not working from a Git repository that you just
cloned, you should make sure that the master branch of your local copy of your
fork is up to date with the latest commit on the master branch of the main
repository before you create a new branch. Put another way: if someone else has
made changes to the main repository, you want those in your fork before you
start working on new features.
If you type `git remote -v` inside your repository, you will see that it
currently only has one remote (online storage location) specified, which is your
fork with the alias `origin`. To add the main repository, go to its GitHub page
and click the green button `Clone or download`, copy the URL, and paste it into
this command: `git remote add upstream <pasted_url>`. The name `upstream` is a
convention, and it will be the alias to this URL. Now you can sync your local
master branch with the upstream master branch: switch to it with
`git checkout master` and then run `git pull upstream master`. These changes
(and any other changes you add) will be uploaded to your fork's GitHub location
next time you push.