# Creating Good Charts {#sec-good-graphics}
## {{< fa bullseye >}} Objectives
- Understand what features make graphics effective
- Evaluate existing charts for accessibility and readability
- Make improvements to charts to increase comprehension and accessibility
A chart is good if it allows the user to draw useful conclusions that are supported by data. Obviously, this definition depends on the purpose of the chart - a simple EDA chart is going to have a different purpose than a chart showing e.g. the predicted path of a hurricane, which people will use to make decisions about whether or not to evacuate.
Unfortunately, while our visual system is *amazing*, it is not always as accurate as the computers we use to render graphics. We have physical limits on the number of colors we can perceive, on our short-term memory and attention, and on our ability to accurately read information off of charts in different forms.
## Perceptual and Cognitive Factors
### Preattentive Features
You've almost certainly noticed that some graphical tasks are easier than others.
Part of the reason for this is that certain tasks require active engagement and attention to search through the visual stimulus; others, however, just "pop" out of the background.
We call these features that just "pop" without active work **preattentive** features; technically, they are detected within the first 250ms of viewing a stimulus [@treisman_preattentive_1985].
Take a look at @fig-preattentive1; can you spot the point that is different?
```{r preattentive1}
#| echo: false
#| fig-width: 4
#| fig-height: 4
#| label: fig-preattentive1
#| fig-cap: "Two scatterplots with one point that is different. Can you easily spot the different point?"
#| fig-subcap:
#| - "Shape"
#| - "Color"
#| layout-ncol: 2
set.seed(153253)
data <- data.frame(expand.grid(x=1:6, y=1:6), color=sample(c(1,2), 36, replace=TRUE))
data$x <- data$x+rnorm(36, 0, .25)
data$y <- data$y+rnorm(36, 0, .25)
suppressWarnings(library(ggplot2))
new_theme_empty <- theme_bw()
new_theme_empty$line <- element_blank()
# new_theme_empty$rect <- element_blank()
new_theme_empty$strip.text <- element_blank()
new_theme_empty$axis.text <- element_blank()
new_theme_empty$plot.title <- element_blank()
new_theme_empty$axis.title <- element_blank()
# new_theme_empty$plot.margin <- structure(c(0, 0, -1, -1), unit = "lines", valid.unit = 3L, class = "unit")
data$shape <- c(rep(2, 15), 1, rep(2,20))
library(scales)
ggplot(data=data, aes(x=x, y=y, color=factor(1, levels=c(1,2)), shape=factor(shape), size=I(5))) +
geom_point() +
scale_shape_manual(guide="none", values=c(19, 15)) +
scale_color_discrete(guide="none") +
new_theme_empty
data$shape <- c(rep(2, 25), 1, rep(2,10))
ggplot(data=data, aes(x=x, y=y, color=factor(shape), shape=I(19), size=I(5)))+
geom_point() +
scale_color_discrete(guide="none") +
new_theme_empty
```
Color and shape are commonly used graphical features that are processed pre-attentively.
Some people suggest utilizing this to pack more dimensions into multivariate visualizations [@healey_high-speed_1996], but in general, knowing which features are processed more quickly (color/shape) and which are processed more slowly (combinations of preattentively processed features) allows you to design a chart that requires less cognitive effort to read.
As awesome as it is to be able to use preattentive features to process information, we should not use combinations of preattentive features to show different variables. Take a look at @fig-preattentive2 - part (a) shows the same grouping in color and shape, part (b) shows color and shape used to encode different variables.
```{r preattentive2,echo=FALSE,include=T, fig.width=4, fig.height=4, out.width = "49.5%"}
#| echo: false
#| fig-width: 4
#| fig-height: 4
#| label: fig-preattentive2
#| fig-cap: "Two scatterplots. Can you easily spot the different point(s)?"
#| fig-subcap:
#| - "Shape and Color (dual encoded)"
#| - "Shape and Color (different variables)"
#| layout-ncol: 2
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(dplyr))
set.seed(1532534)
data <- data.frame(expand.grid(x=1:6, y=1:6), color=sample(c(1,2), 36, replace=TRUE))
data$x <- data$x+rnorm(36, 0, .25)
data$y <- data$y+rnorm(36, 0, .25)
suppressWarnings(library(ggplot2))
new_theme_empty <- theme_bw()
new_theme_empty$line <- element_blank()
# new_theme_empty$rect <- element_blank()
new_theme_empty$strip.text <- element_blank()
new_theme_empty$axis.text <- element_blank()
new_theme_empty$plot.title <- element_blank()
new_theme_empty$axis.title <- element_blank()
# new_theme_empty$plot.margin <- structure(c(0, 0, -1, -1), unit = "lines", valid.unit = 3L, class = "unit")
data$shape <- data$color
ggplot(data=data, aes(x=x, y=y,
color=factor(color),
shape=factor(shape))) +
geom_point(size = 5) +
scale_shape_manual(guide="none", values=c(19, 15)) +
scale_color_discrete(guide="none") +
new_theme_empty
data <- data %>%
mutate(sample = row_number() %in% sample(1:n(), 4)) %>%
mutate(shape = if_else(sample, if_else(shape == 1, 2, 1), shape))
ggplot(data=data, aes(x=x, y=y,
color=factor(color),
shape=factor(shape))) +
geom_point(size = 5) +
scale_shape_manual(guide="none", values=c(19, 15)) +
scale_color_discrete(guide="none") +
new_theme_empty
# qplot(data=data, x=x, y=y, color=factor(color), shape=factor(shape), size=I(5))+scale_shape_manual(guide="none", values=c(19, 15)) + scale_color_discrete(guide="none") + new_theme_empty
```
Here, it is easy to differentiate the points in @fig-preattentive2(a), because they are dual-encoded. However, it is very difficult to pick out the different groups of points in @fig-preattentive2(b) because the combination of preattentive features requires active attention to sort out.
#### Takeaways {-}
Careful use of preattentive features can reduce the cognitive effort required for viewers to perceive a chart.
Encode only one variable using preattentive features, as combinations of preattentive features are not processed preattentively.
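As a rough sketch of what dual encoding looks like in code, the snippet below gives each group both a color and a marker shape, so the two cues always agree. The helper name `dual_encode`, the group labels, and the specific colors and markers are illustrative assumptions rather than part of this chapter's code; the hex values come from the Okabe-Ito colorblind-friendly palette and the marker codes follow matplotlib's conventions.

```python
# Hypothetical helper: give every group a color AND a marker so both cues agree.
OKABE_ITO = ["#E69F00", "#56B4E9", "#009E73", "#D55E00"]  # subset of the Okabe-Ito palette
MARKERS = ["o", "s", "^", "D"]  # matplotlib codes: circle, square, triangle, diamond


def dual_encode(groups):
    """Map each distinct group to one (color, marker) pair, in order of first appearance."""
    levels = list(dict.fromkeys(groups))  # unique values, original order preserved
    if len(levels) > len(MARKERS):
        raise ValueError("too many groups to dual-encode with this small palette")
    return {
        level: {"color": OKABE_ITO[i], "marker": MARKERS[i]}
        for i, level in enumerate(levels)
    }


encoding = dual_encode(["a", "b", "a", "b", "b"])
for level, style in encoding.items():
    print(level, style)
```

Each entry can be handed to a plotting call so that color and shape change together. Mapping color to one variable and shape to a different one would instead recreate the hard-to-read panel (b).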
### Color
Our eyes are optimized for perceiving the yellow/green region of the color spectrum, as shown in @fig-sensitivity. Why? Well, our sun produces yellow light, and plants tend to be green. It's pretty important to be able to distinguish different shades of green (evolutionarily speaking) because it impacts your ability to feed yourself. There aren't that many purple or blue predators, so there is less selection pressure to improve perception of that part of the visual spectrum.
)](../images/wrangling/Eyesensitivity.png){#fig-sensitivity}
Not everyone perceives color in the same way. Some individuals are [colorblind or color deficient](https://en.wikipedia.org/wiki/Color_blindness) [@wikipediacontributors23c]. We have three types of cones used for color detection, as well as cells called rods, which detect light intensity (brightness/darkness). In about 5% of the population (10% of XY individuals, \<1% of XX individuals), one or more types of cone may be missing or malformed, leading to color blindness - a reduced ability to perceive different shades. The rods, however, function normally in almost all of the population, which means that light/dark contrasts are extremely safe, while contrasts based on the hue of the color are problematic in some instances.
::: {.callout-note collapse=true}
#### Colorblindness Testing
You can take a test designed to screen for colorblindness [here](https://eyeque.com/color-blind-test/).
Your monitor may affect how you score on these tests - I am colorblind, but on some monitors, I can pass the test, and on some, I perform worse than normal. A different test is available [here](https://www.color-blindness.com/farnsworth-munsell-100-hue-color-vision-test/).
 
 In reality, I know that I have issues with perceiving some shades of red, green, and brown. I have particular trouble with very dark or very light colors, especially when they are close to grey or brown.
:::
In addition to colorblindness, factors other than the actual color value, such as context, affect how we experience color.
```{r}
#| label: fig-checker-shadow
#| echo: false
#| layout: "[[70, 30]]"
#| fig-cap: "The color constancy illusion. The squares marked A and B are actually the same color."
#| fig-subcap:
#| - "The original illusion"
#| - "The illusion with the checkerboard and shadow removed"
#| out-width: "100%"
knitr::include_graphics("../images/wrangling/CheckerShadow.png")
knitr::include_graphics("../images/wrangling/CheckerShadow2.png")
```
Our brains are extremely dependent on context and make excellent use of the large amounts of experience we have with the real world.
As a result, we implicitly "remove" the effect of things like shadows as we make sense of the input to the visual system.
This can result in odd things, like the checkerboard and shadow shown in @fig-checker-shadow - because we're correcting for the shadow, B looks lighter than A even though when the context is removed they are clearly the same shade.
#### Takeaways
- Do not use rainbow color gradient schemes
    - because of the unequal perception of different wavelengths, these schemes are *misleading* - the color distance does not match the perceptual distance.
- Avoid any scheme that uses green-yellow-red signaling if you have a target audience that may include colorblind people.
- To "colorblind-proof" a graphic, you can use a couple of strategies:
    - double encoding - where you use color, use another aesthetic (line type, shape) as well to help your colorblind readers out
    - If you can print your chart out in black and white and still read it, it will be safe for colorblind users. This is the only foolproof way to do it!
- If you are using a color gradient, use a monochromatic color scheme where possible. This is perceived as light -\> dark by colorblind people, so it will be correctly perceived no matter what color you use.
- If you have a bidirectional scale (e.g. showing positive and negative values), the safest scheme to use is purple - white - orange. In any color scale that is multi-hue, it is important to transition through white, instead of from one color to another directly.
- Be conscious of what certain colors "mean"
    - Leveraging common associations can make it easier to read a color scale and remember what it stands for
        - blue for cold, orange/red for hot is a natural scale
        - red = Republican and blue = Democrat in the US (since ~2000)
        - white -\> blue gradients for showing rainfall totals
    - Some colors can provoke emotional responses that may not be desirable.[^10-graphics-2]
    - Consider the social baggage that certain color schemes may have
        - pink/blue color scheme often used for gender may be polarizing
        - Consider using a cooler color (blue or purple) for men and a warmer color (yellow, orange, lighter green) for women[^10-graphics-3].
- There are packages such as `RColorBrewer` and `dichromat` that have color palettes which are aesthetically pleasing, and, in many cases, colorblind friendly (`dichromat` is better for that than `RColorBrewer`). You can also take a look at other [ways to find nice color palettes](https://lisacharlotterost.de/2016/04/22/Colors-for-DataVis/).
[^10-graphics-2]: When the COVID-19 outbreak started, many maps were using white-to-red gradients to show case counts and/or deaths. [The emotional association between red and blood, danger, and death may have caused people to become more frightened than what was reasonable given the available information.](https://www.esri.com/arcgis-blog/products/product/mapping/mapping-coronavirus-responsibly/)
[^10-graphics-3]: Lisa Charlotte Rost. [What to consider when choosing colors for data visualization.](https://www.dataquest.io/blog/what-to-consider-when-choosing-colors-for-data-visualization/)
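The "print it in black and white" check can be roughly approximated in code by comparing how light or dark each color is. The sketch below is only an approximation of what a printer or a colorblind reader would see: it uses the standard sRGB relative-luminance weights, and the function names and both example palettes are my own assumptions for illustration.

```python
def relative_luminance(hex_color):
    """Approximate relative luminance (0 = black, 1 = white) of an sRGB hex color."""
    hex_color = hex_color.lstrip("#")
    channels = [int(hex_color[i:i + 2], 16) / 255 for i in (0, 2, 4)]
    # Undo the sRGB gamma curve before weighting the channels.
    linear = [c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4
              for c in channels]
    r, g, b = linear
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def min_luminance_gap(palette):
    """Smallest luminance difference between neighbors once colors are sorted by brightness."""
    lum = sorted(relative_luminance(c) for c in palette)
    return min(b - a for a, b in zip(lum, lum[1:]))


# Example palettes (assumptions chosen for illustration).
hue_based = ["#FF0000", "#008B00", "#0000FF"]      # red, dark green, blue
light_to_dark = ["#DEEBF7", "#9ECAE1", "#3182BD"]  # single-hue blues, light to dark

for name, palette in [("hue-based", hue_based), ("light-to-dark", light_to_dark)]:
    print(name,
          [round(relative_luminance(c), 2) for c in palette],
          "smallest gap:", round(min_luminance_gap(palette), 2))
```

In the hue-based palette, the red and the green come out with nearly the same luminance, so they would be hard to tell apart in grayscale, while the single-hue palette stays well separated. A small gap is a warning sign, not a substitute for actually printing the chart.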
### Short Term Memory
We have a limited amount of memory that we can instantaneously utilize. This mental space, called **short-term memory**, holds information for active use, but only for a limited amount of time.
::: callout-tip
#### Try it out! {.unnumbered}
<details>
<summary>Click here, read the information, and then click to hide it.</summary>
1 4 2 2 3 9 8 0 7 8
</details>
<details>
<summary>Wait a few seconds, then expand this section</summary>
What was the third number?
</details>
:::
Without rehearsing the information (repeating it over and over to yourself), the task above may have been challenging. Short-term memory has a capacity of between 3 and 9 "chunks" of information.
In charts and graphs, short term memory is important because we need to be able to associate information from e.g. a key, legend, or caption with information plotted on the graph. As a result, if you try to plot more than \~6 categories of information, your reader will have to shift between the legend and the graph repeatedly, increasing the amount of cognitive labor required to digest the information in the chart.
Where possible, try to keep your legends to 6 or 7 characteristics.
**Implications and Guidelines**
- Limit the number of categories in your legends to minimize the short term memory demands on your reader.
- When using continuous color schemes, you may want to use a log scale to better show differences in value across orders of magnitude.
- Use colors and symbols which have implicit meaning to minimize the need to refer to the legend.
- Add annotations on the plot, where possible, to reduce the need to re-read captions.
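One way to act on the first guideline is to lump rare categories together before plotting, so the legend never grows past a handful of entries. The helper below is a minimal sketch under that assumption: `lump_categories`, its default of 6 levels, and the example values are hypothetical, not part of any package used in this chapter.

```python
from collections import Counter


def lump_categories(values, max_levels=6, other_label="Other"):
    """Keep the most common categories and relabel the rest as `other_label`.

    The result has at most `max_levels` distinct values, counting `other_label`.
    """
    counts = Counter(values)
    if len(counts) <= max_levels:
        return list(values)
    # Reserve one legend slot for the lumped "Other" group.
    keep = {level for level, _ in counts.most_common(max_levels - 1)}
    return [v if v in keep else other_label for v in values]


# Hypothetical category labels with very unequal frequencies.
values = ["a"] * 5 + ["b"] * 4 + ["c"] * 3 + ["d"] * 2 + ["e"] * 1
lumped = lump_categories(values, max_levels=4)
print(sorted(set(lumped)))
```

Whether lumping is appropriate depends on the question: if the rare categories are the interesting ones, faceting or direct labels may serve the reader better than an "Other" group.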
### Grouping and Sense-making
Imposing order on visual chaos.
::: panel-tabset
#### Ambiguous Images {.unnumbered}
What does @fig-rabbit-duck look like to you?
{#fig-rabbit-duck width="50%" fig-alt="An ambiguous image: when viewed from the right side, it looks like a duck, and when viewed from the bottom side, it looks like a rabbit. The x and y axes are labeled as 'rabbit' and 'duck', respectively."}
When faced with ambiguity, our brains use available context and past experience to try to tip the balance between alternate interpretations of an image. When there is still some ambiguity, many times the brain will just decide to interpret an image as one of the possible options.
#### Illusory Contours {.unnumbered}
{fig-alt="An Illusory contour. It appears to be 3 black circles arranged in a downward-pointing triangle, with a black outline of a triangle pointing upward, and a background-colored (white) downward pointing triangle overlaid on top of the two previously described triangles. In reality, what is shown is a sequence of three pac-man shapes, with the missing pieces oriented inwards, at approximately 30, 150, and 270 degrees from the positive x-axis, with three 60 degree acute angle shapes oriented at 90, 210, and 330 degrees from the positive x-axis. The appearance of two triangles superimposed is an illusory contour that results from Gestalt heuristics." }
Did you see something like "3 circles, a triangle with a black outline, and a white triangle on top of that"? In reality, there are 3 angles and 3 pac-man shapes. But, it's much more likely that we're seeing layers of information, where some of the information is obscured (like the "mouth" of the pac-man circles, or the middle segment of each side of the triangle). This explanation is simpler, and more consistent with our experience.
#### Figure/Ground {.unnumbered}
Now, look at the logo for the Pittsburgh Zoo.
{fig-alt="The logo of the Pittsburgh Zoo, which manipulates figure-ground perception. If the black portion of the image is considered the figure, the viewer sees a tree and some birds flying; if the white portion of the image is considered the figure, then the viewer sees a gorilla facing a lion, with some fish jumping above the waves at the bottom of the image. When viewed for several minutes, the figure and ground tend to 'flip' back and forth." width="80%"}
Do you see the gorilla and lioness? Or do you see a tree? Here, we're not entirely sure which part of the image is the figure and which is the background.
:::
The ambiguous figures shown above demonstrate that our brains are actively imposing order upon the visual stimuli we encounter. There are some heuristics for how this order is applied which impact our perception of statistical graphs.
The catchphrase of Gestalt psychology is
> The whole is greater than the sum of the parts
That is, what we perceive and the meaning we derive from the visual scene is more than the individual components of that visual scene.
{fig-alt="The word GESTALT, written such that there is a white bar over the G, illustrating the principle of closure, the E is made up of black squares, indicating proximity, the S has a bar going through it such that the middle part of the S is behind the bar, illustrating continuation, the Ts are both striped, illustrating similarity, and the A and L are smushed together, with a tree appearing within the A (the bar is missing across the A) to illustrate figure/ground."}
You can read about the gestalt rules [here](https://en.wikipedia.org/wiki/Principles_of_grouping), but they are also demonstrated in the figure above.
In graphics, we can leverage the Gestalt principles of grouping to create order and meaning.
If we color points by another variable, we are creating groups of similar points which assist with the perception of groups instead of individual observations.
If we add a trend line, we create the perception that the points are moving "with" the line (in most cases), or occasionally, that the line is dividing up two groups of points.
Depending on what features of the data you wish to emphasize, you might choose different aesthetics mappings, facet variables, and factor orders.
::: callout-caution
Suppose I want to emphasize the change in life expectancy between 1982 and 2007. For this, we'll use the Gapminder [@gapminder] data, which is available in the `gapminder` packages for both R and Python.
I could use a bar chart (showing only 4 countries for space):
::: panel-tabset
#### R {.unnumbered}
```{r chart-emphasis-bar-r, fig.height = 4, fig.width = 6}
library(gapminder)
library(dplyr)
library(ggplot2)

# Compare life expectancy in 1982 and 2007 for four countries
gapminder %>%
  filter(year %in% c(1982, 2007)) %>%
  filter(country %in% c("Korea, Rep.", "China", "Afghanistan", "India")) %>%
  ggplot(aes(x = country, y = lifeExp, fill = factor(year))) +
  geom_col(position = "dodge") +
  coord_flip() +
  ylab("Life Expectancy")
```
#### Python {.unnumbered}
```{python chart-emphasis-bar-py, fig.height = 4, fig.width = 6}
# %pip install gapminder
from gapminder import gapminder
import pandas as pd
import seaborn.objects as so

# Work on a copy of the gapminder data frame
my_gap = gapminder.copy()

# Keep only the two years being compared
my_gap_82_07 = my_gap[my_gap.year.isin([1982, 2007])]
subdata = my_gap_82_07[my_gap_82_07.country.\
  isin(["Korea, Rep.", "China", "Afghanistan", "India"])]
# Treat year as a category so each year gets its own bar color
subdata = subdata.assign(yearFactor=pd.Categorical(subdata.year))

plot = so.Plot(subdata, x = "country", y = "lifeExp", color = "yearFactor").\
  add(so.Bar(), so.Dodge()).\
  label(y = "Life Expectancy")
plot.show()
```
```{python}
#| include: false
import matplotlib.pyplot as plt
plt.clf()
```
:::
Or, I could use a line chart:
::: panel-tabset
#### R {.unnumbered}
```{r chart-emphasis-line-r, fig.height = 4, fig.width = 6}
gapminder %>%
  filter(year %in% c(1982, 2007)) %>%
  filter(country %in% c("Korea, Rep.", "China", "Afghanistan", "India")) %>%
  ggplot(aes(x = year, y = lifeExp, color = country)) +
  geom_line() +
  ylab("Life Expectancy")
```
#### Python {.unnumbered}
```{python chart-emphasis-line-py, fig.height = 4, fig.width = 6}
# Same four countries, restricted to 1982 and 2007
subdata2 = my_gap_82_07[my_gap_82_07.country.\
  isin(["Korea, Rep.", "China", "Afghanistan", "India"])]

plot = so.Plot(subdata2, x = "year", y = "lifeExp", color = "country").\
  add(so.Lines()).\
  label(y = "Life Expectancy")
plot.show()
```
```{python}
#| include: false
import matplotlib.pyplot as plt
plt.clf()
```
:::
Or, I could use a box plot:
::: panel-tabset
#### R {.unnumbered}
```{r chart-emphasis-boxplot-r, fig.height = 4, fig.width = 6}
gapminder |>
  filter(year %in% c(1982, 2007)) |>
  ggplot(aes(x = factor(year), y = lifeExp)) +
  geom_boxplot() +
  ylab("Life Expectancy")
```
#### Python {.unnumbered}
```{python chart-emphasis-boxplot-py, fig.height = 4, fig.width = 6}
import matplotlib.pyplot as plt
import seaborn as sns

# Treat year as a category so each year gets its own box
subdata3 = my_gap_82_07.assign(yearFactor=pd.Categorical(my_gap_82_07.year))
sns.boxplot(data = subdata3, x = "yearFactor", y = "lifeExp")
plt.show()
```
```{python}
#| include: false
import matplotlib.pyplot as plt
plt.clf()
```
:::
Which one best demonstrates that in every country, the life expectancy increased?
The line segment plot connects related observations (from the same country) and still allows you to assess similarity between the lines (e.g. almost all countries have positive slope).
The same information goes into the creation of the other two plots, but the bar chart is extremely cluttered, and the boxplot doesn't allow you to connect single country observations over time.
So while you can see an aggregate relationship (overall, the average life expectancy increased) you can't see the individual relationships.
:::
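To see why an aggregate view can hide individual relationships, here is a small sketch with made-up numbers. The three countries and their values are hypothetical (they are not gapminder data); the point is only that the average can rise while one country falls.

```python
from statistics import mean

# Hypothetical life expectancies (years) for three made-up countries; NOT gapminder values.
life_exp = {
    "Country X": {1982: 60.0, 2007: 70.0},
    "Country Y": {1982: 50.0, 2007: 62.0},
    "Country Z": {1982: 58.0, 2007: 52.0},
}

# Aggregate view: roughly what a box plot or a chart of averages summarizes.
mean_change = (mean(v[2007] for v in life_exp.values())
               - mean(v[1982] for v in life_exp.values()))

# Individual view: one change per country, which is what each line segment shows.
changes = {country: v[2007] - v[1982] for country, v in life_exp.items()}
decreased = [country for country, change in changes.items() if change < 0]

print(f"Mean change: {mean_change:+.1f} years")
print("Countries that decreased:", decreased)
```

The mean change is positive even though one country's life expectancy dropped. A line chart shows that drop as a single downward-sloping segment, while the aggregate summary hides it.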
The aesthetic mappings and choices you make when creating plots have a huge impact on the conclusions that you (and others) can easily make when examining those plots.[^10-graphics-4]
[^10-graphics-4]: See [this paper](https://doi.org/10.1080/10618600.2016.1209116) for more details. This is the last chapter of my dissertation, for what it's worth. It was a lot of fun. (no sarcasm, seriously, it was fun!)
## Representing Data Accurately
In order to read data off of a chart correctly, several things must happen in sequence:
1. The data must be accurately written to the chart, that is, the transformation from data -> geometry must be accurate
2. The geometric objects that make up the chart must be perceived accurately - the mapping from geometric object size or location to mental model of geometric object size or location must be correct.
3. The mental model of geometric object size or location must be accurately converted to a numerical value. We know that this isn't lossless, but we hope that this is at least reasonably accurate.
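The three steps above can be sketched with a simple linear position scale. This is a toy model, not how any particular plotting library implements its scales: the function name, the 0 to 100 axis, the 400-pixel panel, and the 12-pixel perception error are all assumptions made up for illustration.

```python
def make_linear_scale(domain, pixel_range):
    """Return (to_pixels, to_data) functions for a linear position scale."""
    d0, d1 = domain
    p0, p1 = pixel_range

    def to_pixels(value):
        # Data -> geometry: where the mark is drawn.
        return p0 + (value - d0) / (d1 - d0) * (p1 - p0)

    def to_data(pixel):
        # Geometry -> data: the inverse mapping a reader performs mentally.
        return d0 + (pixel - p0) / (p1 - p0) * (d1 - d0)

    return to_pixels, to_data


# Hypothetical axis: values from 0 to 100 drawn on a 400-pixel-tall panel.
to_pixels, to_data = make_linear_scale(domain=(0, 100), pixel_range=(0, 400))

bar_top = to_pixels(75)            # step 1: data -> geometry
perceived_top = bar_top - 12       # step 2: suppose the reader misjudges by 12 pixels
estimate = to_data(perceived_top)  # step 3: geometry -> number
print(bar_top, estimate)
```

Even when step 1 is exact, a small perceptual error in step 2 carries through to the number the reader takes away in step 3.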
If step 1 is not done correctly, the chart is misleading or inaccurate.
However, steps 2 and 3 depend on our brains accurately **perceiving** and **estimating** information mentally.
These steps can involve a lot of effort, and as mental effort increases, we tend to take shortcuts.
Sometimes, these shortcuts work well, but not always.
When you design a chart, it's good to consider what mental tasks viewers of your chart need to perform.
Then, ask yourself whether there is an equivalent way to represent the data that requires fewer mental operations, or a different representation that requires **easier** mental calculations.
```{r accuracy-guidelines}
#| fig-cap: Which of the lines is the longest? Shortest? It is much easier to determine the relative length of the line when the ends are aligned. The three lines have the same length in both panels, but the operation is much more difficult when the lines do not start or end at the same place.
#| fig-alt: A chart with two panels. The left panel is labeled 'Aligned scale' and features three lines of different lengths which all start at the bottom of the chart. The right panel is labeled 'Unaligned scale' and features three lines of different lengths which start at different locations. It is easier to compare the lengths of the lines in the left panel, since they all start in the same place.
#| fig-width: 8
#| fig-height: 4
#| out-width: 75%
#| echo: false
segs <- bind_rows(
tibble(x = 1:3, y1 = 0, y2 = c(2.5, 2.75, 2.25), type = "Aligned scale"),
tibble(x = 1:3, y1 = c(1, 0, .5), y2 = c(1, 0, .5) + c(2.5, 2.75, 2.25), type = "Unaligned scale")
)
ggplot(segs, aes(x = x, xend = x, y = y1, yend = y2)) +
geom_segment() + facet_wrap(~type) + coord_fixed(ratio = 0.5) +
theme(axis.text = element_blank(), axis.ticks = element_blank(), axis.title = element_blank())
```
When making judgments corresponding to numerical quantities, there is an order of tasks from easiest (1) to hardest (6), with equivalent tasks at the same level.[^10-graphics-5]
[^10-graphics-5]: See [this paper](https://www.doi.org/10.2307/2288400) for the major source of this ranking; other follow-up studies have been integrated, but the essential order is largely unchanged. Interestingly, while it is widely believed that @clevelandGraphicalPerceptionTheory1984 established this hierarchy experimentally, only a few of the items were ranked based on experiments in that paper and the follow-up papers. Most of the items in this ranking are instead a synthesis of different experiments and conceptual knowledge in psychology as well as statistical graphics.
1. Position (common scale)
2. Position (non-aligned scale)
3. Length, Direction, Angle, Slope
4. Area
5. Volume, Density, Curvature
6. Shading, Color Saturation, Color Hue
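One way to use this ranking when planning a chart is to give the most important variable the most accurately judged channel. The lookup below copies the ranking from the list above; the helper `assign_channels` and the example variables are hypothetical, and real design decisions involve more than a sort.

```python
# Ranking copied from the list above: 1 = easiest judgment, 6 = hardest.
TASK_RANK = {
    "position (common scale)": 1,
    "position (non-aligned scale)": 2,
    "length": 3, "direction": 3, "angle": 3, "slope": 3,
    "area": 4,
    "volume": 5, "density": 5, "curvature": 5,
    "shading": 6, "color saturation": 6, "color hue": 6,
}


def assign_channels(variables_by_priority, available_channels):
    """Pair the most important variables with the most accurately judged channels."""
    ranked = sorted(available_channels, key=TASK_RANK.__getitem__)
    return dict(zip(variables_by_priority, ranked))


# Hypothetical plan: variables listed from most to least important.
plan = assign_channels(
    ["life expectancy", "population", "continent"],
    ["color hue", "area", "position (common scale)"],
)
print(plan)
```

Here the variable that needs the most precise reading ends up on a position scale, while the least critical one is left to color hue.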
If we compare a pie chart and a stacked bar chart, the bar chart asks readers to make judgments of position on a non-aligned scale, while a pie chart asks readers to assess angle.
This is one reason why pie charts tend not to be a good general option -- people must compare values using area or angle instead of position or length, which is a more difficult judgment under most circumstances.
When there are a limited number of categories (2-4) and you have data that is easily compared to quarters of a circle, it may be justifiable to use a pie chart over a stacked bar chart - some studies have shown that pie charts are preferable under these conditions.
As a general rule, though, we have an easier time comparing position than angle or area.
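To make the comparison concrete, the sketch below computes the geometry a reader has to decode in each chart. The three shares are hypothetical round numbers (not the gapminder populations), chosen so that they line up with halves and quarters of a circle.

```python
from itertools import accumulate

# Hypothetical shares for three categories (not the gapminder populations).
shares = {"A": 0.5, "B": 0.25, "C": 0.25}

# Stacked bar: each segment ends at a cumulative position on a 0-1 scale.
segment_ends = dict(zip(shares, accumulate(shares.values())))
segment_starts = {k: end - shares[k] for k, end in segment_ends.items()}

# Pie chart: each share becomes an angle out of 360 degrees.
angles = {k: share * 360 for k, share in shares.items()}

for k in shares:
    print(f"{k}: bar from {segment_starts[k]:.2f} to {segment_ends[k]:.2f}, "
          f"pie angle {angles[k]:.0f} degrees")
```

Only the first bar segment starts at the baseline. To read B or C from the stacked bar, the reader has to subtract two positions or judge a length, whereas in the pie each of them is exactly a quarter of the circle. This is the situation in which a pie chart can hold its own.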
```{r pie-vs-bar}
#| echo: false
#| fig-width: 4
#| fig-height: 4
#| layout-ncol: 2
#| fig-subcap: ["Stacked bar chart", "Pie chart"]
#| fig-cap: Stacked bar and pie charts showing the relative proportion of people in North America living in the US, Canada, and Mexico in 2007. Which chart is easier to read relative information (e.g. there are about 3x as many people living in Mexico as Canada) from? Which chart is easier to estimate raw proportions (e.g. the US makes up about 70% of the population of North America) from?
#| fig-alt: ['A stacked bar chart showing the proportion of people living in North America by country', 'A pie chart showing the proportion of people living in North America by country']
gap_nam <- filter(gapminder, country %in% c("Canada", "United States", "Mexico"))
filter(gap_nam, year == 2007) |>
ggplot(aes(fill = country, x = 1, weight = pop)) +
geom_bar(position = "fill") +
ylab("% Population") +
scale_x_continuous(limits = c(0.5, 1.5), breaks = NULL, name = "") +
coord_fixed() +
ggtitle("Relative Population of North America\nby Country, 2007") +
scale_fill_manual(values = c("Canada" = "red", "Mexico" = "green4", "United States" = "blue"))
filter(gap_nam, year == 2007) |>
ggplot(aes(fill = country, x = 1, weight = pop)) +
geom_bar(position = "fill") +
ylab("% Population") +
scale_x_continuous(limits = c(0.5, 1.5), breaks = NULL, name = "") +
ggtitle("Relative Population of North America\nby Country, 2007") +
scale_fill_manual(values = c("Canada" = "red", "Mexico" = "green4", "United States" = "blue")) +
coord_polar(theta = "y")
```
When creating a chart, it is helpful to consider which variables you want to show, and how accurate reader perception needs to be to get useful information from the chart.
In many cases, less is more - you can easily overload someone, which may keep them from engaging with your chart at all.
Variables which require the reader to notice small changes should be shown on position scales (x, y) rather than using color, alpha blending, etc.
Consider the hierarchy of graphical tasks again.
You may notice a general increase in dimensionality from levels 1-3 (largely one-dimensional judgments) to level 4 (area, 2D) to level 5 (volume, 3D).
In general, showing information in 3 dimensions when 2 will suffice can be misleading.
Just *how* misleading depends a lot on the type of chart you're using.
Most of the time, the addition of an extra dimension causes an increase in chart area allocated to the item that is disproportionate to the actual numerical value being represented.
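One common way this disproportion arises is when every dimension of a shape is scaled by the data value. The arithmetic below is a sketch of that single mechanism, under the assumption that the shape is scaled uniformly; it is not a description of how any specific chart was drawn.

```python
def drawn_size_ratio(value_ratio, dimensions):
    """Ratio of drawn sizes when every dimension of a shape is scaled by the value ratio."""
    return value_ratio ** dimensions


value_ratio = 2  # one quantity is twice as large as another
for dimensions, label in [(1, "length"), (2, "area"), (3, "volume")]:
    print(f"{label}: drawn {drawn_size_ratio(value_ratio, dimensions)}x as large")
```

A value that is twice as large is drawn with four times the area or eight times the volume, so the reader sees a much bigger difference than the data supports.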
```{r disproportionate-pixels, include = F}
imgs <- list.files("../images/wrangling", "3d_pie", full.names = T)
tmp <- tibble(portfolio = c("Cash", "Bond", "Stocks"),
pct = c(.05, .35, .5),
img = purrr::map(imgs, ~png::readPNG(source = .)),
px = purrr::map_dbl(img, ~sum(.[,,4] != 0 ))) %>%
mutate(px_pct = px/sum(px))
tmp
```
{#fig-disproportionate-pixels fig-alt="A three-dimensional pie chart, where each of the sections has a different height - this seems to double-encode the value represented by the angle of the pie slice. As a result, the area of the chart devoted to one sector is vastly larger than the area of the chart devoted to the other sectors, even though the true values are not nearly so different."}
[TED-Ed: How to spot a misleading graph - Lea Gaslowitz](https://youtu.be/E91bGT9BjYk)
[Business Insider: The Worst Graphs Ever](https://www.businessinsider.com/the-27-worst-charts-of-all-time-2013-6)
Extra dimensions and other annotations are sometimes called "chartjunk" and should only be used if they contribute to the overall numerical accuracy of the chart (i.e. they should not just be for decoration).
## References {#sec-good-graphics-refs}