-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy pathquantileplot.Rmd
180 lines (133 loc) · 7.84 KB
/
quantileplot.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
---
title: "quantileplot"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{quantileplot}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
dpi = 300,
fig.width = 6.5,
fig.height = 4,
out.width = "650px"
)
```
This R package has a companion working paper.
>Lundberg, Ian, Robin C. Lee, and Brandon M. Stewart. 2021. ["The quantile plot: A visualization for bivariate population relationships."](https://ilundberg.github.io/quantileplot/doc/explainer.html) Working paper.
The `quantileplot` package visualizes bivariate associations. When summarizing bivariate data, a best-fit regression line often obscures important trends---it is too simple. A scatter plot shows all the data but overwhelms the viewer---it is too complicated. A `quantileplot` is a middle ground with three components.
1. The marginal density of the predictor variable.
2. The conditional density of the outcome at selected values of the predictor.
3. Smooth curves for conditional quantiles of the outcome as functions of the predictor.
A `quantileplot` may be useful in many settings. One example is when
* the predictor is skewed, so that (1) conveys useful information,
* the outcome is skewed, so that (2) conveys useful information, and
* the outcome is heteroskedastic, so that (3) conveys a set oof distinct trends.
This vignette illustrates the package functionality with a simulated example.
# Installation
The `quantileplot` package is available via GitHub.
* First, install the `devtools` package
* Then, type `devtools::install_github("ilundberg/quantileplot")`
# Basic functionality
Suppose we have a continuous predictor $X$ and a continuous outcome $Y$.
```{r}
library(quantileplot)
x <- rbeta(1000,1,2)
y <- log(1 + 9 * x) * rbeta(1000, 1, 2)
sim_data <- data.frame(x = x, y = y)
```
A call to the `quantileplot` function, produces the most basic quantile plot.
```{r, results = F}
quantileplot(y ~ s(x), data = sim_data)
```
A call to `quantileplot` may take some time to compute. In the simulated setting above, the time was 4 seconds for 1,000 observations and 24 seconds for 10,000 observations.
# Visualizing uncertainty
This most basic version of the visualization does not present any estimates of uncertainty. There are two options for visualizing uncertainty.
## Uncertainty method 1: Pointwise confidence bands
The argument `show_ci = T` will layer 95\% pointwise credible interval bands on top of the estimated quantile curves. A credible interval is the Bayesian analog to a confidence interval, here supported by the default variance estimation methods in the `qgam` package.
```{r, results = F}
quantileplot(y ~ s(x), data = sim_data, show_ci = T)
```
Importantly, these are **pointwise** credible intervals. At each predictor value $X = x$, they are designed to contain the middle 95\% of the posterior distribution of the quantile of $Y$ given $X = x$. This is distinct from uncertainty statements about the entire curve. For example, it would be incorrect to conclude that over 95\% of repeated draws the entire curve would fall within the plotted band.
The `ci` argument allows the user to specify a credible value other than 0.95 (e.g. for 90\% confidence bands.)
## Uncertainty method 2: Posterior samples of curves
One may want to visualize uncertainty by simulating a series of hypothetical curves. The argument `uncertainty_draws = 10` will add a panel of plots below the main plot. In each plot, the 10 solid black line depicts the point estimate of the curve. The gray lines depict curves sampled from the posterior distribution.
```{r, results = F, fig.height = 5}
quantileplot(y ~ s(x), data = sim_data, uncertainty_draws = 10)
```
Importantly, these curves are not like a confidence interval. They do not show the user a range such that the truth would fall in that range with some probability. Instead, they are useful to convey to the viewer the more basic idea that the estimated curve is only our best guess for a curve that is statistically uncertain.
# Visualizing the raw data
After creating a `quantileplot` object, you can convert that object into an analogous scatter plot to visualize the raw data. If desired in a large sample, you can use the `fraction` argument to plot some random fraction of the raw data.
```{r, results = F, fig.height = 5}
qp <- quantileplot(y ~ s(x), data = sim_data)
scatter.quantileplot(qp, fraction = .5)
```
# Customizing the visualization
There are two ways to customize the plot: with arguments and by manually modifying the resulting `ggplot2` object.
## Customization 1: Arguments to `quantileplot`
You can customize many features of the plot with arguments to the `quantileplot` function.
```{r, results = F}
quantileplot(
y ~ s(x),
data = sim_data,
# Provide axis titles
xlab = "A name for the predictor",
ylab = "A name for the outcome",
# Customize the number of vertical slices
slice_n = 3,
# Customize which quantiles are depicted
quantiles = c(.3,.5,.7),
# Denote quantiles by colors with labels instead of colors
quantile_notation = "label"
)
```
If predictors are extremely skewed, you may only want to visualize part of the space. For example, the code below restricts the visualization to the region of $X\in(0,.5)$ and $Y\in (0,1)$.
```{r, results = F}
quantileplot(y ~ s(x),
data = sim_data,
x_data_range = c(0,.5),
y_data_range = c(0,1),
show_ci = T)
```
Note that the vertical densities are redistributed across the user-specified `x_data_range`.
## Customization 2: Make modifications as in `ggplot2`
Because the basic `quantileplot` contains a `ggplot2` object in the `plot` element, you can modify the output by providing plotting layers. For instance, you can add a custom title, change colors, modify axes, etc.
```{r, results = F}
library(ggplot2)
my_plot <- quantileplot(y ~ s(x), data = sim_data)
my_plot$plot +
ggtitle("A custom title for the plot") +
theme_light() +
scale_color_manual(values = rainbow(5),
guide = guide_legend(reverse = TRUE, label.position = "left",
title = "Custom\nlegend\ntitle and\ncolors")) +
xlab(expression(Custom~axis~title~could~have~something~bold(bold))) +
scale_y_continuous(breaks = c(0,1,2),
labels = c("Custom\nlabel at 0",
"Another\ncustom\nlabel at 1",
"Custom\nlabel at 2"),
name = "Custom y-axis\nbreaks and\nrotated title") +
theme(axis.title.y = element_text(angle = 0, vjust = .5))
```
If you want to modify the axis limits after the fact, you need to use `coord_cartesian()` because this is how `quantileplot()` set the axis limits.
```{r, results = F}
my_plot <- quantileplot(y ~ s(x), data = sim_data, quantile_notation = "label")
my_plot$plot +
coord_cartesian(xlim = c(0,2)) +
annotate(geom = "label", x = 1.75, y = 1,
label = "Extra space\nadded with\ncoord_cartesian()\nto allow more\nroom for\nannotations.",
size = 3)
```
## Additional notes
In some settings, estimation of quantile curves is computationally intensive. If you are modifying some aspect of the densities and want to use a cache from a previous call for the quantile curves, you can use the \code{previous_fit} argument.
```{r, results = "hide"}
quantileplot(y ~ s(x), data = sim_data, previous_fit = my_plot)
```
You can also pass other arguments to \code{mqgam} through the \code{...} at the end of the function call. For instance, you can specify rather than learn the log learning rate as discussed in the \code{mqgam} documentation. Note how these two plots have very different wiggliness of the estimated quantile curves.
```{r, results = F}
quantileplot(y ~ s(x), data = sim_data, lsig = log(10))
quantileplot(y ~ s(x), data = sim_data, lsig = log(.01))
```