-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathspline_review_presentation.Rmd
425 lines (294 loc) · 11 KB
/
spline_review_presentation.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
---
title: "Three Levels of Spline Models: "
subtitle: 'Understanding, Application and Beyond'
author: "Boyi Guo"
institute: |
| Department of Biostatistics
| University of Alabama at Birmingham
date: "April 8th, 2021"
output:
beamer_presentation:
theme: "Szeged"
colortheme: "spruce"
toc: FALSE
number_section: false
slide_level: 2
classoption: "aspectratio=169"
header-includes:
- \usepackage{bm}
bibliography:
- bibliography.bib
nocite: '@*'
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(ggpubr)
library(viridis)
```
# Who Am I?
- 4th-year Ph.D. student in BST @ UAB
- Dissertation: Bayesian high-dimensional additive models
- Background:
- Balanced methodology & collaboration
- Experienced R programmer & package creator
- Graduate in about 1 year, Looking for
- Faculty postion in Biostat
- Post-doc in methodology dev. on HD, causal inference
<!-- Enough commercial and self promotion, lets get to the topic-->
# Overview
* Understanding
* Spline Concepts
* Regression Splines
* Application
* Non-linear Effect Modifier
* Non-proportional Hazard Models
* Generalized Additive Mixed Model
* Beyond
* Spline Surface
* Smoothing Splines
* Function Selection in High Dimension
# Objectives
* To review the basic concepts of spline
* To raise awareness of advanced spline applications
<!-- Plan analysis accordingly -->
Disclaimer
* Minimum level of theoretical justification
* No discussion on model fitting algorithms or software implementations
<!-- primarily R user, Hence, less familiar of spline implementation in other software -->
# Understanding
## Motivation
> "It is extremely unlikely that the true (effect) function f(X) (_on the outcome_) is actually linear in X."
\hspace*{2cm}
> --- @hastie2009elements PP. 139
<!-- For example, the relationship between age and COVID mortality. -->
## Previous Solutions:
* Variable categorization: e.g. using quartiles of a continuous variable in a model
* Assume all subjects within a group shares the same risk/effect
* Loss of data fidelity
* Polynomial regression:
$$
y = \beta_0 + \beta_1X + \beta_2 X^2 + \dots + \beta_mX^m + \epsilon
$$
* Precision issues, e.g. $X$ is blood pressure measure, and $X^3$ would be extremely large
* Goodness of fit: deciding which order of polynomial term should be included
## Spline
* A spline is a piece-wise function where each piece is a polynomial function of order $m$
* A.k.a. non-parametric regression, semi-parametric regression, (generalized) additive model
* Can be easily incorporated in linear regression, generalized linear regression, Cox regression, as __regression splines__
## Spline Components
* Order/degree of the polynomial function, $m$
* Normally, $m=3$, i.e. cubic spline is sufficient
* An increasing breakpoints sequence $\tau$
* a.k.a. knots, where the piece-wise functions joint
* e.g. $k \equiv |\tau| = 5$, equally spaced
* Continuity conditions at knots, $v$
* to control the smoothness between pieces
* e.g. continuous at second derivative for cubic spline
## Toy Example
A spline function of the variable $X$, $f(X)$, of order $m=0$ with $k=2$ knots ($\tau_1 = 1, \tau_2 = 5$) and no continuity condition
\begin{equation*}
f(X) =
\begin{cases}
2, & X \leq 1 \\
1.2, & 1 < X \leq 5\\
1.5, & X > 5
\end{cases}
\end{equation*}
<!-- This is the business of column space force, aka linear algebra mathematicians. -->
## Visual Demonstration
![](Figure/linear_spline.jpg){height=80%}
Figure from @hastie2009elements PP.142
## Cubic Spline
* Cubic polynomial in each piece-wise function, i.e. m = 3
* E.g. truncated power bases with 3 knots at $\tau_1, \tau_2, \tau_3$
\begin{align*}
f(X) &= \beta_0 + \beta_1 X+\beta_2 X^2 + \beta_3 X^3 + \beta_4(X-\tau_1)^3_+\\
& + \beta_5(X-\tau_2)^3_+ + \beta_6(X-\tau_3)^3_+\\
&= \boldsymbol \beta^T \bm B(X)
\end{align*}
* Continuous at second derivative
* The smoothest possible interpolant
* Alternative representation
* B-spline bases for stable computation
* Natural cubic spline for linearity beyond boundary knots ($f^{'''}(X) = 0$)
## Cubic Spline
```{r, echo=FALSE, out.height="75%", fig.align='center'}
knitr::include_graphics("Figure/cubic_spline.jpg")
```
Figure from @hastie2009elements PP.143
## Regression Splines
Given the matrix form of the spline function $f(X) = \bm \beta^T \bm B(X)$,
* Linear regression:
$$y_i \sim N(\bm \beta^T \bm B(X_i) + \bm \beta_{cov}^T \bm Z_i, \sigma^2)$$
* Generalized linear regression:
$$E(y_i) = g^{-1}(\bm \beta^T \bm B(X_i) + \bm \beta_{cov}^T \bm Z_i), Y_i \sim EF$$
* Cox regression:
$$h(t_i) = h_0(t_i) exp(\bm \beta^T \bm B(X_i) + \bm \beta_{cov}^T \bm Z_i)$$
Model fitting and diagnostic remain the same
## Software Implementation
Two-step procedure
* Create the 'design' matrix of the spline function $B(X)$
* Fit the preferred model including $B(X)$ as covariates / predictors
```{r eval=FALSE}
library(splines) # Package for b-spline
x_spline <- bs(x, degree = 3, # cubic polynomial
df = 8) # 5 (df-degree) knots
glm(y ~ x_spline) # Fitting the spline model
# Equivalently
glm(y ~ bs(x, degree=3, df=8))
```
## Variability Band
* A delicate statistical problem
* Confidence about spline functions VS point estimates
* Most commonly used: 95% point-wise confidence interval
* Can be calculate using statistical contrasts for regression splines
## Hypothesis Testing
* Two hypothesis tests
* If the non-linear terms are necessary:
$$
H_0: \beta_2 = \beta_3 = \dots = 0
$$
* If the variable is necessary in the model
$$
H_0: f(x) = 0
$$
* Be careful when reading program manual
## Rule of Thumb
* Cubic splines for smooth interpolant
* B-spline for computation stability
* 3-5 equally spaced knots
* Transform variables with extreme values for computational stability
* e.g. prefer $f(\log(X))$ over $f(X)$ when modeling CRP
* Examine outlier's effect on statistically significant non-linear relationship
* Survival Model
* Knots are decided by equal number of events in each group
* Defer to @Sleeper1990 for practical guidance
<!-- Insert jokes about rule of thumb -->
# Application
## Varying Coefficient
To model a non-constant effect of the variable $Z$ as a function of another variable $X$
$$
E(y) = f(X) Z,
$$
where $f(X)$ is the varying coefficient of Z
* Example: statistical interaction $\beta XZ$ where $f(X) = \beta X$
* What if the slope of the effect are not constant across the domain of $X$?
## Non-linear Effect Modification
```{r echo=F, fig.align='center', out.height="80%"}
dat <- data.frame(
X = rep(seq(1, 3, length.out=100), 2),
Z = c(rep(0,100),rep(1,100)) %>% factor
) %>%
mutate(
y_ln = 1+0.3*(as.numeric(Z)-1) + 0.5*X*(as.numeric(Z)-1),
y_sp = 1+0.3*(as.numeric(Z)-1) + 0.5*X*(as.numeric(Z)-1) + 0.3*sin(10*X)*(as.numeric(Z)-1)
)
ggarrange(
ggplot(dat) +
geom_line(aes(x = X, y=y_ln, color = Z, group = Z), size = 1)+
ylim(0,3)+
theme_pubclean() +
scale_color_viridis(discrete=TRUE, option = "D"),
ggplot(dat) +
geom_line(aes(x = X, y=y_sp, color = Z, group = Z), size = 1)+
ylim(0,3)+
scale_color_viridis(discrete=TRUE, option = "D")+
theme_pubclean(),
common.legend = TRUE,
legend = "bottom"
)
```
## Non-linear Effect Modification
$$
E(y) = f(X) + f^\prime(X)Z = \beta_{Z=0}^T B(X) + \beta_{Z=1}^TB(X)*Z
$$
* $f(X)$ models the effect of $X$ when $Z=0$
* $f^\prime(X)Z$ models the modifying effect of $Z$ at different values of $X$
* $f^\prime(X)$ is the varying coefficient of $Z$, using a non-linear function, for non-constant slope.
## Non-linear Effect Modification
* Assumptions of consideration
* Should $f(X)$ be linear or non-linear?
* Should $f(X)$ use the same bases as $f^\prime(X)$?
* Should $f(X)$ be the same level of complexity as $f^\prime(X)$?
## Non-proportional Hazard
* Cox PH model assumes proportional hazards, i.e. the hazard/effect of a variable $X$ is independent to time
* Using Time-varying coefficients to model the non-proportional hazards
$$
h(t) = h_0(t)exp(f(t)X)
$$
* Defer to @Gray1992 and references therein
## Mixed Model
To model the non-linear fixed effect while considering random effects
* Good for longitudinal studies or multi-center studies
* Easy to implement: to include your design matrix of $\bm B(X)$ in the fixed effect
* `gamm` in R-package `mgcv`
# Beyond
## Spline Surface
* Model the non-linear interaction between two continuous variables
* Thin-plate splines, tensor product splines
* Thin-plate spline is scale-sensitive
* Recommended when variables are on the same scale
* Tensor product spline is scale-invariant
* Dealing with _over smoothing across boundary_
* Soap film smoothing
* Application:
* Loop, M. S., Howard, G., de Los Campos, G., Al-Hamdan, M. Z., Safford, M. M., Levitan, E. B., & McClure, L. A. (2017). Heat maps of hypertension, diabetes mellitus, and smoking in the continental United States. __Circulation: Cardiovascular Quality and Outcomes, 10(1), e003350.__
## Smoothing Spline
* Motivation:
* To simplify the decision making about the knots
* Idea:
* Set the number of knots to a really large value (k=25, 40, $N$)
* Use variable selection methods, penalized models specifically, to decide the smoothness of the spline
## Objective Functions
Given a spline model $y \sim N(f(X), \sigma^2)$
* Regression spline
$$
\text{arg}\min\limits_{\bm \beta} \sum\limits^n_{i=1} \{y_i - \beta^T B(X_i)\}^2
$$
* Smoothing spline
$$
\text{arg}\min\limits_{\bm \beta} \sum\limits^n_{i=1} \{y_i - \beta^T B(X_i)\}^2 + \lambda \int f^{''}(X)^2dx
$$
* $\lambda$ is a tuning parameter, selected via (generalized) cross-validation
## Statistical Complications
* Estimated degree of freedom due to shrinkage
* Harder to conduct hypothesis testing, and calculate CI
* More decisions when modeling effect modification
* Same smoothness for the spline functions?
* If the same, how to estimate the smoothness
## Function Selection
* Question of interest
* If a variable $X$ has effect on the outcome $Y$
* High-dimensional data analysis, e.g. EHR, Genomics
* Solutions
* Step-wise function selection
* Locally optimal solution
* Not feasible for high-dimensional analysis
* Group penalized models
* Biased estimation
* Global penalization vs local penalization
* Bayesian hierarchical models
* Robust estimation
* Slow...
# Conclusion
* Reviewed concepts of spline
* New insight of advanced spline models
* Same set of variables can lead to many models with different assumptions
* Fit many models and compare
* Explore the inconsistency
* Balance between interpolation and prediction
* "Black box" models for improved prediction
* __Consult with statisticians when not comfortable dealing spline models__
## Great Book
::: columns
:::: column
Wood, S. N. (2017). Generalized additive models: an introduction with R. CRC press.
* Chapter 7 for examples
::::
:::: column
![](Figure/simon_book.jpg){height=70%}
::::
:::
# Q & A
# Reference