-
Notifications
You must be signed in to change notification settings - Fork 18
/
nested.Rmd
303 lines (282 loc) · 10.9 KB
/
nested.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
# Nested Forward Mode
Forward-mode automatic differentiation is defined with respect to an
arbitrary scalar type. With a real scalar type, forward mode computes
derivatives. By taking the scalar inside forward mode to itself be an
automatic differentiation variable, higher-order derivatives may be
calculated, even though only first-order derivatives need be defined.
## Forward nested in forward
Consider a dual number $\langle u, \dot{u} \rangle,$ where $u$ is a
scalar variable and $\dot{u} = \frac{\partial u}{\partial x}$ for some
distinguished independent variable $x$. Suppose that instead of a
simple scalar, $u = \langle v, \acute{v} \rangle$ is itself a
forward-mode autodiff variable defining derivatives with respect to
$w$, so that $\acute{v} = \frac{\partial v}{\partial w}.$ A different
diacritic is used above the nested variable to distinguish it from the
outer diacritic dual number. A nested forward-mode autodiff variable
is thus of the form $\big\langle \langle v, \acute{v} \rangle, \ \langle
\dot{v}, \dot{\acute{v}} \rangle \big\rangle,$ where
$$
\acute{v} = \frac{\partial v}{\partial w}
\qquad
\dot{v} = \frac{\partial v}{\partial x}
\qquad
\dot{\acute{v}} = \frac{\partial^2 v}{\partial x \partial w}.
$$
The tangent values of nested forward-mode autodiff variables are
initialized following the derivatives. For the input variable $w,$
the initial nested autodiff variable is
$$
\begin{array}{rcl}
\langle \langle w, \acute{w} \rangle, \,
\langle \dot{w}, \dot{\acute{w}} \rangle \rangle
& = &
\langle \langle w, \, \frac{\partial w}{\partial w} \rangle, \
\langle \frac{\partial w}{\partial x}, \, \frac{\partial^2
w}{\partial x \partial w} \rangle \rangle
\\[4pt]
& = & \langle \langle w, 1 \rangle, \, \langle 0, 0 \rangle \rangle.
\end{array}
$$
Input variable $x$ corresponds to an autodiff variable
$$
\begin{array}{rcl}
\langle \langle x, \acute{x} \rangle, \,
\langle \dot{x}, \dot{\acute{x}} \rangle \rangle
& = &
\langle \langle x, \, \frac{\partial x}{\partial w} \rangle, \
\langle \frac{\partial x}{\partial x}, \, \frac{\partial^2
x}{\partial x \partial w} \rangle \rangle
\\[4pt]
& = & \langle \langle x, 0 \rangle, \, \langle 1, 0 \rangle \rangle.
\end{array}
$$
Any other input variable $u$ such that $u \neq x$ and $u \neq y$ will
be initialized with nested autodiff variable
$$
\begin{array}{rcl}
\langle \langle u, \acute{u} \rangle, \,
\langle \dot{u}, \dot{\acute{u}} \rangle \rangle
& = &
\langle \langle u, \, \frac{\partial u}{\partial w} \rangle, \
\langle \frac{\partial u}{\partial x}, \, \frac{\partial^2
u}{\partial x \partial w} \rangle \rangle
\\[4pt]
& = & \langle \langle x, 0 \rangle, \, \langle 0, 0 \rangle \rangle.
\end{array}
$$
Recall the definition of exponentiation and multiplication for
forward-mode autodiff variables,
$$
\exp(\langle u, \dot{u} \rangle)
= \langle \exp(u), \exp(u) \cdot \dot{u} \rangle,
$$
and
$$
\langle u, \dot{u} \rangle
\cdot \langle z, \dot{z} \rangle
= \langle u \cdot z, \dot{u} \cdot z + u \cdot \dot{z} \rangle.
$$
Plugging $\langle v, \acute{v} \rangle$ in for $u$ in the forward-mode
rule for exponentiation,
$$
\begin{array}{rcl}
\exp\!\big(\big\langle \langle v, \acute{v} \rangle, \
\langle \dot{v}, \dot{\acute{v}} \rangle \big\rangle\big)
& = &
\big\langle \exp(\langle v, \acute{v} \rangle), \
\exp(\langle v, \acute{v} \rangle)
\cdot \langle \dot{v}, \, \dot{\acute v} \rangle
\big\rangle
\\[4pt]
& = &
\big\langle \langle \exp(v), \, \exp(v) \cdot \acute{v} \rangle, \ \
\langle \exp(v), \, \exp(v) \cdot \acute{v} \rangle
\cdot \langle \dot{v}, \ \dot{\acute v} \rangle
\big\rangle
\\[4pt]
& = &
\big\langle \langle \exp(v), \, \exp(v) \cdot \acute{v} \rangle, \ \
\langle \exp(v) \cdot \dot{v}, \,
\exp(v) \cdot \acute{v} \cdot \dot{v}
+ \exp(v) \cdot \dot{\acute{v}} \rangle
\big\rangle
\end{array}
$$
The value is correct,
$$
\exp(v) = \exp(v).
$$
The first nested derivative is also correct,
$$
\frac{\partial \exp(v)}{\partial w} = \exp(v) \cdot \frac{\partial
v}{\partial w} = \exp(v) \cdot \acute{v},
$$
as is the other nested derivative,
$$
\frac{\partial \exp(v)}{\partial x} = \exp(v) \cdot \frac{\partial
v}{\partial x} = \exp(v) \cdot \dot{v}.
$$
The second derivative also checks out,
$$
\begin{array}{rcl}
\frac{\partial^2 \exp(v)}{\partial x \partial w}
& = & \frac{\partial}{\partial x} \frac{\partial \exp(v)}{\partial w}
\\[4pt]
& = & \frac{\partial}{\partial x}
\left( \exp(v) \cdot \frac{\partial v}{\partial w} \right)
\\[4pt]
& = &
\frac{\partial}{\partial x}(\exp(v))
\cdot \frac{\partial v}{\partial w}
+ \exp(v) \cdot \frac{\partial}{\partial x} \frac{\partial v}{\partial w}
\\[4pt]
& = &
\exp(v) \cdot \frac{\partial v}{\partial x} \cdot \frac{\partial v}{\partial w}
+ \exp(v) \cdot \frac{\partial}{\partial x} \frac{\partial v}{\partial w}
\\[4pt]
& = &
\exp(v) \cdot \dot{v} \cdot \acute{v}
+ \exp(v) \cdot \dot{\acute{v}}.
\end{array}
$$
Working through a complete example, suppose
$$
f(a, b, c) = a \cdot \exp(b \cdot c).
$$
To evaluate $\frac{\partial^2}{\partial a \partial b}
f(2.1, 1.5, -0.3),$ the dual numbers will be of the form
$$
\langle
\langle u, \acute{u} \rangle, \,
\langle \dot{u}, \dot{\acute{u}} \rangle
\rangle
=
\langle
\langle u, \frac{\partial u}{\partial a} \rangle, \,
\langle \frac{\partial u}{\partial b}, \frac{\partial^2 u}{\partial
a \partial b} \rangle
\rangle.
$$
Thus the variables are initialized with appropriate derivatives,
$$
\begin{array}{rcl}
a & \Rightarrow & \langle \langle 2.1, 1 \rangle, \, \langle 0, 0 \rangle \rangle
\\
b & \Rightarrow & \langle \langle 1.5, 0 \rangle, \, \langle 1, 0 \rangle \rangle
\\
c & \Rightarrow & \langle \langle -0.3, 0 \rangle, \, \langle 0, 0 \rangle \rangle.
\end{array}
$$
Then evaluation is carried out from most nested outward, with
$$
\begin{array}{rcl}
b \cdot c
& \Rightarrow &
\langle \langle -0.45, 0 \rangle, \, \langle -0.3, 0 \rangle \rangle
\\[4pt]
\exp(b \cdot c)
& \Rightarrow &
\langle \langle 0.64, 0 \rangle, \, \langle -0.19, 0 \rangle \rangle
\\[4pt]
a \cdot \exp(b \cdot c)
& \Rightarrow &
\langle \langle 1.3, 0.64 \rangle, \, \langle -0.40, -0.19 \rangle \rangle
\end{array}
$$
Reading the results out of the nested dual number, the value is 1.3,
the derivative with respect to $a$ is 0.64, and the derivative with
respect to $b$ is -0.4, and the second derivative with respect to $a$
and $b$ is -0.19. The analytic second derivative is
$$
\frac{\partial}{\partial a \, \partial b} a \cdot \exp(b \cdot c)
=
c \cdot \exp(b \cdot c).
$$
Plugging in $a = 2.1, b = 1.5, c = -0.3$ yields $-0.19$, matching the
results obtained through automatic differentiation.
The simplicity of the nested method derives from never having to
define anything other than first derivatives. Computing second-order
derivatives of the exponential only required the rules for first-order
derivatives of exponentiation and first-order derivatives of products.
## Hessians with forward mode
The Hessian matrix of all second derivatives for a function
$f:\mathbb{R}^N \rightarrow \mathbb{R}$ may be computed by running
forward-mode autodiff $\binom{N}{2} + N$ times, once for each unique
pair ${m, n}$ and one for second derivatives with respect to a single
variable. Assuming evaluation of $f(x)$ for $x \in \mathbb{R}^N$ is
$\mathcal{O}(g(N))$, forward-mode autodiff can be used to compute
Hessians in $\mathcal{O}(N^2 \cdot g(N))$ time and $\mathcal{O}(N^2)$ space.
## Third-order derivatives and beyond
The same logic applies for further nesting of forward mode within
forward mode within forward mode. The result will have eight nested
values (four for the value and four for the tangent). From most to
least nested, these values represent third-order derivatives
$\frac{\partial^3}{\partial v \partial w \partial x},$ the three pairs
of second-order derivatives, the three first-order derivatives, and
the value. Third derivatives are simple to compute this way, but
provide quite the bookkeepoing obstacle to manual computation, as
should be clear from the explicit evaluation of second-order
derivatives for the exponential function using nested forward mode.
The nesting may be iterated to compute fourth-order derivatives, etc.
## Reverse nested in forward
A more efficient way to compute Hessians is to nest reverse-mode
autodiff in forward-mode autodiff. Thus rather than scalars in
$\langle u, \dot{u} \rangle,$ the $u$ are taken to be reverse-mode
autodiff variables. The forward pass records the subexpressions used
in the evaluation of $f(\langle u, \dot{u} \rangle)$ as before. The
reverse pass is then run from the final result $\dot{u}$, to compute
the adjoint $\overline{\dot{u}}.$ This is possible because all the
calculations used to produce $\dot{u}$ are scalar operations.
Suppose $f : \mathbb{R}^N \rightarrow \mathbb{R}$. Let $\dot{u} =
\frac{\partial u}{\partial x_n}$ for some $n \in 1{:}N.$
If $y = f(x),$ then $\dot{y} = \frac{\partial y}{\partial x_n},$ and
reverse-mode autodiff started from $\dot{y}$ will compute
$$
\overline{\dot{y}}
= \nabla \frac{\partial f(x)}{\partial x_n}
=
\begin{bmatrix}
\frac{\partial}{\partial x_1} \frac{\partial f(x)}{\partial x_n}
\ \cdots \
\frac{\partial}{\partial x_N} \frac{\partial f(x)}{\partial x_n}
\end{bmatrix}
=
\begin{bmatrix}
\frac{\partial^2 f(x)}{\partial x_1 \partial x_n}
\ \cdots \
\frac{\partial^2 f(x)}{\partial x_N \partial x_n}
\end{bmatrix}.
$$
## Computing Hessians with reverse nested in forward
Applying the adjoint method to a forward-mode derivative results in a
row of the Hessian matrix. Thus to compute a Hessian using reverse
mode nested in forward-mode autodiff requires $N$ function
evaluations, one for each row $\nabla \frac{\partial}{\partial x_m}
f(x)$ of the Hessian. Assuming each function evaluation is
$\mathcal{O}(g(n))$ leads to Hessian complexity of $\mathcal{O}(N
\cdot g(N))$ time and $\mathcal{O}(N^2)$ space. For example, if the
time to compute $f(x)$ for $x \in \mathbb{R}^N$ is linear, i.e.,
$\mathcal{O}(N),$ then the Hessian $\nabla \nabla^{\top} f(x)$ can be
computed in $\mathcal{O}(N^2).$
## Higher-order nesting of reverse in forward mode
Applying the adjoint method to forward mode nested in forward mode
leads to third derivatives. Suppose $f:\mathbb{R}^N \rightarrow
\mathbb{R}$ and $x \in \mathbb{R}^N$. Nested forward mode computes
$\acute{\dot{y}} = \frac{\partial^2}{\partial x_m \partial x_n} f(x).$
Applying the adjoint method to this result produces third derivatives,
$$
\overline{\acute{\dot{u}}}
=
\nabla \frac{\partial^2}{\partial x_m \partial x_n} f(x)
=
\begin{bmatrix}
\frac{\partial^3}{\partial x_1 \partial x_m \partial x_n} f(x)
& \cdots &
\frac{\partial^3}{\partial x_N \partial x_m \partial x_n} f(x)
\end{bmatrix}.
$$
Computing all third-order derivatives of $f$ at $x$ requires
$\mathcal{O}(N^2)$ evaluations of $f(x)$, each using reverse mode
nested in forward mode further nested in forward mode. Thus if
evaluating $f$ is $\mathcal{O}(g(N))$ then evaluating all third order
derivatives requires $\mathcal{O}(g(N) \cdot N^2)$ time.