-
Notifications
You must be signed in to change notification settings - Fork 18
/
Copy pathmatrix-derivatives.Rmd
347 lines (318 loc) · 10.3 KB
/
matrix-derivatives.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
# Matrix Derivatives
If $C = f(A, B)$ where $A,$ $B,$ and $C$ are matrices (or vectors or scalars),
the chain rule remains the same,
$$
\textrm{d}C
= \frac{\partial f(A, B)}{\partial A} \cdot \textrm{d}{A}
+ \frac{\partial f(A, B)}{\partial B} \cdot \textrm{d}{B}.
$$
The total differential notation $\textrm{d}C$ may be understood by
plugging in a scalar $x$ with respect which to differentiate,
$$
\frac{\textrm{d}C}{\textrm{d}x}
= \frac{\partial f(A, B)}{\partial A} \cdot \frac{\textrm{d}{A}}{\textrm{d}x}
+ \frac{\partial f(A, B)}{\partial B} \cdot \frac{\textrm{d}{B}}{\textrm{d}x}.
$$
In the general case, if $C = f(A_1, \ldots, A_N),$ then
$$
\textrm{d}C
= \sum_{n=1}^N \frac{\partial f(A_1, \ldots, A_N)}{\partial A_n} \textrm{d}{A_n}.
$$
If $C$ is an $K \times L$ matrix and $A$ is a $M \times N$ matrix,
then the Jacobian $\frac{\partial C}{\partial A}$ has dimensions $(K
\times L) \times (M \times N).$ These results may be construed as
involving standard vectors and Jacobians by collapsing all matrices to
vectors. Or they may be read directly by raising the type of
operations and multiplying the $(K \times L) \times (M \times N)$
Jacobian by the $M \times N$ matrix differential.^[In the automatic
differentiation literature in computer science, this is sometimes
called a "tensor" product, where "tensor" just means array of
arbitrary dimensionality and should not be confused with the tensor
calculus used in physics.]
## Forward mode
The definitions of values and tangents remain the same when moving to
vector or matrix functions. As with scalars, the tangent of a matrix
(or vector) $U$ is defined relative to a scalar variable $x$ as
$$
\dot{U} = \frac{\partial U}{\partial x}.
$$
This derivative is defined elementwise, with
$$
\dot{U}_{i, j} =\frac{\partial U_{i, j}}{\partial x}.
$$
Forward mode automatic differentiation for matrices follows the chain
rule,
$$
\frac{\partial C}{\partial x}
= \frac{\partial C}{\partial A} \cdot \frac{\partial A}{\partial x}
+ \frac{\partial C}{\partial B} \cdot \frac{\partial B}{\partial x},
$$
which, using tangent notaton, yields
$$
\dot{C}
= \frac{\partial C}{\partial A} \cdot \dot{A}
+ \frac{\partial C}{\partial B} \cdot \dot{B}.
$$
In general, if $C =
f(A_1, \ldots, A_N),$ then
$$
\dot{C} = \sum_{n=1}^N \frac{\partial C}{\partial A_n} \cdot
\dot{A_n}.
$$
As with forward-mode autodiff for scalars, this matrix derivative rule is
straightforward to work out and to implement.
The tangent rules for matrix operations carry over neatly from the
scalar case. For example, $C = A + B$ is the sum of two matrices, the
corresponding tangent rule is
$$
\dot{C} = \dot{A} + \dot{B}.
$$
Here and throughout, matrices used in arithmetic operations will be
assumed to conform to the required shape and size constraints in the
expressions in which they are used. For $A + B$ to be well formed,
$A$ and $B$ must both be $M \times N$ matrices (i.e., they must have
the same number of rows and columns).
Similarly, if $C = A \cdot B$ is the product of two matrices, the
tangent rule is the same as that for scalars,
$$
\dot{C} = \dot{A} \cdot B + A \cdot \dot{B}.
$$
Simple tangent rules exist for many linear algebra operations, such as
inverse. If $C = A^{-1}$, then the tangent rule is
$$
\dot{C} = -C \cdot \dot{A} \cdot C.
$$
Results such as these are derived through algebraic manipulation and
differentiation (see \cite{giles:2008b} for general rules). For
inverse, because
$$
C \cdot A = A^{-1} \cdot A = \textrm{I}.
$$
Differentiating both sides yields
$$
\frac{\partial}{\partial x} C \cdot A
=
\frac{\partial}{\partial x} \textrm{I}.
$$
Replacing with dot notation yields
$$
\dot{C} \cdot A + C \cdot \dot{A} = 0.
$$
Rearranging the terms produces
$$
\dot{C} \cdot A = - C \cdot \dot{A}.
$$
Multiplying both sides of the equation on the right by $A^{-1}$ gives
$$
\dot{C} \cdot A \cdot A^{-1} = -C \cdot \dot{A} \cdot A^{-1}.
$$
This reduces to the final simplified form
$$
\dot{C} = -C \cdot \dot{A} \cdot C,
$$
after dropping the factor $A \cdot A^{-1} = \textrm{I}$ and replacing
$A^{-1}$ with its value $C$.
## Reverse mode
Using the same adjoint notation as for scalars, if $U$ is an $M \times
N$ matrix involved in the computation of a final result $y$, then
$$
\overline{U} = \frac{\partial y}{\partial U},
$$
with entries defined elementwise by
$$
\overline{U}_{i, j}
= \frac{\partial y}{\partial U}[i, j]
= \frac{\partial y}{\partial U_{i, j}}.
$$
The definition applies to vectors if $N = 1$ and row vectors if $M =
1.$
The adjoint method can be applied to matrix or vector functions in the
same way as to scalar functions. Suppose there is a final scalar
result variable $y$ and along the way to computing $y$, the matrix (or
vector) $A$ is used exactly once, appearing only in the subexpression
$B = f(\ldots, A, \ldots).$ By the chain rule,^[The terms in this
equality can be read as vector derivatives by flattening the matrices.
If $A$ is an $M \times N$ matrix and $B$ is a $K \times L$ matrix,
then $\displaystyle \frac{\partial y}{\partial A}$ is a vector of size
$M \cdot N$, $\displaystyle \frac{\partial B}{\partial A}$ is matrix
of size $(K \cdot L) \times (M \cdot N),$ and $\displaystyle
\frac{\partial y}{\partial B}$ is a vector if size $K \cdot L$. After
transposition, the right-hand side is a product of an $(M \cdot N)
\times (K \cdot L)$ matrix and a vector of size $K \cdot L$, yielding
a vector of size $M \cdot N,$ as found on the left-hand side.]
$$
\frac{\partial y}
{\partial A}
= \left(
\frac{\partial B}
{\partial A}
\right)^{\top}
\cdot
\frac{\partial y}
{\partial B}
= \textrm{J}^{\top}_f(A) \cdot \frac{\partial y}{\partial B},
$$
where the Jacobian function $\textrm{J}_f$ is generalized to matrices
by^[Matrix Jacobians may be understood by generalizing definitions to matrices or by flattening.]
$$
\textrm{J}_f(U) = \frac{\partial f(U)}{\partial U}.
$$
Rewriting using adjoint notation,
$$
\overline{A}
= \textrm{J}_f^{\top}(A) \cdot \overline{B},
$$
or in transposed form,
$$
\overline{A}^{\top}
= \overline{B}^{\top} \cdot \textrm{J}_f(A).
$$
The adjoint of an operand is the product of the Jacobian of the
function and adjoint of the result.
An expression $A$ may be used as an operand in multiple expressions
involved in the computation of $y.$ As with scalars, the adjoints need
to be propagated from each result, leading to the fundamental matrix
autodiff rule for a subexpression $B = f(\ldots, A, \ldots)$ involved
in the computation of $y,$
$$
\overline{A}
\ \ {\small +}{=} \ \
\textrm{J}_f^{\top}(A) \cdot \overline{B}.
$$
## Trace algebra
The Jacobian of a function with an $N \times M$ matrix operand and $K
\times L$ matrix result has $N \cdot M \cdot K \cdot L$ elements, one
for each derivative of an output with respect to an input. This
makes it prohibitively expensive in terms of both memory and
computation to store and multiply Jacobians explicitly. Instead,
algebra is used to reduce adjoint computations to manageable sizes.
Suppose the $M \times N$ matrix $C$ is used in the computation of
$y$. By the chain rule,
$$
\begin{array}{rcl}
\textrm{d}y
& = & \sum_{n = 1}^N \sum_{m = 1}^N
\frac{\partial y}{\partial C_{m, n}} \cdot \textrm{d}C_{m, n}
\\[4pt]
& = & \sum_{n = 1}^N \sum_{m = 1}^N
\overline{C}_{m, n} \cdot \textrm{d}C_{m, n}
\\[4pt]
& = & \sum_{n = 1}^N \sum_{m = 1}^N
\overline{C}^{\top}_{n, m} \cdot \textrm{d}C_{m, n}
\\[4pt]
& = & \sum_{n = 1}^N
\overline{C}^{\top}_{n, \cdot} \cdot \textrm{d}C_{\cdot, n}
\\[4pt]
& = & \textrm{Tr}(\overline{C}^{\top} \cdot \textrm{d}C),
\end{array}
$$
where the notation $U_{m, \cdot}$ indicates the $m$-th row of $U$,
$U_{\cdot, m}$ the $m$-th column of $U$, and $\textrm{Tr}(U)$ the
trace of an $N \times N$ square matrix $U$ defined by
$$
\textrm{Tr}(U) = \sum_{n = 1}^N U_{n, n}.
$$
Clarifying that differentiation is with respect to a distinguished
scalar variable $x$ on both sides of the above equation yields
$$
\frac{\textrm{d} y}{\textrm{d} x}
=
\frac{\textrm{d}}{\textrm{d} x}
= \textrm{Tr}\left(
\overline{C}^{\top}
\cdot \frac{\textrm{d}C}{\textrm{d}x}
\right).
$$
Suppose that $C = f(A, B)$ is a matrix function. As noted at the
beginning of this chapter, the chain rule yields
$$
\textrm{d}C
= \frac{\partial f(A, B)}{\partial A} \textrm{d}A
+ \frac{\partial f(A, B)}{\partial B} \textrm{d}B.
$$
Using the result of the previous section and substituting the
right-hand side above for $\textrm{d}C$,
$$
\begin{array}{rcl}
\textrm{d}y
& = & \textrm{Tr}(\overline{C}^{\top} \cdot \textrm{d}C)
\\[4pt]
& = &
\textrm{Tr}\left(
\overline{C}^{\top}
\cdot \left( \frac{\partial f(A, B)}{\partial A} \textrm{d}A
+ \frac{\partial f(A, B)}{\partial B} \textrm{d}B
\right)
\right)
\\[4pt]
& = &
\textrm{Tr}\left(
\overline{C}^{\top}
\cdot \frac{\partial f(A, B)}{\partial A} \textrm{d}A
+
\overline{C}^{\top}
\cdot \frac{\partial f(A, B)}{\partial B} \textrm{d}B
\right)
\\[4pt]
& = &
\textrm{Tr}\left(
\overline{C}^{\top}
\cdot \frac{\partial f(A, B)}{\partial A} \textrm{d}A
\right)
+ \textrm{Tr}\left(
\overline{C}^{\top}
\cdot \frac{\partial f(A, B)}{\partial B} \textrm{d}B
\right).
\end{array}
$$
Recall the transposed form of the adjoint rule,
$$
\overline{A}^{\top}
= \overline{C}^{\top}
\cdot \frac{\partial C}{\partial A},
$$
and similarly for $\overline{B}^{\top}$. Plugging that into the final
line of the previous derivation yields the final form of the
trace rule for matrix functions $C = f(A, B),$
$$
\textrm{Tr}( \overline{C}^{\top} \cdot \textrm{d}C )
=
\textrm{Tr}( \overline{A}^{\top} \cdot \textrm{d}A )
+ \textrm{Tr}( \overline{B}^{\top} \cdot \textrm{d}B ).
$$
This can be generalized to functions of one or more arguments in the
obvious way.
## Examples
For example, if $C = A + B$, then
$$
\frac{\partial C}{\partial A} = 1
\qquad
\frac{\partial C}{\partial B} = 1,
$$
and hence the adjoint rules are
$$
\overline{A} \ \ {\small +}{=} \ \ \overline{C}
$$
and
$$
\overline{B} \ \ {\small +}{=} \ \ \overline{C}.
$$
In the more interesting case of multiplication, with $C = A \cdot B$,
$$
\frac{\partial C}{\partial A} = B,
$$
and
$$
\frac{\partial C}{\partial B} = A,
$$
leading to adjoint rules
$$
\overline{A} \ \ {\small +}{=} \ \ \overline{C} \cdot B^{\top}
$$
and
$$
\overline{B} \ \ {\small +}{=} \ \ A^{\top} \cdot \overline{C}.
$$
## References {-}
The section on trace algebra fills in the steps of the derivation presented in
[@giles:2008; -@giles:2008b].