fix typo and add function parameter #5

Open: wants to merge 2 commits into base: master
4 changes: 2 additions & 2 deletions posts/2018-10-13-flow-models/index.html
@@ -474,7 +474,7 @@ <h2 id="realnvp">RealNVP<a hidden class="anchor" aria-hidden="true" href="#realn
 $$
 </div>
 <p>So far, the affine coupling layer looks perfect for constructing a normalizing flow :)</p>
-<p>Even better, since (i) computing $f^-1$ does not require computing the inverse of $s$ or $t$ and (ii) computing the Jacobian determinant does not involve computing the Jacobian of $s$ or $t$, those functions can be <em>arbitrarily complex</em>; i.e. both $s$ and $t$ can be modeled by deep neural networks.</p>
+<p>Even better, since (i) computing $f^{-1}$ does not require computing the inverse of $s$ or $t$ and (ii) computing the Jacobian determinant does not involve computing the Jacobian of $s$ or $t$, those functions can be <em>arbitrarily complex</em>; i.e. both $s$ and $t$ can be modeled by deep neural networks.</p>
 <p>In one affine coupling layer, some dimensions (channels) remain unchanged. To make sure all the inputs have a chance to be altered, the model reverses the ordering in each layer so that different components are left unchanged. Following such an alternating pattern, the set of units which remain identical in one transformation layer are always modified in the next. Batch normalization is found to help training models with a very deep stack of coupling layers.</p>
 <p>Furthermore, RealNVP can work in a multi-scale architecture to build a more efficient model for large inputs. The multi-scale architecture applies several &ldquo;sampling&rdquo; operations to normal affine layers, including spatial checkerboard pattern masking, squeezing operation, and channel-wise masking. Read the <a href="https://arxiv.org/abs/1605.08803">paper</a> for more details on the multi-scale architecture.</p>
 <h2 id="nice">NICE<a hidden class="anchor" aria-hidden="true" href="#nice">#</a></h2>
@@ -501,7 +501,7 @@ <h2 id="glow">Glow<a hidden class="anchor" aria-hidden="true" href="#glow">#</a>
 <p>It performs an affine transformation using a scale and bias parameter per channel, similar to batch normalization, but works for mini-batch size 1. The parameters are trainable but initialized so that the first minibatch of data have mean 0 and standard deviation 1 after actnorm.</p>
 <p>Substep 2: <strong>Invertible 1x1 conv</strong></p>
 <p>Between layers of the RealNVP flow, the ordering of channels is reversed so that all the data dimensions have a chance to be altered. A 1×1 convolution with equal number of input and output channels is <em>a generalization of any permutation</em> of the channel ordering.</p>
-<p>Say, we have an invertible 1x1 convolution of an input $h \times w \times c$ tensor $\mathbf{h}$ with a weight matrix $\mathbf{W}$ of size $c \times c$. The output is a $h \times w \times c$ tensor, labeled as $f = \texttt{conv2d}(\mathbf{h}; \mathbf{W})$. In order to apply the change of variable rule, we need to compute the Jacobian determinant $\vert \det\partial f / \partial\mathbf{h}\vert$.</p>
+<p>Say, we have an invertible 1x1 convolution of an input $h \times w \times c$ tensor $\mathbf{h}$ with a weight matrix $\mathbf{W}$ of size $c \times c$. The output is a $h \times w \times c$ tensor, labeled as $f(\mathbf{h}) = \texttt{conv2d}(\mathbf{h}; \mathbf{W})$. In order to apply the change of variable rule, we need to compute the Jacobian determinant $\vert \det\partial f / \partial\mathbf{h}\vert$.</p>
 <p>Both the input and output of 1x1 convolution here can be viewed as a matrix of size $h \times w$. Each entry $\mathbf{x}_{ij}$ ($i=1,\dots,h, j=1,\dots,w$) in $\mathbf{h}$ is a vector of $c$ channels and each entry is multiplied by the weight matrix $\mathbf{W}$ to obtain the corresponding entry $\mathbf{y}_{ij}$ in the output matrix respectively. The derivative of each entry is $\partial \mathbf{x}_{ij} \mathbf{W} / \partial\mathbf{x}_{ij} = \mathbf{W}$ and there are $h \times w$ such entries in total:</p>
 <div>
 $$
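The second changed line names the output $f(\mathbf{h})$ explicitly. As a numerical sanity check of the Jacobian argument this hunk leads into (an illustration, not the Glow implementation; a 1x1 convolution with $c$ input and $c$ output channels reduces to a per-pixel channel-mixing matmul): the full Jacobian is block-diagonal with $h \times w$ copies of $\mathbf{W}$, so $\log \vert \det \partial f / \partial \mathbf{h} \vert = h \cdot w \cdot \log \vert \det \mathbf{W} \vert$.

    import numpy as np

    h, w, c = 4, 4, 3
    W = np.random.randn(c, c)            # random c x c weight; invertible w.p. 1
    x = np.random.randn(h, w, c)

    # A 1x1 conv is a matrix multiply applied independently at every pixel.
    def conv1x1(x, W):
        return x @ W

    y = conv1x1(x, W)
    assert np.allclose(conv1x1(y, np.linalg.inv(W)), x)   # invertible

    # The Jacobian of the flattened map is block-diagonal: h*w copies of W,
    # so its log|det| is h * w * log|det W|.
    J = np.kron(np.eye(h * w), W)
    assert np.isclose(np.linalg.slogdet(J)[1], h * w * np.linalg.slogdet(W)[1])

This is why the exact log-det is affordable here: it costs one $c \times c$ determinant rather than an $(hwc) \times (hwc)$ one.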
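Similarly, for the actnorm substep in this hunk's context, here is a minimal sketch of the data-dependent initialization (a hypothetical flattened (batch, channels) version, not the actual Glow code). The scale and bias are ordinary trainable parameters; only their initial values come from the first minibatch.

    import numpy as np

    class ActNorm:
        def __init__(self):
            self.initialized = False

        def forward(self, x):   # x: (batch, channels)
            if not self.initialized:
                # Data-dependent init: first batch comes out zero-mean, unit-std.
                self.bias = -x.mean(axis=0)
                self.scale = 1.0 / (x.std(axis=0) + 1e-6)
                self.initialized = True
            y = (x + self.bias) * self.scale
            log_det = np.sum(np.log(np.abs(self.scale)))   # per-sample log-det
            return y, log_det

    x = np.random.randn(32, 3) * 5.0 + 2.0
    y, _ = ActNorm().forward(x)
    assert np.allclose(y.mean(axis=0), 0, atol=1e-6)
    assert np.allclose(y.std(axis=0), 1, atol=1e-4)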