
Commit 45101e0

zuhengxu, cpfiffer, and sunxd3 authored

Documentation (#27)
* add NF intro
* set up doc files
* add gitignore
* minor update to readme
* update home page
* update docs for each funciton
* update docs
* src
* update function docs
* update docs
* fix readme math rendering issue
* update docs
* update example doc
* update customize layer docs
* finish docs
* finish docs
* Update README.md — Co-authored-by: Cameron Pfiffer <[email protected]>
* Update README.md — Co-authored-by: Cameron Pfiffer <[email protected]>
* Update README.md — Co-authored-by: Cameron Pfiffer <[email protected]>
* Update docs/src/index.md — Co-authored-by: Xianda Sun <[email protected]>
* Update README.md — Co-authored-by: Xianda Sun <[email protected]>
* Update docs/src/index.md — Co-authored-by: Xianda Sun <[email protected]>
* Update docs/src/customized_layer.md — Co-authored-by: Xianda Sun <[email protected]>
* Update docs/src/customized_layer.md — Co-authored-by: Xianda Sun <[email protected]>
* Update docs/src/customized_layer.md — Co-authored-by: Xianda Sun <[email protected]>
* Update docs/src/customized_layer.md — Co-authored-by: Xianda Sun <[email protected]>
* Update docs/src/customized_layer.md — Co-authored-by: Xianda Sun <[email protected]>
* Update docs/src/index.md — Co-authored-by: Xianda Sun <[email protected]>
* Update docs/src/index.md — Co-authored-by: Xianda Sun <[email protected]>
* Update docs/src/index.md — Co-authored-by: Xianda Sun <[email protected]>
* Update docs/src/example.md — Co-authored-by: Xianda Sun <[email protected]>
* Update docs/src/example.md — Co-authored-by: Xianda Sun <[email protected]>
* Update docs/src/example.md — Co-authored-by: Xianda Sun <[email protected]>
* Update docs/src/example.md — Co-authored-by: Xianda Sun <[email protected]>
* Update docs/src/example.md — Co-authored-by: Xianda Sun <[email protected]>
* Update docs/src/example.md — Co-authored-by: Xianda Sun <[email protected]>
* Update docs/src/example.md — Co-authored-by: Xianda Sun <[email protected]>
* Update docs/src/customized_layer.md — Co-authored-by: Xianda Sun <[email protected]>
* Update docs/src/customized_layer.md — Co-authored-by: Xianda Sun <[email protected]>
* Update docs/src/customized_layer.md — Co-authored-by: Xianda Sun <[email protected]>
* Update docs/src/customized_layer.md — Co-authored-by: Xianda Sun <[email protected]>
* minor ed
* minor ed to fix latex issue
* minor update

Co-authored-by: Cameron Pfiffer <[email protected]>
Co-authored-by: Xianda Sun <[email protected]>
1 parent 8f4371d commit 45101e0

File tree

14 files changed (+579, -13 lines changed)

README.md

Lines changed: 86 additions & 0 deletions
@@ -2,3 +2,89 @@

[![Dev](https://img.shields.io/badge/docs-dev-blue.svg)](https://turinglang.github.io/NormalizingFlows.jl/dev/)
[![Build Status](https://github.com/TuringLang/NormalizingFlows.jl/actions/workflows/CI.yml/badge.svg?branch=main)](https://github.com/TuringLang/NormalizingFlows.jl/actions/workflows/CI.yml?query=branch%3Amain)

A normalizing flow library for Julia.

The purpose of this package is to provide a simple and flexible interface for variational inference (VI) and normalizing flows (NF) for Bayesian computation or generative modeling.
The key focus is to ensure modularity and extensibility, so that users can easily
construct (e.g., define customized flow layers) and combine various components
(e.g., choose different VI objectives or gradient estimates)
for variational approximation of general target distributions,
without being tied to specific probabilistic programming frameworks or applications.

See the [documentation](https://turinglang.org/NormalizingFlows.jl/dev/) for more.

## Installation
To install the package, run the following command in the Julia REPL:
```julia
] # enter Pkg mode
(@v1.9) pkg> add git@github.com:TuringLang/NormalizingFlows.jl.git
```
Then simply run the following command to use the package:
```julia
using NormalizingFlows
```

## Quick recap of normalizing flows
Normalizing flows transform a simple reference distribution $q_0$ (sometimes known as the base distribution) into
a complex distribution $q$ using invertible functions.

In more detail, given the base distribution, usually a standard Gaussian distribution, i.e., $q_0 = \mathcal{N}(0, I)$,
we apply a series of parameterized invertible transformations (called flow layers), $T_{1, \theta_1}, \cdots, T_{N, \theta_N}$, yielding
```math
Z_N = T_{N, \theta_N} \circ \cdots \circ T_{1, \theta_1} (Z_0) , \quad Z_0 \sim q_0,\quad Z_N \sim q_{\theta},
```
where $\theta = (\theta_1, \dots, \theta_N)$ are the parameters to be learned, and $q_{\theta}$ is the variational distribution (flow distribution). This describes the **sampling procedure** of a normalizing flow: drawing from $q_\theta$ only requires sending draws from $q_0$ through a forward pass of the flow layers.
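
To make this concrete, here is a minimal sketch of the sampling procedure using `Bijectors.jl` (see the related packages below); the two layers and their parameter values are arbitrary choices for illustration:
```julia
using Distributions, Bijectors

q0 = MvNormal(zeros(2), ones(2))           # base distribution q_0
T1 = Bijectors.Shift([1.0, -1.0])          # flow layer T_1 with parameters θ_1
T2 = Bijectors.Scale([0.5, 2.0])           # flow layer T_2 with parameters θ_2
flow = Bijectors.transformed(q0, T2 ∘ T1)  # the flow distribution q_θ
z = rand(flow, 5)                          # sampling = forward pass through T_1, then T_2
```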

Since all the transformations are invertible (technically, [diffeomorphic](https://en.wikipedia.org/wiki/Diffeomorphism)), we can evaluate the density of a normalizing flow distribution $q_{\theta}$ by the change of variables formula:
```math
q_\theta(x)=\frac{q_0\left(T_1^{-1} \circ \cdots \circ T_N^{-1}(x)\right)}{\prod_{n=1}^N J_n\left(T_n^{-1} \circ \cdots \circ T_N^{-1}(x)\right)}, \quad J_n(x)=\left|\operatorname{det} \nabla_x T_n(x)\right|.
```
Here we drop the subscripts $\theta_n, n = 1, \dots, N$ for simplicity.
Density evaluation of a normalizing flow therefore requires computing the **inverse** and the
**Jacobian determinant** of each flow layer.
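
Continuing the sketch above, the density of `flow` can be evaluated with `logpdf`, which performs this inverse pass under the hood; the by-hand version below assumes the `TransformedDistribution` fields `dist` and `transform`:
```julia
using Bijectors: with_logabsdet_jacobian, inverse

x = rand(flow)    # a draw from q_θ
logpdf(flow, x)   # log q_θ(x) via the change of variables formula

# the same computation spelled out:
z0, logjac = with_logabsdet_jacobian(inverse(flow.transform), x)  # inverse pass + log |det Jacobian|
logpdf(flow.dist, z0) + logjac                                    # log q_0(z0) + log |det ∇ T^{-1}(x)|
```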

Given the feasibility of i.i.d. sampling and density evaluation, normalizing flows can be trained by minimizing a statistical distance to the target distribution $p$. Typical choices are the forward and reverse Kullback-Leibler (KL) divergences, which lead to the following optimization problems:
```math
\begin{aligned}
\text{Reverse KL:}\quad
&\argmin _{\theta} \mathbb{E}_{q_{\theta}}\left[\log q_{\theta}(Z)-\log p(Z)\right] \\
&= \argmin _{\theta} \mathbb{E}_{q_0}\left[\log \frac{q_\theta(T_N\circ \cdots \circ T_1(Z_0))}{p(T_N\circ \cdots \circ T_1(Z_0))}\right] \\
&= \argmax _{\theta} \mathbb{E}_{q_0}\left[ \log p\left(T_N \circ \cdots \circ T_1(Z_0)\right)-\log q_0(Z_0)+\sum_{n=1}^N \log J_n\left(T_n \circ \cdots \circ T_1(Z_0)\right)\right]
\end{aligned}
```
and
```math
\begin{aligned}
\text{Forward KL:}\quad
&\argmin _{\theta} \mathbb{E}_{p}\left[\log q_{\theta}(Z)-\log p(Z)\right] \\
&= \argmax _{\theta} \mathbb{E}_{p}\left[\log q_\theta(Z)\right]
\end{aligned}
```
Both problems can be solved via standard stochastic optimization algorithms,
such as stochastic gradient descent (SGD) and its variants.
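
For intuition, here is a hedged sketch (not the estimator implemented in this package) of a naive Monte Carlo estimate of the ELBO, the objective whose maximization corresponds to the reverse KL problem above; it reuses the `flow` from the earlier sketch:
```julia
using Statistics: mean
using Bijectors: with_logabsdet_jacobian

# `flow` is a Bijectors.TransformedDistribution; `logp` is the (possibly unnormalized) target log-density
function naive_elbo(flow, logp; n_samples=10)
    vals = map(1:n_samples) do _
        z0 = rand(flow.dist)                                     # Z_0 ~ q_0
        x, logjac = with_logabsdet_jacobian(flow.transform, z0)  # forward pass through the layers
        logp(x) - (logpdf(flow.dist, z0) - logjac)               # log p(x) - log q_θ(x)
    end
    return mean(vals)
end

naive_elbo(flow, x -> -sum(abs2, x) / 2)  # e.g., a standard-normal target (up to a constant)
```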

Reverse KL minimization is typically used for **Bayesian computation**, where one
wants to approximate a posterior distribution $p$ that is only known up to a
normalizing constant.
In contrast, forward KL minimization is typically used for **generative modeling**, where one has access to samples from a complex distribution $p$ (but not necessarily its density) and wants to learn a flow that approximates it.

## Current status and TODOs

- [x] general interface development
- [x] documentation
- [ ] including more flow examples
- [ ] GPU compatibility
- [ ] benchmarking

## Related packages
- [Bijectors.jl](https://github.com/TuringLang/Bijectors.jl): a package for defining bijective transformations, which can be used for defining customized flow layers.
- [Flux.jl](https://fluxml.ai/Flux.jl/stable/)
- [Optimisers.jl](https://github.com/FluxML/Optimisers.jl)
- [AdvancedVI.jl](https://github.com/TuringLang/AdvancedVI.jl)

docs/.gitignore

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
build/
site/

docs/make.jl

Lines changed: 6 additions & 1 deletion
@@ -10,7 +10,12 @@ makedocs(;
    repo="https://github.com/TuringLang/NormalizingFlows.jl/blob/{commit}{path}#{line}",
    sitename="NormalizingFlows.jl",
    format=Documenter.HTML(),
-   pages=["Home" => "index.md"],
+   pages=[
+       "Home" => "index.md",
+       "API" => "api.md",
+       "Example" => "example.md",
+       "Customize your own flow layer" => "customized_layer.md",
+   ],
)

deploydocs(; repo="github.com/TuringLang/NormalizingFlows.jl", devbranch="main")

docs/src/api.md

Lines changed: 93 additions & 0 deletions
@@ -0,0 +1,93 @@
## API

```@index
```

## Main Function

```@docs
NormalizingFlows.train_flow
```

The flow object can be constructed with the `transformed` function from the `Bijectors.jl` package.
For example, for Gaussian VI, we can construct the flow as follows:
```@julia
using Distributions, Bijectors
T = Float32
q₀ = MvNormal(zeros(T, 2), ones(T, 2))
flow = Bijectors.transformed(q₀, Bijectors.Shift(zeros(T,2)) ∘ Bijectors.Scale(ones(T, 2)))
```
To train the Gaussian VI targeting the distribution $p$ via ELBO maximization, we can run
```@julia
using NormalizingFlows

sample_per_iter = 10
flow_trained, stats, _ = train_flow(
    elbo,
    flow,
    logp,
    sample_per_iter;
    max_iters=2_000,
    optimiser=Optimisers.ADAM(0.01 * one(T)),
)
```
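Note that this snippet assumes that `Optimisers` is loaded and that a target log-density `logp` is in scope; a hypothetical stand-in could be:
```julia
using Optimisers
logp(x) = -sum(abs2, x) / 2  # stand-in target: an (unnormalized) standard normal in 2 dimensions
```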

## Variational Objectives
We have implemented two variational objectives, namely, the ELBO and the log-likelihood objective.
Users can also define their own objective function and pass it to the [`train_flow`](@ref) function.
`train_flow` will optimize the flow parameters by maximizing `vo`.
The objective function should take the following general form:
```julia
vo(rng, flow, args...)
```
where `rng` is the random number generator, `flow` is the flow object, and `args...` are the
additional arguments that users can pass to the objective function.
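
For instance, a custom objective following this signature might look like the hedged sketch below (not the package's built-in `elbo`); here `logp` and `n_samples` are the extra arguments, and the field names `flow.dist` and `flow.transform` refer to the `Bijectors.TransformedDistribution` layout:
```julia
using Random, Distributions, Statistics
using Bijectors: with_logabsdet_jacobian

function my_objective(rng::AbstractRNG, flow, logp, n_samples)
    vals = map(1:n_samples) do _
        z0 = rand(rng, flow.dist)                                # draw from the base distribution q0
        x, logjac = with_logabsdet_jacobian(flow.transform, z0)  # push the draw through the flow
        logp(x) - (logpdf(flow.dist, z0) - logjac)               # an ELBO-style integrand
    end
    return mean(vals)
end
```
It could then be passed to the training interface as `train_flow(my_objective, flow, logp, n_samples; ...)`.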

#### Evidence Lower Bound (ELBO)
Maximizing the ELBO is equivalent to minimizing the reverse KL divergence between $q_\theta$ and $p$, i.e.,
```math
\begin{aligned}
&\min _{\theta} \mathbb{E}_{q_{\theta}}\left[\log q_{\theta}(Z)-\log p(Z)\right] \quad \text{(Reverse KL)}\\
& = \max _{\theta} \mathbb{E}_{q_0}\left[ \log p\left(T_N \circ \cdots \circ T_1(Z_0)\right)-\log q_0(Z_0)+\sum_{n=1}^N \log J_n\left(T_n \circ \cdots \circ T_1(Z_0)\right)\right] \quad \text{(ELBO)}
\end{aligned}
```
Reverse KL minimization is typically used for **Bayesian computation**,
where one only has access to the log-(unnormalized) density of the target distribution $p$ (e.g., a Bayesian posterior),
and hopes to generate approximate samples from it.

```@docs
NormalizingFlows.elbo
```

#### Log-likelihood

Maximizing the log-likelihood is equivalent to minimizing the forward KL divergence between $q_\theta$ and $p$, i.e.,
```math
\begin{aligned}
& \min_{\theta} \mathbb{E}_{p}\left[\log q_{\theta}(Z)-\log p(Z)\right] \quad \text{(Forward KL)} \\
& = \max_{\theta} \mathbb{E}_{p}\left[\log q_{\theta}(Z)\right] \quad \text{(Expected log-likelihood)}
\end{aligned}
```
Forward KL minimization is typically used for **generative modeling**,
where one is given a set of samples from the target distribution $p$ (e.g., images)
and aims to learn the density or a generative process that outputs high-quality samples.

```@docs
NormalizingFlows.loglikelihood
```
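
As an illustration (a hedged sketch, not necessarily the package's implementation), an empirical version of this objective over a hypothetical `d × n` data matrix could be:
```julia
using Statistics: mean

# average log-density of the flow over the columns of a data matrix sampled from p
empirical_loglikelihood(flow, data) = mean(logpdf(flow, data[:, i]) for i in 1:size(data, 2))
```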

## Training Loop

```@docs
NormalizingFlows.optimize
```

## Utility Functions for Taking Gradient
```@docs
NormalizingFlows.grad!
NormalizingFlows.value_and_gradient!
```

docs/src/banana.png

Binary file added (43.1 KB)

docs/src/comparison.png

Binary file added (40.1 KB)

docs/src/customized_layer.md

Lines changed: 180 additions & 0 deletions
@@ -0,0 +1,180 @@
# Defining Your Own Flow Layer

In practice, users might want to define their own normalizing flow.
As briefly noted in [What are normalizing flows?](@ref), the key is to define a
customized normalizing flow layer, including its transformation and inverse,
as well as the log-determinant of the Jacobian of the transformation.
`Bijectors.jl` offers a convenient interface to define a customized bijection.
We refer users to [the documentation of
`Bijectors.jl`](https://turinglang.org/Bijectors.jl/dev/transforms/#Implementing-a-transformation)
for more details.
`Flux.jl` is also a useful package, offering a convenient interface to define neural networks.

In this tutorial, we demonstrate how to define a customized normalizing flow
layer -- an `Affine Coupling Layer` (Dinh *et al.*, 2016) -- using `Bijectors.jl` and `Flux.jl`.

## Affine Coupling Flow

Given an input vector $\boldsymbol{x}$, the general *coupling transformation* splits it into two
parts: $\boldsymbol{x}_{I_1}$ and $\boldsymbol{x}_{I\setminus I_1}$. Only one
part (e.g., $\boldsymbol{x}_{I_1}$) undergoes a bijective transformation $f$, referred to as the *coupling law*,
based on the values of the other part (e.g., $\boldsymbol{x}_{I\setminus I_1}$), which remains unchanged.
```math
\begin{array}{llll}
c_{I_1}(\cdot ; f, \theta): & \mathbb{R}^d \rightarrow \mathbb{R}^d & c_{I_1}^{-1}(\cdot ; f, \theta): & \mathbb{R}^d \rightarrow \mathbb{R}^d \\
& \boldsymbol{x}_{I \backslash I_1} \mapsto \boldsymbol{x}_{I \backslash I_1} & & \boldsymbol{y}_{I \backslash I_1} \mapsto \boldsymbol{y}_{I \backslash I_1} \\
& \boldsymbol{x}_{I_1} \mapsto f\left(\boldsymbol{x}_{I_1} ; \theta\left(\boldsymbol{x}_{I\setminus I_1}\right)\right) & & \boldsymbol{y}_{I_1} \mapsto f^{-1}\left(\boldsymbol{y}_{I_1} ; \theta\left(\boldsymbol{y}_{I\setminus I_1}\right)\right)
\end{array}
```
Here $\theta$ can be an arbitrary function, e.g., a neural network.
As long as $f(\cdot; \theta(\boldsymbol{x}_{I\setminus I_1}))$ is invertible, $c_{I_1}$ is invertible, and the
Jacobian determinant of $c_{I_1}$ is easy to compute:
```math
\left|\text{det} \nabla_x c_{I_1}(x)\right| = \left|\text{det} \nabla_{x_{I_1}} f(x_{I_1}; \theta(x_{I\setminus I_1}))\right|
```

The affine coupling layer is a special case of the coupling transformation, where the coupling law $f$ is an affine function:
```math
\begin{aligned}
\boldsymbol{x}_{I_1} &\mapsto \boldsymbol{x}_{I_1} \odot s\left(\boldsymbol{x}_{I\setminus I_1}\right) + t\left(\boldsymbol{x}_{I \setminus I_1}\right) \\
\boldsymbol{x}_{I \backslash I_1} &\mapsto \boldsymbol{x}_{I \backslash I_1}
\end{aligned}
```
Here, $s$ and $t$ are arbitrary functions (often neural networks) called the "scaling" and "translation" functions, respectively.
They produce vectors of the
same dimension as $\boldsymbol{x}_{I_1}$.

## Implementing Affine Coupling Layer

We start by defining a simple 3-layer multi-layer perceptron (MLP) using `Flux.jl`,
which will be used to define the scaling function $s$ and the translation function $t$ in the affine coupling layer.
```@example afc
using Flux

function MLP_3layer(input_dim::Int, hdims::Int, output_dim::Int; activation=Flux.leakyrelu)
    return Chain(
        Flux.Dense(input_dim, hdims, activation),
        Flux.Dense(hdims, hdims, activation),
        Flux.Dense(hdims, output_dim),
    )
end
```

#### Construct the Object

Following the user interface of `Bijectors.jl`, we define a struct `AffineCoupling` as a subtype of `Bijectors.Bijector`.
The functions `partition` and `combine` are used to partition a vector into 3 disjoint subvectors and to recombine them,
and `PartitionMask` is used to store this partition rule.
These three functions are
all defined in `Bijectors.jl`; see the [documentation](https://github.com/TuringLang/Bijectors.jl/blob/49c138fddd3561c893592a75b211ff6ad949e859/src/bijectors/coupling.jl#L3) for more details.

```@example afc
using Functors
using Bijectors
using Bijectors: partition, combine, PartitionMask

struct AffineCoupling <: Bijectors.Bijector
    dim::Int
    mask::Bijectors.PartitionMask
    s::Flux.Chain
    t::Flux.Chain
end

# To apply functions to the parameters contained in AffineCoupling.s and AffineCoupling.t,
# and to re-build the struct from the parameters, we use the functor interface of `Functors.jl`;
# see https://fluxml.ai/Flux.jl/stable/models/functors/#Functors.functor
@functor AffineCoupling (s, t)

function AffineCoupling(
    dim::Int,                 # dimension of the input
    hdims::Int,               # dimension of the hidden units for s and t
    mask_idx::AbstractVector, # indices of the dimensions to apply the transformation to
)
    cdims = length(mask_idx)  # dimension of the part used to construct the coupling law
    s = MLP_3layer(cdims, hdims, cdims)
    t = MLP_3layer(cdims, hdims, cdims)
    mask = PartitionMask(dim, mask_idx)
    return AffineCoupling(dim, mask, s, t)
end
```
By default, we define $s$ and $t$ using the `MLP_3layer` function, which is a
3-layer MLP with leaky ReLU activations.

#### Implement the Forward and Inverse Transformations

```@example afc
function Bijectors.transform(af::AffineCoupling, x::AbstractVector)
    # partition the vector using `af.mask::PartitionMask`
    x₁, x₂, x₃ = partition(af.mask, x)
    y₁ = x₁ .* af.s(x₂) .+ af.t(x₂)
    return combine(af.mask, y₁, x₂, x₃)
end

function Bijectors.transform(iaf::Inverse{<:AffineCoupling}, y::AbstractVector)
    af = iaf.orig
    # partition the vector using `af.mask::PartitionMask`
    y_1, y_2, y_3 = partition(af.mask, y)
    # inverse transformation
    x_1 = (y_1 .- af.t(y_2)) ./ af.s(y_2)
    return combine(af.mask, x_1, y_2, y_3)
end
```
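
A quick sanity check (a hedged sketch assuming the definitions above; the sizes `dim = 4`, `hdims = 16`, and the mask `1:2` are arbitrary choices) is to verify that the inverse undoes the forward transformation:
```julia
layer = AffineCoupling(4, 16, 1:2)  # acts on the first two of four dimensions

x = randn(Float32, 4)
y = Bijectors.transform(layer, x)                         # forward pass
x_rec = Bijectors.transform(Bijectors.inverse(layer), y)  # inverse pass
isapprox(x, x_rec; atol=1e-4)                             # should hold up to floating-point error
```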

#### Implement the Log-determinant of the Jacobian
Notice that here we wrap the transformation and the log-determinant of the Jacobian into a single function, `with_logabsdet_jacobian`.

```@example afc
function Bijectors.with_logabsdet_jacobian(af::AffineCoupling, x::AbstractVector)
    x_1, x_2, x_3 = Bijectors.partition(af.mask, x)
    y_1 = af.s(x_2) .* x_1 .+ af.t(x_2)
    # log-det Jacobian of the forward map: sum of log|s(x_2)|
    logjac = sum(log ∘ abs, af.s(x_2))
    return combine(af.mask, y_1, x_2, x_3), logjac
end

function Bijectors.with_logabsdet_jacobian(
    iaf::Inverse{<:AffineCoupling}, y::AbstractVector
)
    af = iaf.orig
    # partition the vector using `af.mask::PartitionMask`
    y_1, y_2, y_3 = partition(af.mask, y)
    # inverse transformation
    x_1 = (y_1 .- af.t(y_2)) ./ af.s(y_2)
    # log-det Jacobian of the inverse map: negative of the forward one
    logjac = -sum(log ∘ abs, af.s(y_2))
    return combine(af.mask, x_1, y_2, y_3), logjac
end
```
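
Similarly, a hedged check (reusing the `layer` constructed above) that the forward and inverse log-determinants cancel:
```julia
x = randn(Float32, 4)
y, logjac_fwd = Bijectors.with_logabsdet_jacobian(layer, x)
_, logjac_bwd = Bijectors.with_logabsdet_jacobian(Bijectors.inverse(layer), y)
isapprox(logjac_fwd + logjac_bwd, 0; atol=1e-5)  # should be true
```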

#### Construct Normalizing Flow

Now, with all the above implementations, we are ready to construct a normalizing flow
by composing several `AffineCoupling` layers and applying them to a base distribution $q_0$.

```@example afc
using Random, Distributions, LinearAlgebra

dim = 4
hdims = 10
Ls = [
    AffineCoupling(dim, hdims, 1:2),
    AffineCoupling(dim, hdims, 3:4),
    AffineCoupling(dim, hdims, 1:2),
    AffineCoupling(dim, hdims, 3:4),
]
ts = reduce(∘, Ls)
q₀ = MvNormal(zeros(Float32, dim), I)
flow = Bijectors.transformed(q₀, ts)
```
We can now sample from the flow:
```@example afc
x = rand(flow, 10)
```
And evaluate the density of the flow:
```@example afc
logpdf(flow, x[:,1])
```
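
Finally, the flow constructed here can be plugged into the training interface described in the API documentation. The following is a hedged sketch: the target `logp` is a stand-in standard-normal log-density, and the number of iterations and optimiser settings are arbitrary.
```julia
using NormalizingFlows, Optimisers

logp(x) = -sum(abs2, x) / 2  # hypothetical (unnormalized) target log-density

sample_per_iter = 10
flow_trained, stats, _ = train_flow(
    elbo,
    flow,
    logp,
    sample_per_iter;
    max_iters=1_000,
    optimiser=Optimisers.ADAM(0.01f0),
)
```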

## Reference
Dinh, L., Sohl-Dickstein, J. and Bengio, S., 2016. *Density estimation using Real NVP.* arXiv:1605.08803.

docs/src/elbo.png

Binary file added (22.7 KB)
