Topic that I encountered in my GenAI Models and Robotic Applications course at Twente.
Basic Concept
Normalizing flows exploit the change-of-variables rule: they begin with a simple initial distribution and apply a sequence of K invertible transforms to build up a new, more complex distribution.
They can also learn complex joint densities by decomposing the joint density into a product of one-dimensional conditional densities, where each dimension depends only on the previous values (much like in Markov chains):
$$p(\mathbf{x}) = \prod_{i=1}^{D} p(x_i \mid x_1, \ldots, x_{i-1})$$
Quick summary of the difference between GAN, VAE, and flow-based generative models
- Generative adversarial networks: a GAN cleverly reframes data generation, an unsupervised learning problem, as a supervised one. The discriminator model learns to distinguish real data from the fake samples produced by the generator model, and the two models are trained as if playing a minimax game.
- Variational autoencoders: a VAE implicitly optimizes the log-likelihood of the data by maximizing the evidence lower bound (ELBO).
- Flow-based generative models: a flow-based generative model is constructed from a sequence of invertible transformations. Unlike the other two, the model explicitly learns the data distribution, and the loss function is therefore simply the negative log-likelihood.
What is Normalizing Flow?
A normalizing flow learns an invertible transformation $f$ between data and latent variables:
$$\mathbf{x} = f(\mathbf{z}), \qquad \mathbf{z} = f^{-1}(\mathbf{x})$$
- $\mathbf{x} \sim p_X(\mathbf{x})$ is a data sample from the target distribution
- $\mathbf{z} \sim p_Z(\mathbf{z})$ is a latent variable sampled from the source (base) distribution

If we define $\mathbf{x} = f(\mathbf{z})$, then we have:
$$p_X(\mathbf{x}) = p_Z\!\left(f^{-1}(\mathbf{x})\right) \left| \det \frac{\partial f^{-1}(\mathbf{x})}{\partial \mathbf{x}} \right|$$
Intuitively, we can also write
$$p_X(\mathbf{x})\,|d\mathbf{x}| = p_Z(\mathbf{z})\,|d\mathbf{z}|$$
Explaining the Jacobian
$\frac{\partial f^{-1}(\mathbf{x})}{\partial \mathbf{x}}$ is the Jacobian of the model from x to z (the inverse direction).
We can write $p_X(\mathbf{x})$ in terms of the base probability density function (see the change-of-variables theorem):
$$p_X(\mathbf{x}) = p_Z\!\left(f^{-1}(\mathbf{x})\right) \left| \det \frac{\partial f^{-1}(\mathbf{x})}{\partial \mathbf{x}} \right|$$
Since $\mathbf{z} = f^{-1}(\mathbf{x})$, we have:
$$p_X(\mathbf{x}) = p_Z(\mathbf{z}) \left| \det \frac{\partial f^{-1}(\mathbf{x})}{\partial \mathbf{x}} \right| = p_Z(\mathbf{z}) \left| \det \frac{\partial f(\mathbf{z})}{\partial \mathbf{z}} \right|^{-1}$$
The Jacobian $\frac{\partial f^{-1}}{\partial \mathbf{x}}$ describes how the transformation maps from the data space ($\mathbf{x}$) to the latent space ($\mathbf{z}$).
Conversely, $\frac{\partial f}{\partial \mathbf{z}}$ maps from $\mathbf{z}$ to $\mathbf{x}$.
The restriction that the Jacobian determinant is non-zero ($\det \frac{\partial f(\mathbf{z})}{\partial \mathbf{z}} \neq 0$) ensures invertibility, which is why normalizing flows require bijective transformations.
Here, $\left| \det \frac{\partial f^{-1}(\mathbf{x})}{\partial \mathbf{x}} \right|$ indicates the ratio between the volumes of small rectangles defined in the two coordinate systems of the variables $\mathbf{x}$ and $\mathbf{z}$, respectively.
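To make the determinant factor concrete, here is a minimal numerical sketch (assuming NumPy/SciPy; the affine map $f(\mathbf{z}) = A\mathbf{z} + \mathbf{b}$ and the names `f`, `f_inv` are purely illustrative) that checks the change-of-variables formula against the analytic Gaussian density:

```python
import numpy as np
from scipy.stats import multivariate_normal

# A minimal numerical check of the change-of-variables formula for an
# affine map f(z) = A z + b (invertible because det(A) != 0).
rng = np.random.default_rng(0)
A = np.array([[2.0, 0.5],
              [0.0, 1.5]])          # invertible: det = 3.0
b = np.array([1.0, -2.0])

def f(z):          # forward map z -> x
    return z @ A.T + b

def f_inv(x):      # inverse map x -> z
    return (x - b) @ np.linalg.inv(A).T

# Base density p_Z: unit Gaussian; target density via change of variables:
# p_X(x) = p_Z(f^{-1}(x)) * |det d f^{-1}/dx|, where d f^{-1}/dx = A^{-1}.
p_Z = multivariate_normal(mean=np.zeros(2), cov=np.eye(2))
log_abs_det_Jinv = -np.log(abs(np.linalg.det(A)))   # log|det A^{-1}|

x = f(rng.standard_normal(2))                        # an arbitrary test point
p_X_flow = p_Z.pdf(f_inv(x)) * np.exp(log_abs_det_Jinv)

# Analytic check: x = A z + b with z ~ N(0, I) means x ~ N(b, A A^T).
p_X_true = multivariate_normal(mean=b, cov=A @ A.T).pdf(x)
print(p_X_flow, p_X_true)   # the two values match up to floating-point error
```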
You can't just have a reconstruction objective
The transformation $f$ in a normalizing flow is exactly invertible, and in normalizing flows we care about density estimation, not reconstruction. The loss is based on the log-likelihood of the data $\mathbf{x}$ under the model (more below).
- Training for reconstruction instead would basically give you an autoencoder.
- In training, data flows from $\mathbf{x} \to \mathbf{z}$, where $\mathbf{z} = f^{-1}(\mathbf{x})$. We maximize the log-likelihood (or, equivalently, minimize the negative log-likelihood):
$$\log p_X(\mathbf{x}) = \log p_Z\!\left(f^{-1}(\mathbf{x})\right) + \log \left| \det \frac{\partial f^{-1}(\mathbf{x})}{\partial \mathbf{x}} \right|$$
- In sampling/generation, data flows from $\mathbf{z} \to \mathbf{x}$: sample $\mathbf{z} \sim p_Z(\mathbf{z})$ from the base distribution and apply the forward flow $\mathbf{x} = f(\mathbf{z})$.
A normalizing flow transforms a simple distribution into a complex one by applying a sequence of invertible transformation functions. Flowing through a chain of transformations, we repeatedly substitute the variable for the new one according to the change of variables theorem and eventually obtain a probability distribution of the final target variable.
We apply a chain of invertible transformations (mapping the base distribution to the target sequentially):
$$\mathbf{x} = \mathbf{z}_K = f_K \circ f_{K-1} \circ \cdots \circ f_1(\mathbf{z}_0), \qquad \mathbf{z}_0 \sim p_0(\mathbf{z}_0)$$
From this chain we have:
$$\log p(\mathbf{x}) = \log p_0(\mathbf{z}_0) - \sum_{i=1}^{K} \log \left| \det \frac{\partial f_i(\mathbf{z}_{i-1})}{\partial \mathbf{z}_{i-1}} \right|$$
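As a small illustration of how the log-determinants of the individual layers accumulate along the chain, here is a sketch with hand-fixed elementwise affine layers (the layer form and parameters are illustrative, not a trained model):

```python
import numpy as np

# Sketch of a chain of K invertible elementwise affine layers
# x = f_K(...f_1(z_0)...); the log|det| terms of the layers simply add up.
rng = np.random.default_rng(0)
K, D = 4, 3
layers = [(rng.normal(size=D) * 0.1, rng.normal(size=D) * 0.1) for _ in range(K)]  # (s_k, t_k)

def sample():
    z = rng.standard_normal(D)                 # z_0 from the base distribution
    for s_k, t_k in layers:                    # generative path: z_0 -> x
        z = np.exp(s_k) * z + t_k
    return z                                   # x = z_K

def log_prob(x):
    z, log_det = x, 0.0
    for s_k, t_k in reversed(layers):          # flow path: invert f_K, ..., f_1
        z = (z - t_k) * np.exp(-s_k)
        log_det += -np.sum(s_k)                # log|det| of each inverse layer
    log_p_z0 = -0.5 * np.sum(z**2) - 0.5 * D * np.log(2 * np.pi)   # unit Gaussian base
    return log_p_z0 + log_det

print(log_prob(sample()))
```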
What do the $f_i$ look like? They're generally affine transforms, e.g. affine coupling layers of the following form.
Flow Path (Forward Pass), $\mathbf{x} \to \mathbf{z}$:
$$\mathbf{z}_{1:d} = \mathbf{x}_{1:d}, \qquad \mathbf{z}_{d+1:D} = \left(\mathbf{x}_{d+1:D} - t(\mathbf{x}_{1:d})\right) \odot \exp\!\left(-s(\mathbf{x}_{1:d})\right)$$
Generative Path (Inverse Pass), $\mathbf{z} \to \mathbf{x}$:
$$\mathbf{x}_{1:d} = \mathbf{z}_{1:d}, \qquad \mathbf{x}_{d+1:D} = \mathbf{z}_{d+1:D} \odot \exp\!\left(s(\mathbf{z}_{1:d})\right) + t(\mathbf{z}_{1:d})$$
- $s$ and $t$ are neural networks (often small CNNs or MLPs)
- The same parameters are reused in both directions (see the coupling-layer sketch below).
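As a concrete sketch of such an affine transform, here is a RealNVP-style affine coupling layer in PyTorch; the class `AffineCoupling` and the small MLPs for $s$ and $t$ are illustrative assumptions, not a reference implementation from the course:

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """RealNVP-style affine coupling: one half of the input conditions an
    elementwise affine transform of the other half (illustrative sketch)."""

    def __init__(self, dim, hidden=64):
        super().__init__()
        self.d = dim // 2
        # s and t are small MLPs; the same parameters serve both directions.
        self.s = nn.Sequential(nn.Linear(self.d, hidden), nn.ReLU(),
                               nn.Linear(hidden, dim - self.d), nn.Tanh())
        self.t = nn.Sequential(nn.Linear(self.d, hidden), nn.ReLU(),
                               nn.Linear(hidden, dim - self.d))

    def forward(self, x):
        # Flow path (x -> z); also returns log|det| of the Jacobian.
        x1, x2 = x[:, :self.d], x[:, self.d:]
        s, t = self.s(x1), self.t(x1)
        z2 = (x2 - t) * torch.exp(-s)
        log_det = -s.sum(dim=1)
        return torch.cat([x1, z2], dim=1), log_det

    def inverse(self, z):
        # Generative path (z -> x), reusing the same s and t networks.
        z1, z2 = z[:, :self.d], z[:, self.d:]
        s, t = self.s(z1), self.t(z1)
        x2 = z2 * torch.exp(s) + t
        return torch.cat([z1, x2], dim=1)
```

Stacking several such layers, swapping or permuting the two halves in between, gives the chain of transformations described above.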
How weight updates work in flow-based models
Training is done via maximum likelihood estimation (MLE) using the change-of-variables formula.
Change of Variables
Given $\mathbf{x} = f_\theta(\mathbf{z})$ and $p_Z(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I})$ (just a unit Gaussian):
$$p_X(\mathbf{x}) = p_Z\!\left(f_\theta^{-1}(\mathbf{x})\right) \left| \det \frac{\partial f_\theta^{-1}(\mathbf{x})}{\partial \mathbf{x}} \right|$$
We know that $\mathbf{z} = f_\theta^{-1}(\mathbf{x})$, so in the end we can write:
$$\log p_X(\mathbf{x}) = \log p_Z(\mathbf{z}) + \log \left| \det \frac{\partial f_\theta^{-1}(\mathbf{x})}{\partial \mathbf{x}} \right|$$
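Plugging in the unit Gaussian base density (a small derivation sketch in the same notation, with $D$ the data dimension):
$$\log p_X(\mathbf{x}) = -\frac{1}{2}\left\| f_\theta^{-1}(\mathbf{x}) \right\|^2 - \frac{D}{2}\log 2\pi + \log \left| \det \frac{\partial f_\theta^{-1}(\mathbf{x})}{\partial \mathbf{x}} \right|$$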
Training Steps
- Inverse Pass: Given data $\mathbf{x}$, compute $\mathbf{z} = f_\theta^{-1}(\mathbf{x})$
- Compute the log-likelihood loss:
$$\mathcal{L}(\theta) = -\log p_Z(\mathbf{z}) - \log \left| \det \frac{\partial f_\theta^{-1}(\mathbf{x})}{\partial \mathbf{x}} \right|$$
- Backpropagate through:
    - the inverse transformations $f_\theta^{-1}$
    - the neural nets $s$ and $t$
    - the log-determinant term
- Gradient Descent:
    - Use Adam/SGD to update the parameters of $s$ and $t$ (see the training-loop sketch below)
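Putting these steps together, here is a minimal, self-contained training-loop sketch in PyTorch; the toy `LearnableAffineFlow` model, the synthetic data, and the hyperparameters are illustrative stand-ins for a real stack of coupling layers:

```python
import math
import torch
import torch.nn as nn

# Toy flow: a single learnable elementwise affine map z = (x - t) * exp(-s).
# This is an illustrative stand-in for a deeper stack of coupling layers.
class LearnableAffineFlow(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.s = nn.Parameter(torch.zeros(dim))
        self.t = nn.Parameter(torch.zeros(dim))

    def inverse_and_log_det(self, x):
        # Inverse pass x -> z plus log|det| of the inverse Jacobian (a scalar here).
        z = (x - self.t) * torch.exp(-self.s)
        return z, -self.s.sum()

dim = 2
flow = LearnableAffineFlow(dim)
optimizer = torch.optim.Adam(flow.parameters(), lr=1e-2)

# Toy "data": a shifted and scaled Gaussian the flow should learn to normalize.
data = torch.randn(512, dim) * torch.tensor([2.0, 0.5]) + torch.tensor([3.0, -1.0])

for step in range(200):
    z, log_det = flow.inverse_and_log_det(data)          # 1. inverse pass
    log_pz = -0.5 * (z ** 2).sum(dim=1) - 0.5 * dim * math.log(2 * math.pi)
    loss = -(log_pz + log_det).mean()                     # 2. negative log-likelihood
    optimizer.zero_grad()
    loss.backward()                                       # 3. backprop through f^{-1} and log|det|
    optimizer.step()                                      # 4. Adam parameter update
```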