context = the input sequence; autoregressive decoder = generates text one token at a time, each new token conditioned on the tokens produced so far
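A minimal sketch of the autoregressive loop, assuming a hypothetical `model` that maps token ids to next-token logits (greedy decoding, for illustration only):

```python
import torch

def generate(model, prompt_ids, max_new_tokens=20):
    """Greedy autoregressive decoding: predict the next token from the
    context so far, append it, repeat."""
    ids = prompt_ids.clone()                                      # (1, T) current context
    for _ in range(max_new_tokens):
        logits = model(ids)                                       # (1, T, vocab_size), assumed output
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # most likely next token
        ids = torch.cat([ids, next_id], dim=1)                    # context grows by one token
    return ids
```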
self-attention = learn correspondences within the data itself: which parts of the sequence relate to which, i.e. what dynamics the data has
dot-product attention: each row of the score matrix compares one query against every key, separating strong correlations from weak ones
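A sketch of scaled dot-product attention, assuming query/key/value tensors with matching last dimension d_k; each row of `scores` is one query's correlation with every key:

```python
import torch
import torch.nn.functional as F

def dot_product_attention(Q, K, V):
    """Row i of the score matrix holds query i's dot products with all keys;
    softmax turns strong correlations into large weights, weak ones into near-zero."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # (..., S_q, S_k) correlation matrix
    weights = F.softmax(scores, dim=-1)             # normalize each row
    return weights @ V                              # weighted sum of values
```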
Position matters in the attention matrix! Dot-product attention by itself ignores token order.
Encoder input = token embeddings + positional information
How to add it? Sinusoidal embedding intuition (see the sketch below)
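A sketch of the standard sinusoidal positional encoding (even embedding dimension assumed): each position gets a unique pattern of sines and cosines at geometrically spaced frequencies, added to the token embeddings.

```python
import math
import torch

def sinusoidal_embedding(seq_len, dim):
    """Sinusoidal positional encoding; dim assumed even."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)                  # (S, 1) positions
    freq = torch.exp(-math.log(10000.0) * torch.arange(0, dim, 2).float() / dim)   # (dim/2,) frequencies
    pe = torch.zeros(seq_len, dim)
    pe[:, 0::2] = torch.sin(pos * freq)   # even channels: sine
    pe[:, 1::2] = torch.cos(pos * freq)   # odd channels: cosine
    return pe                             # (S, dim), added to the input embeddings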

Tensor layout (S, batch, channels): S = sequence length in tokens, batch = number of input sequences, channels = embedding dimension
CLS token for classification: a learned token prepended to the sequence; its final representation summarizes the whole input
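A sketch tying the two points together, assuming a hypothetical `encoder` module that maps (S, batch, channels) to (S, batch, channels): a learned [CLS] token is prepended, and its output feeds the classification head.

```python
import torch
import torch.nn as nn

class ClsClassifier(nn.Module):
    """Prepend a learned [CLS] token; classify from its encoded representation."""
    def __init__(self, encoder, channels, num_classes):
        super().__init__()
        self.encoder = encoder                                 # assumed (S, B, C) -> (S, B, C) encoder
        self.cls = nn.Parameter(torch.zeros(1, 1, channels))   # learned [CLS] embedding
        self.head = nn.Linear(channels, num_classes)

    def forward(self, x):                            # x: (S, B, C) token embeddings (+ positions)
        cls = self.cls.expand(-1, x.size(1), -1)     # (1, B, C), one [CLS] per batch element
        x = torch.cat([cls, x], dim=0)               # (S+1, B, C)
        h = self.encoder(x)                          # (S+1, B, C)
        return self.head(h[0])                       # classify from the [CLS] position: (B, num_classes)
```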