Add noise gradually and learn to reverse the process

which can also be interpreted as

Denoising Diffusion Probabilistic Model (DDPM)

DDPMs are a class of generative models where output generation is modeled as a denoising process, closely related to stochastic Langevin dynamics.

Looking at the figure above, it’s really only a 2-step process:

  1. A fixed (or predefined) forward diffusion process q that adds Gaussian noise
  2. A learned reverse denoising diffusion process

1. Forward Diffusion Process in DDPM

In the forward diffusion process, for each timestep $t$, we add unit Gaussian noise $\epsilon \sim \mathcal{N}(0, I)$ to the (scaled) previous sample $x_{t-1}$ to produce $x_t$:

$$x_t = \sqrt{1 - \beta_t}\, x_{t-1} + \sqrt{\beta_t}\, \epsilon$$

  • $I$ is the identity matrix

As a conditional probability, this is written as:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1 - \beta_t}\, x_{t-1},\ \beta_t I\right)$$

or in general form:

$$q(x_t \mid x_{t-1}) = \mathcal{N}(x_t;\ \mu_t,\ \sigma_t^2 I)$$

  • $\sigma_t^2 = \beta_t$ is the variance of the noise
  • $\mu_t = \sqrt{1 - \beta_t}\, x_{t-1}$ is the mean

So basically, Forward Diffusion is a Markov Chain.

Why $\sqrt{1 - \beta_t}$?

Apparently, this is to ensure that the total variance remains 1. If $\operatorname{Var}(x_{t-1}) = 1$, then $\operatorname{Var}(x_t) = (1 - \beta_t) \cdot 1 + \beta_t = 1$, which shows that using $\sqrt{1 - \beta_t}$ ensures that $x_t$ remains unit Gaussian. The fact that $x_{t-1}$ and $\epsilon$ are independent allows the variances to sum.
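A minimal NumPy sketch of one forward step, checking numerically that the variance really stays at 1 (function names are my own, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_step(x_prev, beta_t, rng):
    """One forward diffusion step: x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps."""
    eps = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * eps

# Start from a unit-Gaussian "sample" and check the variance stays ~1.
x = rng.standard_normal(100_000)
for _ in range(50):
    x = forward_step(x, beta_t=0.02, rng=rng)
print(x.var())  # ~1.0: the sqrt(1 - beta_t) scaling preserves unit variance
```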

Variance Schedule

$\beta_t$ does not have to be constant at each timestep; we actually define a variance schedule, $\beta_1, \dots, \beta_T$.

  • it’s quite similar to the ideas behind Learning Rate
  • Can be linear, quadratic, cosine, etc.
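A sketch of two common choices, assuming the usual values from the literature (linear $10^{-4} \to 0.02$ from the DDPM paper; cosine from Nichol and Dhariwal, 2021):

```python
import numpy as np

T = 1000

# Linear schedule: beta grows linearly from 1e-4 to 0.02 over T steps.
betas_linear = np.linspace(1e-4, 0.02, T)

# Cosine schedule: define the cumulative product alpha_bar via a cosine,
# then recover per-step betas from ratios of consecutive alpha_bars.
def cosine_betas(T, s=0.008):
    t = np.arange(T + 1) / T
    alpha_bar = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    alpha_bar /= alpha_bar[0]
    betas = 1.0 - alpha_bar[1:] / alpha_bar[:-1]
    return np.clip(betas, 0.0, 0.999)  # clip to keep every step a valid variance

betas_cosine = cosine_betas(T)
```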

Forward Diffusion is a Stochastic Differential Equation

In the continuous-time limit ($T \to \infty$), the discrete Markov chain becomes a stochastic differential equation (Song et al., 2021).

2. Denoising Process

Now, let’s say we want to reverse the process. We know how $q(x_t \mid x_{t-1})$ is calculated. So we want to get $q(x_{t-1} \mid x_t)$.

We know from Bayes’ rule that

$$q(x_{t-1} \mid x_t) = \frac{q(x_t \mid x_{t-1})\, q(x_{t-1})}{q(x_t)}$$

  • From all this, we don’t know the marginals $q(x_{t-1})$ and $q(x_t)$, so $q(x_{t-1} \mid x_t)$ is intractable. That is the thing we want to predict.

Apparently, the solution is to always slap a universal function approximator, aka a neural network, onto the problem:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)$$

  • The neural net learns two parameters: $\mu_\theta$ and $\Sigma_\theta$

In the original paper

They only made the neural net learn $\mu_\theta$, and fixed the variance to $\Sigma_\theta = \sigma_t^2 I$.

We will also assume in this course that the covariance of the (to be) removed noise is diagonal.

The reverse process is also a Markov Chain:

$$p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)$$
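A sketch of how this chain is sampled, assuming a trained noise predictor `eps_model(x, t)` (hypothetical name), the paper’s fixed variance $\sigma_t^2 = \beta_t$, and the standard parameterization of the mean via the predicted noise:

```python
import numpy as np

def sample(eps_model, betas, shape, rng):
    """Ancestral sampling: start from pure noise x_T and apply p_theta(x_{t-1} | x_t)."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)  # x_T ~ N(0, I)
    for t in reversed(range(len(betas))):
        eps_hat = eps_model(x, t)   # predicted noise at step t
        # mu_theta = (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps_hat) / sqrt(alpha_t)
        mean = (x - betas[t] / np.sqrt(1 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        noise = rng.standard_normal(shape) if t > 0 else 0.0  # no noise at the final step
        x = mean + np.sqrt(betas[t]) * noise  # fixed variance sigma_t^2 = beta_t
    return x
```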

Formulating the loss function

We go from the forward pass

$$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1})$$

to the generative pass

$$p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)$$

And so we formulate the loss function as the variational bound

$$L = \mathbb{E}_q\!\left[-\log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\right]$$

which the original paper simplifies to a noise-prediction objective:

$$L_{\text{simple}} = \mathbb{E}_{t,\, x_0,\, \epsilon}\!\left[\left\| \epsilon - \epsilon_\theta(x_t, t) \right\|^2\right]$$
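In code, this objective is cheap to evaluate because the forward process can be jumped in closed form: summing the independent Gaussian steps gives $q(x_t \mid x_0) = \mathcal{N}(\sqrt{\bar\alpha_t}\, x_0,\ (1 - \bar\alpha_t) I)$ with $\bar\alpha_t = \prod_{s \le t} (1 - \beta_s)$. A sketch, again with a hypothetical `eps_model`:

```python
import numpy as np

def ddpm_loss(eps_model, x0, betas, rng):
    """Simplified DDPM objective: sample t and eps, form x_t in closed form,
    and regress the model's noise prediction onto the true eps."""
    T = len(betas)
    alpha_bars = np.cumprod(1.0 - betas)
    t = rng.integers(T)                           # random timestep
    eps = rng.standard_normal(x0.shape)           # true noise
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1 - alpha_bars[t]) * eps
    return np.mean((eps - eps_model(x_t, t)) ** 2)
```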

And what do we use as the neural net $\epsilon_\theta$?

We use U-Net (page incoming)

Flow matching vs. Diffusion?

The image kinda says it all.

Diffusion:

  • Stochastic models: given a noise sample, they generate diverse samples (many trajectories),
  • Gradually destroys a data point over time by progressively adding Gaussian noise,
  • Trains by estimating the noise added at step t (to be removed to obtain the sample at t-1),
  • Needs many steps for generation.

Flow:

  • Deterministic model: given a noise sample, it generates a specific sample (single trajectory),
  • The forward process is a linear interpolation of the data point and the noise sample,
  • Trains by minimizing the difference between an estimated and a ground-truth (Euler) velocity,
  • Generates in far fewer steps than DMs.
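A sketch of the flow-matching side of that contrast, using the common convention that $t = 0$ is data and $t = 1$ is noise (the velocity model `v_model` is a placeholder):

```python
import numpy as np

def flow_matching_loss(v_model, x0, rng):
    """Flow-matching objective with linear interpolation: x_t = (1 - t) * x0 + t * eps.
    The ground-truth (constant) velocity along this straight path is eps - x0."""
    t = rng.uniform()
    eps = rng.standard_normal(x0.shape)
    x_t = (1 - t) * x0 + t * eps
    v_target = eps - x0
    return np.mean((v_model(x_t, t) - v_target) ** 2)

def generate(v_model, shape, rng, n_steps=10):
    """Deterministic generation: integrate the learned velocity field with Euler
    steps from noise (t = 1) back to data (t = 0); note how few steps it needs."""
    x = rng.standard_normal(shape)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = 1.0 - i * dt
        x = x - dt * v_model(x, t)  # Euler step toward the data end of the path
    return x
```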

Conditional Diffusion

The reverse process becomes

$$p_\theta(x_{t-1} \mid x_t, y) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t, y),\ \Sigma_\theta(x_t, t, y)\right)$$

where $y$ is the conditioning signal (e.g., a class label or a text prompt).

Classifier Guidance and Classifier-Free Guidance

Sources: Ho and Salimans, 2021 (Classifier-Free Guidance); Song et al., 2021 (Classifier Guidance); Dhariwal and Nichol, 2021, Diffusion Models Beat GANs on Image Synthesis; meta guide, page 33

Make sure you cover this part
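A minimal sketch of the classifier-free guidance combination step at sampling time, assuming the model was trained with random condition dropout so it also accepts a null condition (here `None`):

```python
def cfg_eps(eps_model, x, t, y, guidance_scale):
    """Classifier-free guidance: blend conditional and unconditional noise predictions.
    guidance_scale = 1 recovers plain conditional sampling; > 1 pushes samples
    harder toward the condition y at the cost of diversity."""
    eps_cond = eps_model(x, t, y)       # prediction given the condition
    eps_uncond = eps_model(x, t, None)  # prediction with the null condition
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```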

Latent Diffusion Model (LDM)

You go through the encoder, do diffusion in latent space, and then decode the result.

  • The idea is that diffusion is a very expensive process, but encoding / decoding is much faster

The paper covering this is High-Resolution Image Synthesis with Latent Diffusion Models.

From the paper: “Being likelihood-based models, they do not exhibit mode-collapse and training instabilities as GANs and, by heavily exploiting parameter sharing, they can model highly complex distributions of natural images without involving billions of parameters as in AR models.”

Training happens in two stages:

  1. Train an autoencoder to encode images into a latent space
  2. Train the diffusion model to predict noise in that latent space

By predicting in latent space, we can reduce the computational load.
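A sketch of how the pieces fit together, reusing the `ddpm_loss` and `sample`-style sketches from above; the `encode`/`decode` autoencoder is a stage-1 placeholder, not the paper’s actual API:

```python
def ldm_train_step(encode, eps_model, ddpm_loss, x0, betas, rng):
    """Stage 2 of LDM training: diffuse and denoise in latent space."""
    z0 = encode(x0)  # stage-1 autoencoder compresses the image to a small latent
    return ddpm_loss(eps_model, z0, betas, rng)

def ldm_generate(decode, sample_latents, latent_shape, rng):
    """Generation: run the (expensive) denoising loop on small latents,
    then decode once back to pixel space."""
    z = sample_latents(latent_shape, rng)  # diffusion happens on latents, not pixels
    return decode(z)                       # single cheap decoder pass
```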

So what I should remember is that:

  • The diffusion and denoising are done on a compressed (lower-dimensional) version of the samples

I love it when professors do charity work:

Some applications where we want to use diffusion models:

  • text-to-image generation
  • image editing and composition
  • visual illusions
  • novel view synthesis
  • policy generation in robotics
  • video generation
  • …