Lecture from the GenAI Models and Robotic Applications course. Related to Normalizing Flows.

Resources: Flow Matching for Generative Modeling paper, this blog from Cambridge

Flow matching is a training method used to learn a mapping from a source distribution to a target distribution by approximating the underlying vector field.

Diffusion Flow Matching Generative Models

Velocity Fields and Continuous Normalizing Flows (CNF)

In NF, we had a base density and a target density, connected by a discrete sequence of invertible transformations. The transformations are independent (each layer is a separate map; there is no continuous time variable).

  • Déblai (Monge's optimal-transport term for the source): the base probability density (the one we sample from).
  • Remblai (the target/fill): the target probability density (images, videos).

Time-Dependent Velocity Field

We have a trajectory $x_t$ going from the noise $x_0$ to a realistic output $x_1$:

(Figures: flow 1 and flow 2)

Different definitions use different time conventions. We use $t \in [0, 1]$, with $x_0 \sim p$ at the noise end ($t = 0$) and $x_1 \sim q$ at the data end ($t = 1$).

We can now consider the rate of displacement in time with the help of derivatives: the velocity is $u_t(x_t) = \frac{dx_t}{dt}$.

The velocity is essentially approximated by a model: given an input $x_t$, a time step $t$, and parameters $\theta$, a neural network $v_\theta(x_t, t)$ approximates the velocity.

We have an ordinary differential equation (ODE): $\frac{dx_t}{dt} = u_t(x_t)$, with the initial condition $x_0$ drawn from the base density.

Which velocity fields are relevant?

We already covered in Normalizing Flows that, with the help of the Change-of-Variables Formula, we can go from one density function to another with the following formula:

$$p_X(x) = p_Z\big(f^{-1}(x)\big)\, \left| \det \frac{\partial f^{-1}(x)}{\partial x} \right|$$

  • which is the same as $\log p_X(x) = \log p_Z\big(f^{-1}(x)\big) + \log \left| \det \frac{\partial f^{-1}(x)}{\partial x} \right|$
  • don't forget that to go from the target distribution space $x$ to the latent space $z$, we define $z = f^{-1}(x)$
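As a quick worked example (a hypothetical 1D case, not from the lecture): take $z \sim \mathcal{N}(0, 1)$ and the affine map $x = f(z) = az + b$ with $a > 0$, so $f^{-1}(x) = (x - b)/a$ and $\frac{\partial f^{-1}}{\partial x} = \frac{1}{a}$. Then

$$p_X(x) = p_Z\!\Big(\frac{x - b}{a}\Big) \cdot \frac{1}{a},$$

which is exactly the density of $\mathcal{N}(b, a^2)$: the Jacobian factor $1/a$ rescales the density so it still integrates to 1.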

The logarithmic compression of a volume element equals the time-integrated expansion rate along its path

Jacobi's formula: $\frac{d}{dt} \det A(t) = \det A(t)\, \operatorname{tr}\!\Big( A(t)^{-1} \frac{dA(t)}{dt} \Big)$

At this moment, we have $O(D^2)$ complexity because the Jacobian matrix has $D^2$ elements. If we want to compute the determinant of that, we actually have $O(D^3)$ complexity.

So using Jacobi's formula, we show that we only need the trace of the Jacobian matrix (the sum of the diagonal): $\frac{d}{dt} \log p_t(x_t) = -\operatorname{tr}\!\Big( \frac{\partial u_t}{\partial x}(x_t) \Big)$.

This reduces the complexity from $O(D^3)$ to $O(D)$.
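To make the trace trick concrete, here is a minimal PyTorch sketch (the name `v_theta`, the shapes, and the use of Hutchinson's estimator are assumptions for illustration, not the lecture's code) that estimates $\operatorname{tr}\big(\frac{\partial v}{\partial x}\big)$ with vector-Jacobian products instead of building the full $D \times D$ Jacobian:

```python
import torch

def divergence_hutchinson(v_theta, x, t, n_probes=1):
    """Estimate tr(dv/dx) at (x, t) without forming the full D x D Jacobian.

    v_theta : callable (x, t) -> velocity, shape (batch, D)
    x       : (batch, D) points,  t : (batch,) times
    Hutchinson's estimator: E_eps[ eps^T (dv/dx) eps ] = tr(dv/dx),
    one vector-Jacobian product (one backward pass) per probe.
    """
    x = x.detach().requires_grad_(True)
    v = v_theta(x, t)
    div = torch.zeros(x.shape[0], device=x.device)
    for _ in range(n_probes):
        eps = torch.empty_like(x).bernoulli_(0.5) * 2 - 1      # Rademacher +-1 probe
        vjp = torch.autograd.grad(v, x, grad_outputs=eps,
                                  retain_graph=True)[0]        # eps^T (dv/dx)
        div = div + (vjp * eps).sum(dim=1)
    return div / n_probes
```

In maximum-likelihood CNF training (FFJORD-style), this divergence would be accumulated along the ODE trajectory to get the log-likelihood; flow matching sidesteps that computation entirely, as discussed below.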

So now, considering the equations from before, we can express the end location $x_1$ as the start $x_0$ plus the accumulated displacement over $t \in [0, 1]$:

$$x_1 = x_0 + \int_0^1 u_t(x_t)\, dt$$

ODE numerical simulations are very expensive

We want to train an ODE-based NN with maximum likelihood, which requires actually solving the ODE. This is where Euler's method comes in.

For ordinary differential equations (ODEs), we can approximate the solution by taking small sequential steps along the tangent (the velocity) at the current point.

Euler's Method

$$x_{t+h} = x_t + h \cdot u_t(x_t)$$

So I hope it's clear that we approximate the flow (the trajectory) by integrating the velocity field with an ODE solver (e.g. Euler's method).
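A minimal Euler sampler sketch (assuming a trained velocity network `v_theta(x, t)`; the name and the fixed 0-to-1 time grid are assumptions, not the lecture's code):

```python
import torch

@torch.no_grad()
def sample_euler(v_theta, x0, n_steps=50):
    """Integrate dx/dt = v_theta(x, t) from t = 0 (noise) to t = 1 (data)
    with fixed-step Euler: x_{t+h} = x_t + h * v_theta(x_t, t)."""
    x = x0                                    # (batch, D) noise sample
    h = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((x.shape[0],), i * h, device=x.device)
        x = x + h * v_theta(x, t)
    return x                                  # approximate sample from the target

# Usage on a hypothetical 2-D toy model:
# x0 = torch.randn(64, 2)
# samples = sample_euler(v_theta, x0, n_steps=100)
```

More steps means a more accurate trajectory but a more expensive sampling pass, which is exactly the cost Mean Flows (below) try to remove.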

Now we can define the loss function of Flow Matching based models, which is based on the MSE (Mean Squared Error):

$$\mathcal{L}_{\mathrm{FM}}(\theta) = \mathbb{E}_{t \sim \mathcal{U}[0,1],\ x \sim p_t}\, \big\| v_\theta(x, t) - u_t(x) \big\|^2$$

where

  • $p_t$ is the probability path
  • $v_\theta$ is the velocity NN with params $\theta$
  • $u_t$ is the ground-truth velocity field (the one the ODE solver, e.g. Euler, would integrate)

So I covered the fact that CNFs are slow due to the ODE integration at each iteration. CFMs are a very nice solution!

(Conditional) Flow Matching (CFM)

The condition (the label) can come from CLIP. They use the Euler approximation, which gives those cartoonish effects in most generative models' output. So it is basically always sampling conditioned on an enormous database (400 million image-text pairs, in the case of CLIP).

But what should the probability path be?

The CFM loss function learns a velocity field conditioned on target samples $x_1$ drawn from the data distribution $q$, where $p_t(x \mid x_1)$ is a conditional path distribution:

$$\mathcal{L}_{\mathrm{CFM}}(\theta) = \mathbb{E}_{t,\ x_1 \sim q,\ x \sim p_t(x \mid x_1)}\, \big\| v_\theta(x, t) - u_t(x \mid x_1) \big\|^2$$

The key difference: CFM conditions on individual data points $x_1$ from the target distribution $q$, making training easier by decomposing the problem into simpler conditional paths.

  • $x_1$ is the image from the database.
  • $x_0$ is the noise sample.
  • $c$ is the constraint from CLIP (text) or any other VLM.

CFM inference: find an image in the database based on the constraint (c = labels(i)).

  • Key Insights:
    • Their gradients are identical: the two loss functions only differ by a constant offset $C$ (sketched after this list).
    • Optimizing CFM is equivalent to optimizing FM (same optimal parameters).
    • So you can train using CFM and get the same results as the harder-to-compute FM.
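A sketch of why the two objectives differ only by a constant (the standard FM/CFM argument, in the notation above). Expanding the square,

$$\mathcal{L}_{\mathrm{FM}}(\theta) = \mathbb{E}_{t, x}\big\| v_\theta(x, t) \big\|^2 - 2\, \mathbb{E}_{t, x}\big\langle v_\theta(x, t),\, u_t(x) \big\rangle + \underbrace{\mathbb{E}_{t, x}\big\| u_t(x) \big\|^2}_{\text{no } \theta}.$$

Since the marginal velocity is a conditional expectation, $u_t(x) = \mathbb{E}_{x_1}\big[\, u_t(x \mid x_1) \mid x \,\big]$, the cross term equals $\mathbb{E}_{t, x_1, x}\big\langle v_\theta(x, t),\, u_t(x \mid x_1) \big\rangle$, which is exactly the cross term of $\mathcal{L}_{\mathrm{CFM}}$. The remaining terms do not depend on $\theta$, so $\nabla_\theta \mathcal{L}_{\mathrm{FM}} = \nabla_\theta \mathcal{L}_{\mathrm{CFM}}$.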

Reminder: velocity is the rate of change between random noise and the image from the database.

You use the conditional flows to define your loss function (left) during training, but what emerges from that training is a network that predicts the marginal flow (right).

Conditional flows are the individual velocity fields for specific conditioning pairs $(x_0, x_1)$. They are used to construct the training objective but aren't directly what the network predicts.

Marginal flows are what you get when you average all those conditional flows together. This is the actual velocity field the network learns to predict.
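A sketch of a single CFM training step, assuming the simple straight-line conditional path $x_t = (1 - t)\, x_0 + t\, x_1$ (so the conditional target velocity is $x_1 - x_0$); `v_theta` and `optimizer` are placeholder names, not the lecture's code:

```python
import torch

def cfm_training_step(v_theta, optimizer, x1):
    """One Conditional Flow Matching step with the straight-line path
    x_t = (1 - t) x0 + t x1, whose conditional velocity is x1 - x0."""
    x0 = torch.randn_like(x1)                        # noise sample (source)
    t = torch.rand(x1.shape[0], device=x1.device)    # one time per sample
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))         # broadcast t over data dims
    xt = (1 - t_) * x0 + t_ * x1                     # point on the conditional path
    target = x1 - x0                                 # conditional velocity
    loss = ((v_theta(xt, t) - target) ** 2).mean()   # MSE, i.e. the CFM loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Note that there is no ODE solver anywhere in this training loop; that is the whole point compared to maximum-likelihood CNF training.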

Mean Flows

Resource: Mean Flows for One-step Generative Modeling

Code implementation: this blog

The core idea is to introduce a new ground-truth field representing the average velocity $u(z_t, r, t)$, whereas the velocity modeled in Flow Matching represents the instantaneous velocity $v(z_t, t)$. The average velocity over an interval $[r, t]$ is

$$u(z_t, r, t) \triangleq \frac{1}{t - r} \int_r^t v(z_\tau, \tau)\, d\tau .$$

Flow Matching essentially models the expectation over all possibilities, called the marginal velocity field: $v(z_t, t) = \mathbb{E}\,[\, v_t \mid z_t \,]$.

Given the actual data $x$ and the noise $\epsilon$, a flow path can be constructed as $z_t = (1 - t)\, x + t\, \epsilon$ (the Mean Flows paper's convention, where $t = 1$ is the noise end), with instantaneous velocity $v_t = \frac{dz_t}{dt} = \epsilon - x$.

By setting $r = 0$ and $t = 1$, we can instantly generate outputs from inputs (1 step only): $z_0 = z_1 - u(z_1, 0, 1)$. So the "core idea" is that we can generate with fewer steps, since the model already averages over the whole interval $[r, t]$. If $r = t$, then it reduces to standard Flow Matching.

The bigger the step, the farther away we are from the actual image. I don't want to be constrained by where I am: I want to start at $r$ and stop at $t$, and take the average velocity accumulated over that interval. $r$ is the new, additional timestep input. It works best for bigger steps.

The ultimate aim will be to approximate the average velocity using a neural network $u_\theta(z_t, r, t)$. The approach is much more amenable to single- or few-step generation, as it does not need to explicitly approximate a time integral at inference time, which was required when modeling the instantaneous velocity.
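A sketch of what one-step generation then looks like (assuming a trained average-velocity network `u_theta(z, r, t)`, a placeholder name, and the paper's convention that $t = 1$ is the noise end):

```python
import torch

@torch.no_grad()
def sample_meanflow_one_step(u_theta, z1):
    """z1 ~ N(0, I) is the noise at t = 1; one application of the average
    velocity over the whole interval [r, t] = [0, 1] lands on the data estimate:
    z_0 = z_1 - (t - r) * u(z_1, r, t), with t - r = 1."""
    batch = z1.shape[0]
    r = torch.zeros(batch, device=z1.device)   # start of the interval (data end)
    t = torch.ones(batch, device=z1.device)    # end of the interval (noise end)
    return z1 - u_theta(z1, r, t)
```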

Mean Flow Training

The Mean Flow Identity:

$$u(z_t, r, t) = v(z_t, t) - (t - r)\, \frac{d}{dt}\, u(z_t, r, t)$$

And therefore, expanding the total derivative $\frac{d}{dt} u = v(z_t, t)\, \partial_z u + \partial_t u$, we can finally express the average velocity (the training target) as:

$$u(z_t, r, t) = v(z_t, t) - (t - r)\, \big( v(z_t, t)\, \partial_z u + \partial_t u \big)$$

Training with Average Velocity:

We now introduce a model to learn $u$ (denoted as $u_\theta(z_t, r, t)$).

And in the end, the loss function for Mean Flow models is represented as

$$\mathcal{L}(\theta) = \mathbb{E}\, \big\| u_\theta(z_t, r, t) - \mathrm{sg}(u_{\mathrm{tgt}}) \big\|^2, \qquad u_{\mathrm{tgt}} = v_t - (t - r)\, \big( v_t\, \partial_z u_\theta + \partial_t u_\theta \big)$$

From the paper: the target uses the instantaneous velocity $v$ as the only ground-truth signal; no integral computation is needed. While the target should involve derivatives of $u$ ($\partial_z u$, $\partial_t u$), they are replaced by their parametrized counterparts ($\partial_z u_\theta$, $\partial_t u_\theta$). In the loss function, a stop-gradient (sg) operation is applied on the target because it eliminates the need for "double backpropagation" through the Jacobian-vector product, thereby avoiding higher-order optimization.

The loss function is as in FM, but the target is the mean (average) velocity. Understand the highlighted part from the paper. You mix the mean flow and the normal flow: in some steps you do one, in some the other (a fraction of the sampled $(r, t)$ pairs has $r = t$). I mentioned above that if $r = t$, then it reduces to standard Flow Matching.
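Finally, a simplified sketch of one Mean Flow training step (`u_theta` and `optimizer` are placeholder names; the 50% $r = t$ ratio is an arbitrary choice here, the paper's adaptive loss weighting is omitted, and the JVP and the prediction are computed as two separate forward passes for clarity):

```python
import torch
from torch.func import jvp

def meanflow_training_step(u_theta, optimizer, x):
    """One Mean Flow training step (simplified sketch).

    u_theta : network u_theta(z, r, t) -> average velocity, same shape as z
    x       : (batch, D) data batch
    Path (Mean Flow paper convention): z_t = (1 - t) x + t eps, v_t = eps - x.
    """
    eps = torch.randn_like(x)
    batch = x.shape[0]
    t = torch.rand(batch, device=x.device)
    r = torch.rand(batch, device=x.device) * t          # ensure r <= t
    keep = torch.rand(batch, device=x.device) < 0.5     # fraction trained with r = t
    r = torch.where(keep, t, r)                         # r = t reduces to plain FM

    t_ = t.view(-1, *([1] * (x.dim() - 1)))
    r_ = r.view(-1, *([1] * (x.dim() - 1)))
    z_t = (1 - t_) * x + t_ * eps                       # point on the path
    v_t = eps - x                                       # instantaneous velocity

    # total derivative d/dt u_theta(z_t, r, t) via a forward-mode JVP
    # with tangents (dz/dt, dr/dt, dt/dt) = (v_t, 0, 1)
    _, du_dt = jvp(u_theta, (z_t, r, t),
                   (v_t, torch.zeros_like(r), torch.ones_like(t)))

    u_tgt = (v_t - (t_ - r_) * du_dt).detach()          # stop-gradient target
    u_pred = u_theta(z_t, r, t)                         # prediction, keeps the graph
    loss = ((u_pred - u_tgt) ** 2).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the target is stop-gradded, only the plain prediction needs to carry gradients; the JVP result enters the loss purely as data, which is exactly the "no double backpropagation" point made above.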