Both the Inception Score (IS) and the Fréchet Inception Distance (FID) are metrics used to assess the quality of images produced by generative models such as GANs and diffusion models. As of 2024, FID is the standard metric for evaluating models that generate synthetic images.

Inception Score

But what about object attributes, e.g. edges, shapes, and colors? This is where IS comes in.

A pretrained image classification model, InceptionV3, is used to calculate this score. InceptionV3 is pretrained on the ImageNet dataset, which has 1000 classes/labels. The images produced by the generative model are passed through the InceptionV3 network, and the probability of each image over every class/label is calculated.

Wtf is this metric trying to capture?

  1. Images generated by our generative model should contain clear, sharp, and distinct objects.
  2. The generative model should output a diverse set of images spanning the different classes in the ImageNet dataset.

A higher value of IS indicates better image quality.

  • Images with meaningful objects are supposed to have low label (output) entropy, that is, they belong to few object classes.
  • On the other hand, the entropy across images should be high, that is, the variance over the images should be large.
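
These two entropy requirements combine into the usual IS formula, IS = exp(E_x[KL(p(y|x) || p(y))]). A minimal NumPy sketch under that assumption (the function name and toy inputs are mine, not from the source):

```python
# Sketch of the Inception Score from a matrix of predicted class
# probabilities, one row per generated image:
#   IS = exp( E_x [ KL( p(y|x) || p(y) ) ] )
import numpy as np

def inception_score(probs, eps=1e-12):
    """probs: (N, K) array, each row a softmax over K classes."""
    p_y = probs.mean(axis=0, keepdims=True)  # marginal p(y); high entropy is good
    kl = probs * (np.log(probs + eps) - np.log(p_y + eps))
    return float(np.exp(kl.sum(axis=1).mean()))  # exp of the mean per-image KL

# Confident AND diverse predictions score high:
sharp = np.eye(4)                  # 4 images, each confidently a different class
print(inception_score(sharp))      # ~4.0, the maximum for 4 classes
# Uniform (uncertain) predictions score the minimum:
blurry = np.full((4, 4), 0.25)
print(inception_score(blurry))     # ~1.0
```

Note how the score rewards exactly the two bullets above: sharp rows (low conditional entropy) and a spread-out marginal (high entropy across images).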

Limitation

In simpler terms: IS doesn’t compare real and generated image statistics. It only looks at generated samples in isolation, without checking if they match the distribution of real-world data. The math shows that two distributions are equal only when their expectations match across all possible basis functions, but IS never makes this comparison.

Key Weakness: IS can be high even if the generated images don’t look like real images, as long as they’re clear and diverse. A GAN could generate sharp, varied images of unrealistic objects and still score well. This is where FID comes in.

FID

FID fixes the main problem with IS by directly comparing real and synthetic image distributions.

The difference between two Gaussians (fitted to synthetic and real-world image features) is measured by the Fréchet distance, also known as the Wasserstein-2 distance. The Fréchet Inception Distance (FID) between the Gaussian $(m, C)$ obtained from the generated data and the Gaussian $(m_w, C_w)$ obtained from the real-world data is computed as:

$$d^2\big((m, C), (m_w, C_w)\big) = \lVert m - m_w \rVert_2^2 + \operatorname{Tr}\big(C + C_w - 2 (C C_w)^{1/2}\big)$$

The key improvement: FID uses statistics from both real and synthetic images, unlike IS, which only looks at synthetic images. This means FID can detect when your GAN generates sharp, diverse images that still don’t look realistic. The FID authors even tested it on progressively corrupted images:

  • Gaussian noise
  • Gaussian blur
  • Salt and pepper noise

And FID successfully captured the disturbance level in all cases, confirming it aligns with human judgment of image quality degradation.

Lower FID = better (distributions are closer)

So, the steps of FID are:

  1. Pass both real and generated images through InceptionV3
  2. Extract features from an intermediate layer (not final classification)
  3. Model both feature distributions as Gaussians with means ($m$, $m_w$) and covariances ($C$, $C_w$)
  4. Measure the distance between these Gaussians using the formula above.
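
The steps above can be sketched in NumPy/SciPy (the helper name and the toy Gaussian features are mine; in practice the inputs would be InceptionV3 intermediate-layer features for real and generated images):

```python
# Sketch of the FID computation between two feature sets, following the
# Frechet distance formula; scipy provides the matrix square root.
import numpy as np
from scipy import linalg

def fid(feats_real, feats_fake):
    """Each input: (N, D) array of intermediate-layer features."""
    m_w, C_w = feats_real.mean(axis=0), np.cov(feats_real, rowvar=False)
    m, C = feats_fake.mean(axis=0), np.cov(feats_fake, rowvar=False)
    covmean = linalg.sqrtm(C @ C_w)   # (C C_w)^{1/2}
    if np.iscomplexobj(covmean):      # drop tiny imaginary parts from numerics
        covmean = covmean.real
    return float(np.sum((m - m_w) ** 2) + np.trace(C + C_w - 2 * covmean))

rng = np.random.default_rng(0)
a = rng.normal(size=(500, 8))
print(fid(a, a))        # identical distributions: distance near 0
print(fid(a, a + 2.0))  # shifted mean: distance grows with the shift
```

Since FID is a distance between the two fitted Gaussians, identical feature sets score (numerically) zero, and any mismatch in mean or covariance pushes the score up; lower is better, as noted above.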