Sources: UTwente slides, Stochastic trajectory prediction via motion indeterminacy diffusion, Vectornet: Encoding hd maps and agent dynamics from vectorized representation, LAformer: Trajectory Prediction for Autonomous Driving with Lane-Aware Scene Constraints

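Trajectory Conditional Prediction:

$$p(Y \mid X, C), \qquad X = \{x^t\}_{t=-T_{\text{obs}}+1}^{0}, \quad Y = \{y^t\}_{t=1}^{T_{\text{pred}}}$$

  • where $X$ is the observed past trajectory, $Y$ is the future trajectory, and the constraints $C$ include, e.g., social interactions with other agents and maps.
  • $T_{\text{obs}}$ is the observation time horizon and $T_{\text{pred}}$ is the prediction time horizon.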

Conditional Prediction: decisions have to be made within seconds (e.g. autonomous driving) or minutes (e.g. marine applications).

We model trajectory prediction as a spatio-temporal mapping, which can then be divided into:

  • Scene constraints
  • Multi-path prediction
  • Interaction modeling

The limitations of generative models on this task are:

Diffusion Models for Trajectory Prediction

Why Diffusion?

  • Multi-modality: naturally generate diverse possible futures (not just one)
  • Stable training: unlike GANs, no mode collapse or adversarial instability
  • Uncertainty modeling: probabilistic sampling fits real-world robotics needs
  • Flexible conditioning: can incorporate maps, goals, dynamics, safety constraints
  • Strong empirical results: state-of-the-art in trajectory forecasting & robot planning

So basically, diffusion models are state-of-the-art, and many applications have already been built on top of them. There are also interesting improvements, such as Mean Flow, which appeared in May 2025.

So we return to the Conditional Prediction:

Encoding of the condition (C): scene constraints and interactions

This can be done through:

  • Rasterized maps
    • e.g. bird's-eye views, semantic maps
  • Vectorized maps
    • e.g. HD maps

PROS and CONS of these visualizations

Interaction Modeling:

  1. Global interactions using Graph Convolutional Networks (GCN).
  • A fully-connected graph models agent-to-agent, agent-to-scene, and scene-to-scene interactions
  • Self-supervised learning is used to predict the masked nodes
  2. Agent-to-scene interactions using likelihood estimation.
  • Use a binary classifier to estimate, at each time step, the likelihood that each lane is aligned with the target agent’s motion dynamics
  • Only select the top-k lane candidates
  3. Interaction modeling with attention (a.k.a. leveraging transformers).
  • Each agent computes its own query vector (queries are per-agent)
  • Other agents provide the keys and values.
  • The attention weights are then computed from the query-key similarities.
  • The result is a weighted sum of the other agents' values, i.e. a weighted sum of interactions.
  • My intuition tells me you would need lots of data for this.
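The attention-based variant can be sketched in a few lines of NumPy. This is only an illustrative single-head version, not the architecture of any of the cited papers: the function name `agent_attention`, the random projection matrices, and the feature sizes are all assumptions made for the example. Each agent's query attends over the keys and values of the other agents (self-attention is masked out to mirror the description above).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; exp(-inf) evaluates to 0 for masked entries.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def agent_attention(agent_feats, W_q, W_k, W_v):
    """Single-head scaled dot-product attention across agents (hypothetical helper).

    agent_feats: (N, d_in) one feature vector per agent.
    Returns (context, weights) with shapes (N, d) and (N, N).
    """
    Q = agent_feats @ W_q  # one query per agent
    K = agent_feats @ W_k  # keys provided by the agents
    V = agent_feats @ W_v  # values provided by the agents
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    # Mask the diagonal so an agent only attends to the *other* agents.
    scores[np.eye(len(agent_feats), dtype=bool)] = -np.inf
    weights = softmax(scores, axis=-1)
    return weights @ V, weights  # context = weighted sum of values

rng = np.random.default_rng(0)
N, d_in, d = 4, 6, 8  # toy sizes, chosen arbitrarily
feats = rng.normal(size=(N, d_in))
W_q, W_k, W_v = (rng.normal(size=(d_in, d)) for _ in range(3))
context, weights = agent_attention(feats, W_q, W_k, W_v)
```

Each row of `weights` sums to one, so `context[i]` is a convex combination of the other agents' value vectors. In a trained model the projections would be learned rather than random.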

Multi-Path Prediction:

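Recap of diffusion: the forward diffusion process gradually corrupts an input sample $x_0$ by adding Gaussian noise over $T$ timesteps:

$$q(x_t \mid x_0) = \mathcal{N}\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1 - \bar{\alpha}_t)\, I\right)$$

where $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$ is the cumulative product of the noise schedule parameters, with $\alpha_s = 1 - \beta_s$.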
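As a sanity check of the forward process, here is a minimal NumPy sketch that samples a noised trajectory in closed form, $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$. The linear schedule, its endpoints, the number of steps, and the helper names `make_schedule` and `q_sample` are assumptions for illustration only.

```python
import numpy as np

def make_schedule(T=100, beta_start=1e-4, beta_end=0.05):
    # Linear beta schedule (an assumed choice; other schedules exist).
    betas = np.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)  # cumulative product of the alphas
    return betas, alphas, alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t ~ q(x_t | x_0) for a 1-indexed diffusion step t."""
    eps = rng.standard_normal(x0.shape)
    a_bar = alpha_bars[t - 1]
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
    return x_t, eps

rng = np.random.default_rng(0)
betas, alphas, alpha_bars = make_schedule()

# Toy "trajectory": 12 future (x, y) positions on a straight line.
x0 = np.stack([np.linspace(0.0, 10.0, 12), np.zeros(12)], axis=-1)

x_early, _ = q_sample(x0, 1, alpha_bars, rng)    # almost clean
x_late, _ = q_sample(x0, 100, alpha_bars, rng)   # mostly noise
```

Because $\bar{\alpha}_t$ shrinks monotonically, the signal term fades and the noise term grows as $t$ increases.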

The reverse (denoising) process trains a neural network to undo the noise step by step and recover the data.

So, how to apply diffusion models for trajectory prediction?

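We defined the conditional prediction as $p(Y \mid X, C)$.

Now we are going to denoise $p_\theta(Y_{t-1} \mid Y_t, X, C)$:

  • $C$ is the encoding of the condition
  • $t \in \{1, \dots, T\}$ indexes the diffusion step (not the trajectory time), where $T$ is the maximum number of diffusion steps.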
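To make the reverse pass concrete, below is a minimal DDPM-style ancestral sampling loop in NumPy. It is a sketch under strong assumptions, not the implementation from the cited papers: `ddpm_sample` is a hypothetical helper, the variance is fixed to $\beta_t$, and the trained network is replaced by a toy `oracle_eps` whose "condition" already contains the target trajectory. That makes the loop checkable, because the final denoising step then recovers the target exactly.

```python
import numpy as np

def ddpm_sample(eps_model, cond, shape, betas, rng):
    """Ancestral sampling: start from Gaussian noise and denoise step by step."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    y = rng.standard_normal(shape)  # start from pure Gaussian noise
    for t in range(len(betas), 0, -1):
        a, a_bar, b = alphas[t - 1], alpha_bars[t - 1], betas[t - 1]
        eps_hat = eps_model(y, t, cond)  # predicted noise given the condition
        mean = (y - b / np.sqrt(1.0 - a_bar) * eps_hat) / np.sqrt(a)
        # No noise is added on the last step (t = 1).
        noise = rng.standard_normal(shape) if t > 1 else 0.0
        y = mean + np.sqrt(b) * noise
    return y

betas = np.linspace(1e-4, 0.05, 100)  # assumed linear schedule
alpha_bars = np.cumprod(1.0 - betas)

# Toy target: 12 future (x, y) positions on a straight line.
y_star = np.stack([np.linspace(0.0, 10.0, 12), np.zeros(12)], axis=-1)

def oracle_eps(y_t, t, cond):
    # Stand-in for a trained network: returns the exact noise, because in
    # this toy the condition `cond` is the target trajectory itself.
    a_bar = alpha_bars[t - 1]
    return (y_t - np.sqrt(a_bar) * cond) / np.sqrt(1.0 - a_bar)

sample = ddpm_sample(oracle_eps, y_star, y_star.shape, betas,
                     np.random.default_rng(0))
```

With a real model, `eps_model` would be a neural network taking the encoded condition, and repeated calls with different seeds would yield different plausible futures.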

Code implementation: GitHub link. Many teams also submitted their approaches to the Argoverse 2: Motion Prediction Challenge. I will leave the link here in case of future need: link to challenge.

One colleague asked in class why we always have to make it stochastic. Trajectory prediction can quite simply be made deterministic using tools such as cubic polynomials, splines, or Bézier curves. For example, during my bachelor's thesis I was collaborating with the Bosch Future Mobility Challenge group, and they approximated the short-distance future trajectory using Bézier curves, which I found extremely interesting. But I guess stochasticity lets you slap a universal function approximator (a.k.a. a neural network) on the problem and still cover several plausible futures.
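For comparison, a deterministic short-horizon trajectory from a cubic Bézier curve takes only a few lines. This is a generic sketch of the curve itself, not the Bosch Future Mobility Challenge team's code; the control points and the helper name `cubic_bezier` are made up for illustration.

```python
import numpy as np

def cubic_bezier(p0, p1, p2, p3, n_points=20):
    """Evaluate a cubic Bezier curve at n_points evenly spaced parameters in [0, 1]."""
    t = np.linspace(0.0, 1.0, n_points)[:, None]  # column vector for broadcasting
    return ((1 - t) ** 3 * p0
            + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2
            + t ** 3 * p3)

# Hypothetical control points for a gentle left turn (meters).
p0 = np.array([0.0, 0.0])   # current position
p1 = np.array([2.0, 0.0])   # keeps the initial heading
p2 = np.array([4.0, 1.0])
p3 = np.array([5.0, 3.0])   # target position

path = cubic_bezier(p0, p1, p2, p3)  # (20, 2) array of waypoints
```

The curve always starts at `p0` and ends at `p3`, but it yields exactly one future per set of control points, which is the limitation that motivates the stochastic models above.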

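Some metrics include Average Displacement Error (ADE) and Final Displacement Error (FDE):

$$\text{ADE} = \frac{1}{N\, T_{\text{pred}}} \sum_{i=1}^{N} \sum_{t=1}^{T_{\text{pred}}} \left\lVert \hat{y}_i^t - y_i^t \right\rVert_2, \qquad \text{FDE} = \frac{1}{N} \sum_{i=1}^{N} \left\lVert \hat{y}_i^{T_{\text{pred}}} - y_i^{T_{\text{pred}}} \right\rVert_2$$

where

  • $N$ is the number of agents,
  • $t$ are the timesteps, up to the prediction horizon $T_{\text{pred}}$,
  • $\hat{y}_i^t$ is the predicted position of agent $i$ at step $t$, while $y_i^t$ is the ground-truth position.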
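Both metrics reduce to a Euclidean norm followed by a mean, as in this small NumPy sketch. The `(N, T, 2)` array layout and the function name `ade_fde` are assumptions; ADE averages the per-step distances over all agents and steps, and FDE averages only the last-step distances.

```python
import numpy as np

def ade_fde(pred, gt):
    """pred, gt: arrays of shape (N agents, T steps, 2). Returns (ADE, FDE)."""
    dist = np.linalg.norm(pred - gt, axis=-1)  # (N, T) per-step distances
    return dist.mean(), dist[:, -1].mean()

# Toy example: 2 agents, 3 predicted steps, ground truth at the origin.
gt = np.zeros((2, 3, 2))
pred = np.zeros((2, 3, 2))
pred[0] += np.array([3.0, 4.0])     # agent 0 is off by 5 m at every step
pred[1, -1] = np.array([0.0, 1.0])  # agent 1 is off by 1 m at the last step only

ade, fde = ade_fde(pred, gt)
# ade = (5 + 5 + 5 + 0 + 0 + 1) / 6 = 16/6, fde = (5 + 1) / 2 = 3.0
```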

On top of these two, some other metrics can be defined.

  • Miss Rate (MR)
    • The fraction of scenarios where none of the forecasted trajectories are within 2.0 meters of the ground truth, measured by endpoint error
    • This metric gives a general hint of how many scenarios fail
  • Collision Rate (CR)
    • Percentage of generated trajectories that collide with other agents or obstacles (a collision is counted when the distance is < 0.1 m)
  • Multimodal predictions: as models output $K$ samples
    • minADE_K, minFDE_K
      • best-of-K error (take the prediction closest to the ground truth)
    • Miss Rate (MR@K)
      • Fraction of cases where none of the predicted trajectories fall within a set threshold (e.g., 2m) of the ground truth final point
      • Useful to measure coverage of plausible futures
    • Brier-minFDE
      • $\text{Brier-minFDE} = \text{minFDE}_K + (1 - p)^2$,
      • $p$ is the probability of the best predicted trajectory out of the $K$ samples.
  • Negative Log-Likelihood
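The best-of-K metrics above can be computed together, as in the following NumPy sketch. The array layout `(S scenarios, K modes, T steps, 2)`, the function name `multimodal_metrics`, and the choice to select the "best" mode by endpoint error for the Brier term are assumptions for illustration; benchmark toolkits may differ in details.

```python
import numpy as np

def multimodal_metrics(preds, gt, probs, miss_threshold=2.0):
    """preds: (S, K, T, 2), gt: (S, T, 2), probs: (S, K) mode probabilities."""
    dist = np.linalg.norm(preds - gt[:, None], axis=-1)  # (S, K, T)
    ade_k = dist.mean(axis=-1)   # (S, K) average error per mode
    fde_k = dist[..., -1]        # (S, K) endpoint error per mode
    best = fde_k.argmin(axis=1)  # best mode by endpoint error
    rows = np.arange(len(gt))
    min_fde = fde_k[rows, best]
    return {
        "minADE": ade_k.min(axis=1).mean(),
        "minFDE": min_fde.mean(),
        "MR": (min_fde > miss_threshold).mean(),
        "brier-minFDE": (min_fde + (1.0 - probs[rows, best]) ** 2).mean(),
    }

# Toy data: 2 scenarios, K = 2 modes, 2 steps, ground truth at the origin.
gt = np.zeros((2, 2, 2))
offsets = np.array([[[0.0, 1.0], [3.0, 4.0]],   # scenario 0: errors 1 m and 5 m
                    [[3.0, 4.0], [0.0, 3.0]]])  # scenario 1: errors 5 m and 3 m
preds = gt[:, None] + offsets[:, :, None, :]    # constant offset at every step
probs = np.array([[0.5, 0.5], [0.2, 0.8]])

metrics = multimodal_metrics(preds, gt, probs)
# minADE = minFDE = (1 + 3) / 2 = 2.0, MR = 0.5 (scenario 1 misses by 3 m > 2 m),
# brier-minFDE = ((1 + 0.25) + (3 + 0.04)) / 2 = 2.145
```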