Beyond attention-based methods

What if we express Linear Attention as a time-dependent process?

This concept was covered in Efficient FoMos: linear attention is a discrete recurrence.
$S_{t} = S_{t - 1} + k_{t}^{\top} v_{t}$
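To make the recurrence concrete, here is a minimal NumPy sketch. It assumes unnormalized causal linear attention with no feature map (a simplification), and the function names are made up for illustration. It checks that the step-by-step state update gives the same output as the masked quadratic form.

```python
import numpy as np

def linear_attention_recurrent(Q, K, V):
    """Causal linear attention, one step at a time: S_t = S_{t-1} + k_t^T v_t."""
    L, d_k = K.shape
    d_v = V.shape[1]
    S = np.zeros((d_k, d_v))          # running state, constant size
    out = np.zeros((L, d_v))
    for t in range(L):
        S = S + np.outer(K[t], V[t])  # rank-1 update k_t^T v_t
        out[t] = Q[t] @ S             # read the state with the query
    return out

def linear_attention_parallel(Q, K, V):
    """Same result from the masked (causal) quadratic form."""
    L = Q.shape[0]
    mask = np.tril(np.ones((L, L)))   # position t only sees positions <= t
    return ((Q @ K.T) * mask) @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 3)) for _ in range(3))
print(np.allclose(linear_attention_recurrent(Q, K, V),
                  linear_attention_parallel(Q, K, V)))
```

The recurrent form only ever stores $S$, whose size does not depend on the sequence length.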

I see this revisits the Change-of-Variable Formula (Jacobian Matrix) concept from Normalizing Flows. Basically, we can rewrite the attention recurrence as an ODE, an equation that relates a function and its derivatives with respect to a single independent variable.

$S_{t} - S_{t - 1} \approx \frac{d S}{d t}$ (taking the step from $t - 1$ to $t$ as one unit of time)

I believe $S$ should be $x$ from the following state-space formulation?

Linear State-Space Layer (LSSL)

$\dot{x} (t) = A x (t) + B u (t)$

$u$ is the continuous input at time $t$

$B$ controls how input enters — Input Matrix.

$B \in \mathbb{R}^{N \times M}$ ( $N$ states $\times$ $M$ inputs)

$x$ is the continuous evolving state at time $t$

$A$ controls how the state is updated — State/System Matrix.

$A \in \mathbb{R}^{N \times N}$ ( $N$ states $\times$ $N$ states)

$y (t) = C x (t) + D u (t)$

$C$ determines which combinations of the states make up the measured outputs — Output Matrix

$C \in \mathbb{R}^{P \times N}$ ( $P$ outputs $\times$ $N$ states)

$D$ directly maps the external inputs to the outputs without passing through the system states — Feedthrough/Feedforward Matrix

$D \in \mathbb{R}^{P \times M}$ ( $P$ outputs $\times$ $M$ inputs)

LSSLs are recurrent: if a discrete step-size $Δ t$ is specified, the LSSL can be discretized into a linear recurrence using standard techniques, and simulated during inference as a stateful recurrent model with constant memory and computation per time step.
LSSLs are convolutional: the linear time-invariant systems defined by the equations above are known to be explicitly representable as a continuous convolution. Moreover, the discrete-time version can be parallelized during training using convolutions.
LSSLs are continuous-time: the LSSL itself is a differential equation. As such, it can perform unique applications of continuous-time models, such as simulating continuous processes, handling missing data, and adapting to different timescales.

However, in the real world our inputs are sampled data, not continuous observations. We need to turn the discrete samples into a continuous signal before the ODE can integrate them.

The zero-order hold (ZOH) converts discrete inputs to a continuous signal (a staircase).

The idea is to hold the most recent sample until the next one arrives.

$a \leq t < b \to u (t) = u (a)$ , where $a$ and $b$ are consecutive sample times from $T, 2 T, 3 T, \dots$

This implies, however, that the signal is constant between samples.
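The hold rule can be sketched in a few lines of plain Python. The function name `zoh` and the sample values are made up for illustration, and the sketch assumes the first sample is taken at $t = 0$.

```python
import math

def zoh(samples, T, t):
    """Zero-order hold: return the most recent sample at continuous time t >= 0."""
    # Index of the last sample taken at or before t; past the end, keep holding the last one.
    k = min(int(math.floor(t / T)), len(samples) - 1)
    return samples[k]

samples = [0.0, 1.0, 4.0, 9.0]   # u(0), u(T), u(2T), u(3T)
T = 0.5
for t in (0.0, 0.49, 0.5, 1.2):
    print(t, zoh(samples, T, t))
```

Every time inside one sampling interval maps to the same value, which is exactly the staircase.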

Over one hold interval of length $Δ$, the state evolves by the factor $e^{A Δ}$

$A$ is learnable

$Δ$ is our step size between $t$ and $t + 1$

We denote $\bar{A} = e^{A Δ}$ and $\bar{B} = A^{- 1} (e^{A Δ} - I) B$ . Writing the discretized state as $h_{t}$ , the new state space becomes

$h_{t} = \bar{A} h_{t - 1} + \bar{B} u_{t}$

with $y_{t} = C h_{t}$
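Here is a minimal sketch of the ZOH discretization followed by the recurrence. To avoid a matrix exponential it assumes a diagonal $A$ with nonzero entries, which is a simplification; the helper names and the numbers are hypothetical.

```python
import numpy as np

def discretize_zoh_diag(a, B, delta):
    """ZOH discretization for a diagonal A = diag(a) with nonzero entries.

    A_bar = exp(A * delta), B_bar = A^{-1} (exp(A * delta) - I) B.
    """
    a_bar = np.exp(a * delta)                 # diagonal of A_bar
    B_bar = ((a_bar - 1.0) / a)[:, None] * B  # elementwise, valid because A is diagonal
    return a_bar, B_bar

def run_recurrence(a_bar, B_bar, C, u):
    """h_t = A_bar h_{t-1} + B_bar u_t, y_t = C h_t, starting from h = 0."""
    h = np.zeros(a_bar.shape[0])
    ys = []
    for u_t in u:
        h = a_bar * h + B_bar[:, 0] * u_t     # single input channel (M = 1)
        ys.append(C @ h)
    return np.array(ys)

a = np.array([-1.0, -2.0])     # hypothetical stable diagonal A (N = 2)
B = np.array([[1.0], [1.0]])   # N x M with M = 1
C = np.array([1.0, 1.0])       # single output (P = 1), stored as a vector
a_bar, B_bar = discretize_zoh_diag(a, B, delta=0.1)
y = run_recurrence(a_bar, B_bar, C, u=np.ones(200))
print(y[0], y[-1])
```

With a constant input of 1 the continuous system settles at $x = -A^{-1} B$, so the last output should approach $1 + 0.5 = 1.5$, which is a quick sanity check on the formulas.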

Unrolling the recurrence from a zero initial state, as in Efficiently Modeling Long Sequences with Structured State Spaces, gives the next formulation for $y$ (with inputs $u_{k}$ as in the recurrence):

$y_{t} = C \bar{A}^{t} \bar{B} u_{0} + C \bar{A}^{t - 1} \bar{B} u_{1} + \dots + C \bar{A} \bar{B} u_{t - 1} + C \bar{B} u_{t}$

State updates are primarily influenced by $\bar{A}$ . Treating $\bar{A}$ as a scalar for intuition (for a matrix, read this as the magnitude of its eigenvalues):

- $\bar{A} > 1$ : $\bar{A}^{t} \gg \bar{A}$ , the weight grows with $t$, biasing the output towards the start of the sequence
- $\bar{A} < 1$ : $\bar{A}^{t} \ll \bar{A}$ , the weight shrinks with $t$, so early inputs are forgotten
- $\bar{A} = 1$ : $\bar{A}^{t} = \bar{A}$ , every input is weighted equally (no selectivity)
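A quick numeric check of the three regimes, using scalar stand-ins for $\bar{A}$ (the values 1.1, 0.9 and 1.0 are arbitrary picks for illustration):

```python
def weight(a_bar, k):
    """Weight that the unrolled sum puts on an input seen k steps ago (scalar case)."""
    return a_bar ** k

for a_bar in (1.1, 0.9, 1.0):
    # Compare the weight on the current input (k = 0) with older ones.
    print(a_bar, [f"{weight(a_bar, k):.3g}" for k in (0, 10, 50)])
```

After 50 steps the first case has blown up by more than two orders of magnitude and the second has shrunk by about as much, while the third never changes.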

$\overset{ˉ}{A}$ should decay in a structured way based on the input

HiPPO -- High-order Polynomial Projection Operators

HiPPO approximates the input history $f (t)$ with Legendre polynomials, where each polynomial order captures a different level of detail

$f (t)$ w/ $N = 1$ ⇒ $P_{0} (t)$ — flat which captures the mean

$f (t)$ w/ $N = 2$ ⇒ $P_{1} (t)$ — linear captures the slope

$f (t)$ w/ $N = 3$ ⇒ $P_{2} (t)$ — quadratic captures the curve

$f (t)$ w/ $N = 4$ ⇒ $P_{3} (t)$ — cubic captures asymmetry

So we rewrite $f (t)$ as:
$f (t) \approx c_{0} P_{0} (t) + c_{1} P_{1} (t) + \dots + c_{N - 1} P_{N - 1} (t)$
where the $N$ coefficients $c_{0}, c_{1}, \dots, c_{N - 1}$ capture how much of each polynomial’s shape is present in the signal.

Low-order coefficients ( $c_{0}, c_{1}, \dots$ ) capture the big picture (long-range)

High-order coefficients capture fine details (short-range)
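As a rough illustration of what the coefficients mean, the sketch below fits Legendre coefficients to a toy signal with NumPy's `legfit`. This is an offline least-squares fit on $[-1, 1]$, not HiPPO's online update, and the signal is made up.

```python
import numpy as np
from numpy.polynomial import legendre

t = np.linspace(-1.0, 1.0, 201)      # Legendre polynomials are defined on [-1, 1]
f = 2.0 + 3.0 * t                    # hypothetical signal: mean 2, slope 3
c = legendre.legfit(t, f, deg=3)     # least-squares coefficients c_0..c_3 (N = 4)
f_hat = legendre.legval(t, c)        # rebuild the signal from the coefficients
print(np.round(c, 6))
```

Because $P_{0} (t) = 1$ and $P_{1} (t) = t$, the fit should put the mean in $c_{0}$ and the slope in $c_{1}$, and leave the quadratic and cubic coefficients at roughly zero.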

$[c_{0}, c_{1}, \dots, c_{N - 1}]$ are updated as new inputs arrive

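The HiPPO matrix specifies the continuous-time $A$ , which is then discretized into $\bar{A}$ :

$A_{nk} = \begin{cases} - (2 n + 1)^{1/2} (2 k + 1)^{1/2} & \text{when } n > k \text{ (below diagonal)} \\ - (n + 1) & \text{when } n = k \text{ (on the diagonal)} \\ 0 & \text{when } n < k \text{ (above diagonal)} \end{cases}$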
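The three cases translate directly into code. This is a plain, unoptimized sketch with 0-based indices $n$ and $k$; the function name `hippo_legs` is my own.

```python
import numpy as np

def hippo_legs(N):
    """Build the N x N HiPPO matrix from the three-case definition above."""
    A = np.zeros((N, N))
    for n in range(N):
        for k in range(N):
            if n > k:                                   # below the diagonal
                A[n, k] = -np.sqrt(2 * n + 1) * np.sqrt(2 * k + 1)
            elif n == k:                                # on the diagonal
                A[n, k] = -(n + 1)
            # n < k (above the diagonal) stays 0
    return A

A = hippo_legs(4)
print(np.round(A, 3))
```

The result is lower triangular with a negative diagonal, so all eigenvalues are negative and the continuous-time state decays rather than blowing up.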

Structured State Space Sequence Models (S4)

When we revisit $y_{t} = C \bar{A}^{t} \bar{B} u_{0} + C \bar{A}^{t - 1} \bar{B} u_{1} + \dots + C \bar{A} \bar{B} u_{t - 1} + C \bar{B} u_{t}$ (writing the inputs as $u_{k}$ to match the recurrence), we see many consecutive matrix multiplications that are input-independent. We can precompute them by defining

$K = (C \bar{B}, C \bar{A} \bar{B}, \dots, C \bar{A}^{t - 1} \bar{B}, C \bar{A}^{t} \bar{B})$

$U = (u_{0}, u_{1}, \dots, u_{t - 1}, u_{t})$

Therefore, $y = K * U$ , i.e. $y_{t} = \sum_{k = 0}^{t} K_{t - k} u_{k}$ , is a convolution with kernel $K$ ⇒ parallelizable training

$y_{t} = C h_{t}$ is a recurrent process ⇒ efficient inference.
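The two views can be checked against each other numerically. The sketch below assumes a single-input, single-output SSM with random, already-discretized matrices (all values hypothetical) and builds the kernel with a naive loop. S4 itself computes the kernel far more efficiently; this is only meant to show the equivalence.

```python
import numpy as np

def ssm_kernel(A_bar, B_bar, C, L):
    """K = (C B_bar, C A_bar B_bar, ..., C A_bar^{L-1} B_bar)."""
    K = np.zeros(L)
    x = B_bar.copy()               # holds A_bar^k B_bar
    for k in range(L):
        K[k] = C @ x
        x = A_bar @ x
    return K

def ssm_conv(K, u):
    """y_t = sum_k K[t-k] u[k]: causal convolution of the input with the kernel."""
    return np.convolve(u, K)[: len(u)]

def ssm_recurrent(A_bar, B_bar, C, u):
    """h_t = A_bar h_{t-1} + B_bar u_t, y_t = C h_t, starting from h = 0."""
    h = np.zeros(A_bar.shape[0])
    ys = []
    for u_t in u:
        h = A_bar @ h + B_bar * u_t
        ys.append(C @ h)
    return np.array(ys)

rng = np.random.default_rng(0)
N, L = 4, 16
A_bar = 0.9 * np.eye(N) + 0.01 * rng.normal(size=(N, N))  # hypothetical discretized matrices
B_bar = rng.normal(size=N)
C = rng.normal(size=N)
u = rng.normal(size=L)
y_conv = ssm_conv(ssm_kernel(A_bar, B_bar, C, L), u)
y_rec = ssm_recurrent(A_bar, B_bar, C, u)
print(np.allclose(y_conv, y_rec))
```

The kernel depends only on the fixed matrices, so it can be computed once and applied to the whole sequence in parallel during training, while the recurrent form needs only the current state at inference time.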

So far there is no selectivity over which inputs to remember: $A$, $B$ and $C$ are the same for every token. That’s where Mamba: Linear-Time Sequence Modeling with Selective State Spaces introduced selectivity.

The left graph demonstrates Mamba’s superiority in inference throughput compared to standard transformers of comparable sizes. As the batch size increases, standard transformers quickly run out of memory because the memory required for their KV cache grows significantly. Mamba uses a constant-size hidden state instead of a growing cache, allowing it to process much larger batch sizes rapidly without memory exhaustion.

The second graph highlights Mamba’s computational efficiency with long sequences. It compares the execution time of Mamba’s core scan operation against standard attention mechanisms as sequence length increases. Standard FlashAttention-2 scales quadratically, meaning processing time explodes for long texts. The authors’ custom hardware-optimized scan operation scales linearly, allowing Mamba to handle massive context windows of up to 512k tokens significantly faster and without the out-of-memory errors that limit other methods.

| | SSMs (S6) | SSMs (S4) | Transformers |
|---|---|---|---|
| Speed | $O (L)$ linear | $O (L \log L)$ | $O (L^{2})$ quadratic |
| Memory | $O (L)$ linear | $O (L)$ linear | $O (L^{2})$ quadratic |
| Context-aware | Yes (selective) | No: all tokens are processed equally with fixed $A$, $B$, $C$ | Yes: softmax over $Q$, $K$ |

Still need to understand how they go from S4 to S6 and the selective attention process.

🚀 Costin Chitic
