Flash Attention

Source: paper, Steven.

FlashAttention is a memory-efficient exact-attention algorithm that fuses the whole attention computation into a single tiled CUDA kernel, avoiding ever materializing the full $N \times N$ attention matrix.

It’s not a different attention mechanism! It’s the same softmax attention with the same exact output; the $N \times N$ matrix just never gets written to slow GPU DRAM (HBM), thanks to tiling plus an online, numerically stable softmax.
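A sketch of the online-softmax recurrence for a single query row may help here. The notation is an assumption of this note, not taken from the source: $s^{(j)}_k$ is the score against the $k$-th key of tile $j$, $v^{(j)}_k$ the matching value row, and $m$, $\ell$, $o$ are the running max, normalizer, and unnormalized output, starting from $m^{(0)} = -\infty$, $\ell^{(0)} = 0$, $o^{(0)} = 0$.

$$
\begin{aligned}
m^{(j)} &= \max\Big(m^{(j-1)},\ \max_k s^{(j)}_k\Big) \\
\ell^{(j)} &= e^{m^{(j-1)} - m^{(j)}}\,\ell^{(j-1)} + \sum_k e^{s^{(j)}_k - m^{(j)}} \\
o^{(j)} &= e^{m^{(j-1)} - m^{(j)}}\,o^{(j-1)} + \sum_k e^{s^{(j)}_k - m^{(j)}}\,v^{(j)}_k
\end{aligned}
$$

After the last tile $T$, the attention output for that row is $o^{(T)} / \ell^{(T)}$, which equals the ordinary softmax-weighted sum because every earlier term has been rescaled to the final max.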

The naive implementation from transformers computes $S = Q K^{T}$ of shape $N \times N$, writes it to DRAM, reads it back to apply the softmax, writes the result again, then reads it once more to multiply with $V$. Covered in Efficient FoMos. For long contexts this is both quadratic in memory and bound by memory bandwidth rather than compute.
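For reference, the naive path can be sketched in NumPy as below. This is only an illustration of where the two $N \times N$ intermediates appear, not how any particular library implements it; the $1/\sqrt{d}$ scaling is the usual convention and is an addition to the formula above, and `score_matrix_bytes` is a hypothetical helper for back-of-the-envelope sizing.

```python
import numpy as np

def naive_attention(Q, K, V):
    """Softmax attention that materializes the full N x N matrices."""
    d = Q.shape[-1]
    S = (Q @ K.T) / np.sqrt(d)               # N x N scores, fully materialized
    S = S - S.max(axis=-1, keepdims=True)    # subtract row max for numerical stability
    P = np.exp(S)
    P = P / P.sum(axis=-1, keepdims=True)    # N x N probabilities, also materialized
    return P @ V

def score_matrix_bytes(n, bytes_per_element=2):
    """Size of one N x N matrix (hypothetical helper; default assumes fp16)."""
    return n * n * bytes_per_element
```

For example, `score_matrix_bytes(8192)` is 134217728 bytes, i.e. 128 MiB for a single head in fp16, before counting the second $N \times N$ matrix of probabilities.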

The process

Load blocks of $Q, K, V$ from DRAM to SRAM one tile at a time.
Compute the tile’s partial attention in SRAM.
Use an online softmax that updates the running max and normalizer as new tiles arrive and rescales the partial output to match, so a separate global pass over the full row isn’t needed.
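The steps above can be sketched in plain NumPy. This is a CPU illustration of the tiling and online-softmax logic only, not the fused CUDA kernel: there is no real SRAM here, the block sizes `block_q` and `block_kv` are arbitrary assumptions, and the $1/\sqrt{d}$ scaling is the usual convention rather than something stated in this note.

```python
import numpy as np

def tiled_attention(Q, K, V, block_q=32, block_kv=32):
    """Exact softmax attention computed tile by tile with an online softmax."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.empty((n, V.shape[1]))
    for qs in range(0, n, block_q):
        Q_blk = Q[qs:qs + block_q]                    # "load" one Q tile
        rows = Q_blk.shape[0]
        m = np.full((rows, 1), -np.inf)               # running row max
        l = np.zeros((rows, 1))                       # running normalizer
        acc = np.zeros((rows, V.shape[1]))            # unnormalized output
        for ks in range(0, K.shape[0], block_kv):
            K_blk = K[ks:ks + block_kv]               # "load" one K/V tile
            V_blk = V[ks:ks + block_kv]
            S_blk = (Q_blk @ K_blk.T) * scale         # only a tile of scores exists
            m_new = np.maximum(m, S_blk.max(axis=1, keepdims=True))
            correction = np.exp(m - m_new)            # rescale earlier partial results
            P_blk = np.exp(S_blk - m_new)
            l = l * correction + P_blk.sum(axis=1, keepdims=True)
            acc = acc * correction + P_blk @ V_blk
            m = m_new
        out[qs:qs + block_q] = acc / l                # normalize once per Q tile
    return out
```

The largest score buffer ever held is `block_q` by `block_kv`, independent of $N$, and the result should match the naive computation up to floating-point rounding.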

Net effect: $O(N)$ extra memory instead of $O(N^{2})$, plus a 2-4$\times$ wall-clock speedup on long sequences.

🚀 Costin Chitic
