Context = the input. Autoregressive decoder = generates text one token at a time, each new token conditioned on the context plus everything generated so far.
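A minimal sketch of that generation loop, assuming a hypothetical `model` that returns next-token logits of shape (batch, seq, vocab); this is greedy decoding, not any specific library's API:

```python
import torch

def generate(model, input_ids, max_new_tokens=20):
    # input_ids: (1, context_len) tensor of token ids = the context
    for _ in range(max_new_tokens):
        logits = model(input_ids)               # (1, seq_len, vocab_size)
        next_id = logits[:, -1, :].argmax(-1)   # greedy pick of the next token
        input_ids = torch.cat([input_ids, next_id.unsqueeze(1)], dim=1)  # feed it back in
    return input_ids
```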

Self-attention = learn correspondences within our own data: which tokens relate to which, i.e. what dynamics are present in the sequence?

Dot-product attention: each row of the attention matrix scores one query against every key, separating strong correlations from weak ones.
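A minimal sketch of (scaled) dot-product attention; the names Q, K, V and the sqrt(d_k) scaling follow the standard formulation, not anything taken from these notes:

```python
import torch
import torch.nn.functional as F

def dot_product_attention(Q, K, V):
    # Q, K, V: (batch, seq, d_k); each row of `scores` is one query against all keys
    scores = Q @ K.transpose(-2, -1) / K.shape[-1] ** 0.5   # (batch, seq, seq)
    weights = F.softmax(scores, dim=-1)   # per-row softmax: strong correlations dominate weak ones
    return weights @ V                    # weighted sum of the values
```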

Position matters in the matrix! Attention by itself is permutation-invariant, so position has to be injected explicitly.

Encoder: input + positional information

How do we add it? Sinusoidal embedding intuition:
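A sketch of the standard sinusoidal encoding (assumes an even d_model); the result is added elementwise to the token embeddings:

```python
import math
import torch

def sinusoidal_encoding(seq_len, d_model):
    # pe[pos, 2i] = sin(pos / 10000^(2i/d_model)), pe[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)                   # (seq_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe   # encoder input = token embeddings + pe
```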

Tensor shape convention: S = sequence length (tokens), batch = number of input sequences, channels = embedding dimension (d_model).
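For reference, a tiny sketch of that layout using PyTorch's built-in encoder layer, which by default expects (seq, batch, channels); the sizes here are arbitrary:

```python
import torch
import torch.nn as nn

S, batch, d_model = 10, 4, 32                    # sequence length, batch size, embedding channels
x = torch.randn(S, batch, d_model)               # (S, batch, channels) layout
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4)   # batch_first=False by default
out = layer(x)                                   # same shape: (S, batch, d_model)
```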

CLS token for classification: a special token prepended to the sequence; its output embedding summarizes the sequence and feeds the classifier.
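A minimal sketch of that idea, using a learnable CLS embedding prepended to the inputs; the module name, layer counts, and sizes are illustrative assumptions, not a specific model:

```python
import torch
import torch.nn as nn

class ClsClassifier(nn.Module):
    # Prepend a learnable [CLS] token; its final hidden state feeds the classification head.
    def __init__(self, d_model=32, nhead=4, num_classes=2):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, x):                         # x: (batch, seq, d_model) token embeddings
        cls = self.cls.expand(x.size(0), -1, -1)  # one CLS slot per batch element
        h = self.encoder(torch.cat([cls, x], dim=1))
        return self.head(h[:, 0])                 # classify from the CLS position
```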