context = the input sequence; autoregressive decoder = generates text one token at a time, each new token conditioned on the tokens produced so far
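A minimal sketch of the autoregressive loop, assuming a hypothetical `model` that maps token ids to next-token logits (greedy decoding, for illustration only):

```python
import torch

def generate(model, prompt_ids, max_new_tokens=20):
    """Greedy autoregressive decoding: predict the next token from the
    context so far, append it, repeat."""
    ids = prompt_ids.clone()                                      # (1, T) current context
    for _ in range(max_new_tokens):
        logits = model(ids)                                       # (1, T, vocab_size), assumed output
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # most likely next token
        ids = torch.cat([ids, next_id], dim=1)                    # context grows by one token
    return ids
```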
self-attention = learn correspondences within the data itself: which parts of the sequence relate to which, i.e. what dynamics the data has
dot-product attention: each row of the score matrix compares one query against every key, separating strong correlations from weak ones
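A sketch of scaled dot-product attention, assuming query/key/value tensors with matching last dimension d_k; each row of `scores` is one query's correlation with every key:

```python
import torch
import torch.nn.functional as F

def dot_product_attention(Q, K, V):
    """Row i of the score matrix holds query i's dot products with all keys;
    softmax turns strong correlations into large weights, weak ones into near-zero."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # (..., S_q, S_k) correlation matrix
    weights = F.softmax(scores, dim=-1)             # normalize each row
    return weights @ V                              # weighted sum of values
```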
Position matters in the attention matrix! Dot-product attention by itself ignores token order.
Encoder input = token embeddings + positional information
How to add it? Sinusoidal embedding intuition (see the sketch below)
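A sketch of the standard sinusoidal positional encoding (even embedding dimension assumed): each position gets a unique pattern of sines and cosines at geometrically spaced frequencies, added to the token embeddings.

```python
import math
import torch

def sinusoidal_embedding(seq_len, dim):
    """Sinusoidal positional encoding; dim assumed even."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)                  # (S, 1) positions
    freq = torch.exp(-math.log(10000.0) * torch.arange(0, dim, 2).float() / dim)   # (dim/2,) frequencies
    pe = torch.zeros(seq_len, dim)
    pe[:, 0::2] = torch.sin(pos * freq)   # even channels: sine
    pe[:, 1::2] = torch.cos(pos * freq)   # odd channels: cosine
    return pe                             # (S, dim), added to the input embeddings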

Tensor layout (S, batch, channels): S = sequence length in tokens, batch = number of input sequences, channels = embedding dimension
CLS token for classification: a learned token prepended to the sequence; its final representation summarizes the whole input
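A sketch tying the two points together, assuming a hypothetical `encoder` module that maps (S, batch, channels) to (S, batch, channels): a learned [CLS] token is prepended, and its output feeds the classification head.

```python
import torch
import torch.nn as nn

class ClsClassifier(nn.Module):
    """Prepend a learned [CLS] token; classify from its encoded representation."""
    def __init__(self, encoder, channels, num_classes):
        super().__init__()
        self.encoder = encoder                                 # assumed (S, B, C) -> (S, B, C) encoder
        self.cls = nn.Parameter(torch.zeros(1, 1, channels))   # learned [CLS] embedding
        self.head = nn.Linear(channels, num_classes)

    def forward(self, x):                            # x: (S, B, C) token embeddings (+ positions)
        cls = self.cls.expand(-1, x.size(1), -1)     # (1, B, C), one [CLS] per batch element
        x = torch.cat([cls, x], dim=0)               # (S+1, B, C)
        h = self.encoder(x)                          # (S+1, B, C)
        return self.head(h[0])                       # classify from the [CLS] position: (B, num_classes)
```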