Attention
A mechanism that lets one position decide how much other positions should influence it.
Attention Is All You Need · 2017
How to read the compact notation used in the original Transformer paper and later ones.
The original Transformer paper, Attention Is All You Need, and many later papers often write attention in one compact line:
\[\mathrm{Attention}(Q,K,V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V\]
This notation is compact, but it hides the row-by-row logic that matters for implementation.
On this page the first displayed formula keeps the paper’s plain Q, K, and V. Elsewhere in this web spec, the same matrices are written as $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$ to match the main TeX handout.
This page is about reading that paper-style notation confidently: knowing what the rows of $Q$, $K$, and $V$ represent and what the entries of $QK^\top$ are doing. Here is the direct translation:
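One way to make that translation concrete is a minimal NumPy sketch of the one-line formula. The function and variable names, shapes, and sample sizes here are illustrative, not taken from the paper:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Q: (n, d_k) queries, K: (m, d_k) keys, V: (m, d_v) values.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n, m): one row of similarities per query
    weights = softmax(scores, axis=-1)  # each row is a distribution summing to 1
    return weights @ V                  # (n, d_v): weighted average of value rows

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(6, 8))
V = rng.normal(size=(6, 16))
out = attention(Q, K, V)
print(out.shape)  # (4, 16)
```

Note that the softmax runs over the last axis, so each output row is a convex combination of the rows of `V`.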
In implementation terms:
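To see those implementation terms without matrix shorthand, the product can be unrolled into the per-query loop it abbreviates. This is a sketch; the loop form is mathematically equivalent to the vectorized formula, just slower:

```python
import numpy as np

def attention_loop(Q, K, V):
    # Computes softmax(Q K^T / sqrt(d_k)) V one query row at a time.
    n, d_k = Q.shape
    out = np.zeros((n, V.shape[1]))
    for i in range(n):                    # one query position per iteration
        scores = K @ Q[i] / np.sqrt(d_k)  # similarity of query i to every key
        scores -= scores.max()            # stabilize the softmax
        w = np.exp(scores)
        w /= w.sum()                      # attention weights for row i; they sum to 1
        out[i] = w @ V                    # weighted average of the value rows
    return out

rng = np.random.default_rng(1)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(5, 4))
V = rng.normal(size=(5, 2))
print(attention_loop(Q, K, V).shape)  # (3, 2)
```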
The compact formula hides these matrix meanings:
If you are comfortable with the one-line formula but keep forgetting the row direction, remember this:
every row is one token asking, “which earlier rows matter for me?”
What papers often omit in the main equation:
That is why a teaching spec is more verbose than the paper formula: the loops need the hidden details made explicit.
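As an illustration of those hidden details, the sketch below spells out four things the one-line formula leaves implicit: where the $1/\sqrt{d_k}$ scale applies, how masking enters, the stabilizing max subtraction, and the per-row normalization axis. The masking convention (a boolean mask, with `False` meaning "forbidden") is one common choice, not the only one:

```python
import numpy as np

def attention_explicit(Q, K, V, mask=None):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # scale applies BEFORE the softmax
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)  # masked positions get zero weight
    scores -= scores.max(axis=-1, keepdims=True)  # stabilization, absent from the paper equation
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)            # normalize per ROW, not over the whole matrix
    return w @ V

n, d = 5, 8
rng = np.random.default_rng(0)
x = rng.normal(size=(n, d))
causal = np.tril(np.ones((n, n), dtype=bool))     # position i may look at positions 0..i only
out = attention_explicit(x, x, x, mask=causal)
print(out.shape)  # (5, 8)
```

With the causal mask, row 0 can attend only to itself, so its output equals its own value row exactly.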
The compact notation expands into these shapes:
Then:
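A shape walkthrough makes the expansion checkable. The sizes below are illustrative; the assertions trace how each product changes the shape:

```python
import numpy as np

# n = number of query positions, m = number of key/value positions.
n, m, d_k, d_v = 4, 6, 8, 16
rng = np.random.default_rng(0)
Q = rng.normal(size=(n, d_k))   # one query row per attending position
K = rng.normal(size=(m, d_k))   # one key row per attended-to position
V = rng.normal(size=(m, d_v))   # one value row per attended-to position

scores = Q @ K.T / np.sqrt(d_k)        # (n, d_k) @ (d_k, m) -> (n, m)
assert scores.shape == (n, m)
scores -= scores.max(axis=-1, keepdims=True)
w = np.exp(scores)
w /= w.sum(axis=-1, keepdims=True)     # softmax over the last axis keeps (n, m)
out = w @ V                            # (n, m) @ (m, d_v) -> (n, d_v)
assert out.shape == (n, d_v)
print(out.shape)  # (4, 16)
```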
When you read later Transformer papers, you may also see:
a leading batch dimension, so tensors have shape $(B, n, d)$ instead of $(n, d)$, or explicit $\sum$ symbols in place of the matrix products. Those are extensions of the same computation, not fundamentally different algorithms.
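The batched variant can be sketched with `np.einsum`, whose subscripts read like the explicit $\sum$ notation. The function name and shapes here are assumptions for illustration:

```python
import numpy as np

def batched_attention(Q, K, V):
    # Q, K: (B, n, d_k) and (B, m, d_k); V: (B, m, d_v). Same math, one extra axis.
    d_k = Q.shape[-1]
    # 'bnd,bmd->bnm' is the batched form of Q @ K.T:
    # scores[b, i, j] = sum_d Q[b, i, d] * K[b, j, d]
    scores = np.einsum('bnd,bmd->bnm', Q, K) / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    # Explicit-sum form of the output: out[b, i] = sum_j w[b, i, j] * V[b, j]
    return np.einsum('bnm,bmd->bnd', w, V)

rng = np.random.default_rng(3)
Q = rng.normal(size=(2, 4, 8))
K = rng.normal(size=(2, 6, 8))
V = rng.normal(size=(2, 6, 8))
print(batched_attention(Q, K, V).shape)  # (2, 4, 8)
```

Each batch element is processed independently, so slicing out one element and running the unbatched formula gives the same answer.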