Reading Transformer Paper Notation

How to read the compact notation used in the original Transformer paper and later ones.

Also called: attention equation, $QK^\top$ notation

The original Transformer paper, Attention Is All You Need, and many later papers often write attention in one compact line:

\[\mathrm{Attention}(Q,K,V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V\]

This notation is compact, but it hides the row-by-row logic that matters for implementation.
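The one-line formula maps almost directly onto array code. Here is a minimal NumPy sketch (the function name and the random test data are illustrative, not from the paper):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention, mirroring softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n, n) score matrix
    # numerically stable softmax over the last axis, i.e. row-wise
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                               # (n, d) weighted rows of V

rng = np.random.default_rng(0)
n, d = 4, 8
Q, K, V = rng.normal(size=(3, n, d))
out = attention(Q, K, V)
print(out.shape)  # (4, 8)
```

One sanity check: if all scores are equal (e.g. `K` is all zeros), every row of the output is just the mean of the rows of `V`, because the softmax weights become uniform.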

On this page the first displayed formula keeps the paper’s plain Q, K, and V. Elsewhere in this web spec, the same matrices are written as $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$ to match the main TeX handout.

This page is about reading that paper-style notation confidently: translating each matrix, and each step of the formula, directly into implementation terms.

What the compact formula is really compressing

The compact formula hides these matrix meanings:

  • $\mathbf{Q}\mathbf{K}^\top$ is the full score matrix
  • $\mathrm{softmax}(\cdot)$ is row-wise
  • multiplying by $\mathbf{V}$ means “take weighted combinations of the value vectors”
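The row-wise point in the list above is easy to verify numerically: after a softmax over the last axis, every row is a probability distribution that sums to 1 (a small sketch with made-up scores):

```python
import numpy as np

S = np.array([[2.0, 0.0, -1.0],
              [0.5, 0.5,  0.5]])

# Row-wise softmax: normalise across axis=-1, one distribution per query row
A = np.exp(S - S.max(axis=-1, keepdims=True))
A /= A.sum(axis=-1, keepdims=True)

print(A.sum(axis=-1))  # each row sums to 1.0
```

Note the second row: equal scores give equal weights, so that query mixes its value vectors uniformly.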

If you are comfortable with the one-line formula but keep forgetting the row direction, remember this:

every row is one token asking, “which earlier rows matter for me?”

The main equation also leaves implicit details that implementations need: that the softmax runs row-wise, whether a causal mask restricts which rows each token can attend to, and how batch and head dimensions are laid out.

That is why a teaching spec is more verbose than the paper formula: the loops need the hidden details made explicit.

How to read the one-line formula as row operations

The compact notation expands into these shapes:

  • $\mathbf{Q} \in \mathbb{R}^{n \times d}$ means each row is one query vector
  • $\mathbf{K} \in \mathbb{R}^{n \times d}$ means each row is one key vector
  • $\mathbf{V} \in \mathbb{R}^{n \times d}$ means each row is one value vector

Then:

  • $S = \mathbf{Q}\mathbf{K}^\top$ is the score matrix
  • $A = \mathrm{softmax}(S)$ is the row-wise weight matrix
  • $O = A\mathbf{V}$ is the output matrix
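The three steps above can be checked shape by shape (a NumPy sketch with arbitrary sizes, not library API):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 5, 4
Q, K, V = rng.normal(size=(3, n, d))

S = Q @ K.T / np.sqrt(d)                     # score matrix, shape (n, n)
A = np.exp(S - S.max(axis=-1, keepdims=True))
A /= A.sum(axis=-1, keepdims=True)           # row-wise weights, each row sums to 1
O = A @ V                                    # output matrix, shape (n, d)

print(S.shape, A.shape, O.shape)  # (5, 5) (5, 5) (5, 4)
```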

Why papers suddenly introduce extra letters

When you read later Transformer papers, you may also see:

  • batch dimensions like (B, n, d)
  • head indices for multi-head attention
  • compact summation notations that suppress explicit $\sum$ symbols

Those are extensions of the same computation, not fundamentally different algorithms.
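To see why they are the same computation, here is the earlier attention sketch extended with hypothetical batch and head axes `(B, h, n, d_h)` (the shape names and `einsum` subscripts are illustrative assumptions):

```python
import numpy as np

# Hypothetical batched, multi-head shapes: B batches, h heads, n tokens, d_h per-head dim
B, h, n, d_h = 2, 3, 5, 4
rng = np.random.default_rng(2)
Q, K, V = rng.normal(size=(3, B, h, n, d_h))

# Same computation as the single-matrix case; einsum contracts d_h per batch and head
S = np.einsum("bhqd,bhkd->bhqk", Q, K) / np.sqrt(d_h)   # (B, h, n, n)
A = np.exp(S - S.max(axis=-1, keepdims=True))
A /= A.sum(axis=-1, keepdims=True)                      # still row-wise per (b, h)
O = np.einsum("bhqk,bhkd->bhqd", A, V)                  # (B, h, n, d_h)

print(O.shape)  # (2, 3, 5, 4)
```

The extra letters only index independent copies of the same `(n, n)` score matrix; nothing about the row-by-row logic changes.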