Reading Transformer Paper Notation

How to read the compact notation used in the original Transformer paper and later ones.

Also called: attention equation, $QK^\top$ notation

The original Transformer paper, Attention Is All You Need, and many later papers often write attention in one compact line:

\[\mathrm{Attention}(Q,K,V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V\]

This notation is compact, but it hides the row-by-row logic that matters for implementation.
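The one-line formula maps almost directly onto array code. Here is a minimal NumPy sketch (the function name and the random test data are illustrative, not from the paper):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention, mirroring softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n, n) score matrix
    # numerically stable softmax over the last axis, i.e. row-wise
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                               # (n, d) weighted rows of V

rng = np.random.default_rng(0)
n, d = 4, 8
Q, K, V = rng.normal(size=(3, n, d))
out = attention(Q, K, V)
print(out.shape)  # (4, 8)
```

One sanity check: if all scores are equal (e.g. `K` is all zeros), every row of the output is just the mean of the rows of `V`, because the softmax weights become uniform.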

On this page the first displayed formula keeps the paper’s plain Q, K, and V. Elsewhere in this web spec, the same matrices are written as $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$ to match the main TeX handout.

This page is about reading that paper-style notation confidently: translating each matrix, and each step of the formula, directly into implementation terms.

What the compact formula is really compressing

The compact formula hides these matrix meanings:

  • $\mathbf{Q}\mathbf{K}^\top$ is the full score matrix
  • $\mathrm{softmax}(\cdot)$ is row-wise
  • multiplying by $\mathbf{V}$ means “take weighted combinations of the value vectors”
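The row-wise point in the list above is easy to verify numerically: after a softmax over the last axis, every row is a probability distribution that sums to 1 (a small sketch with made-up scores):

```python
import numpy as np

S = np.array([[2.0, 0.0, -1.0],
              [0.5, 0.5,  0.5]])

# Row-wise softmax: normalise across axis=-1, one distribution per query row
A = np.exp(S - S.max(axis=-1, keepdims=True))
A /= A.sum(axis=-1, keepdims=True)

print(A.sum(axis=-1))  # each row sums to 1.0
```

Note the second row: equal scores give equal weights, so that query mixes its value vectors uniformly.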

If you are comfortable with the one-line formula but keep forgetting the row direction, remember this:

every row is one token asking, “which earlier rows matter for me?”

The main equation also leaves implicit details that implementations need: that the softmax runs row-wise, whether a causal mask restricts which rows each token can attend to, and how batch and head dimensions are laid out.

That is why a teaching spec is more verbose than the paper formula: the loops need the hidden details made explicit.

How to read the one-line formula as row operations

The compact notation expands into these shapes:

  • $\mathbf{Q} \in \mathbb{R}^{n \times d}$ means each row is one query vector
  • $\mathbf{K} \in \mathbb{R}^{n \times d}$ means each row is one key vector
  • $\mathbf{V} \in \mathbb{R}^{n \times d}$ means each row is one value vector

Then:

  • $S = \mathbf{Q}\mathbf{K}^\top$ is the score matrix
  • $A = \mathrm{softmax}(S)$ is the row-wise weight matrix
  • $O = A\mathbf{V}$ is the output matrix
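The three steps above can be checked shape by shape (a NumPy sketch with arbitrary sizes, not library API):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 5, 4
Q, K, V = rng.normal(size=(3, n, d))

S = Q @ K.T / np.sqrt(d)                     # score matrix, shape (n, n)
A = np.exp(S - S.max(axis=-1, keepdims=True))
A /= A.sum(axis=-1, keepdims=True)           # row-wise weights, each row sums to 1
O = A @ V                                    # output matrix, shape (n, d)

print(S.shape, A.shape, O.shape)  # (5, 5) (5, 5) (5, 4)
```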

Why papers suddenly introduce extra letters

When you read later Transformer papers, you may also see:

  • batch dimensions like (B, n, d)
  • head indices for multi-head attention
  • compact summation notations that suppress explicit $\sum$ symbols

Those are extensions of the same computation, not fundamentally different algorithms.
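To see why they are the same computation, here is the earlier attention sketch extended with hypothetical batch and head axes `(B, h, n, d_h)` (the shape names and `einsum` subscripts are illustrative assumptions):

```python
import numpy as np

# Hypothetical batched, multi-head shapes: B batches, h heads, n tokens, d_h per-head dim
B, h, n, d_h = 2, 3, 5, 4
rng = np.random.default_rng(2)
Q, K, V = rng.normal(size=(3, B, h, n, d_h))

# Same computation as the single-matrix case; einsum contracts d_h per batch and head
S = np.einsum("bhqd,bhkd->bhqk", Q, K) / np.sqrt(d_h)   # (B, h, n, n)
A = np.exp(S - S.max(axis=-1, keepdims=True))
A /= A.sum(axis=-1, keepdims=True)                      # still row-wise per (b, h)
O = np.einsum("bhqk,bhkd->bhqd", A, V)                  # (B, h, n, d_h)

print(O.shape)  # (2, 3, 5, 4)
```

The extra letters only index independent copies of the same `(n, n)` score matrix; nothing about the row-by-row logic changes.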