Q, K, V Projections
How one vector is transformed into query, key, and value versions used by attention.
Attention Is All You Need · 2026 A1
Required concept
The score computation used to compare one token with another.
A dot product is “multiply matching components and add the results”.
If:
\[\vec{\mathbf{a}} = \begin{pmatrix}1 & 2\end{pmatrix}, \qquad \vec{\mathbf{b}} = \begin{pmatrix}3 & 4\end{pmatrix}\]
then:
\[\vec{\mathbf{a}} \bullet \vec{\mathbf{b}} = 1 \cdot 3 + 2 \cdot 4 = 11\]
In attention, the dot product compares the query vector at position i with the key vector at position j. The assignment uses the scaled dot product:
\[\text{score}_{i,j} = \frac{1}{\sqrt{d}} \sum_{t=0}^{d-1} Q_{i,t} K_{j,t}\]
The scale factor \(1/\sqrt{d}\) keeps the numbers in a more manageable range before softmax.
This is why Stage 3 is the “score matrix” stage: every allowed pair (i, j) gets one score, and masked pairs are stored as -INFINITY instead.
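A minimal sketch of that Stage 3 fill loop in C, assuming row-major `Q` and `K` matrices and a causal mask (the sizes `N` and `D` and the function name `compute_scores` are hypothetical, chosen just for this example):

```c
#include <math.h>

#define N 4  /* sequence length (hypothetical) */
#define D 2  /* vector dimension (hypothetical) */

/* Fill scores[i][j] with the scaled dot product of Q row i and K row j,
 * writing -INFINITY for masked (future) positions under a causal mask. */
void compute_scores(const float Q[N][D], const float K[N][D],
                    float scores[N][N]) {
    float scale = 1.0f / sqrtf((float)D);
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            if (j > i) {                 /* masked pair: j is in the future */
                scores[i][j] = -INFINITY;
                continue;
            }
            float dot = 0.0f;
            for (int t = 0; t < D; t++)  /* compare component by component */
                dot += Q[i][t] * K[j][t];
            scores[i][j] = dot * scale;  /* one score per allowed (i, j) */
        }
    }
}
```

Storing `-INFINITY` rather than skipping masked entries means the later softmax sends those weights to exactly zero.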
The longer spec also unpacked the indices:
- i is the current query position
- j is the key position being compared against
- t is the vector component index

So:
\[\sum_{t=0}^{d-1} Q_{i,t}K_{j,t}\]
just means “loop over the components, multiply matching entries, and add them”.
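Written out as code, that sum is a single loop; here is a sketch in C (the names `dot_qk`, `Q_i`, and `K_j` are made up for illustration):

```c
/* The sum over t: one query row dotted with one key row. */
float dot_qk(const float *Q_i, const float *K_j, int d) {
    float acc = 0.0f;
    for (int t = 0; t < d; t++)   /* loop over the components */
        acc += Q_i[t] * K_j[t];   /* multiply matching entries, add */
    return acc;
}
```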
The dot product acts like a similarity score:
- a large positive value means the two vectors point in a similar direction
- a value near zero means they are roughly unrelated (orthogonal)
- a negative value means they point in opposing directions

That is the intuition behind why Stage 3 scores later become attention weights.
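That intuition can be checked numerically; here is a tiny C sketch with made-up 2-D vectors (the helper `dot2` and all values are hypothetical, chosen only to show the three cases):

```c
/* Dot product of two 2-D vectors, used as a toy similarity score. */
float dot2(const float a[2], const float b[2]) {
    return a[0] * b[0] + a[1] * b[1];
}

/* With query q = (1, 0):
 *   dot2(q, ( 0.9, 0.1)) ->  0.9   similar direction, large score
 *   dot2(q, ( 0.0, 1.0)) ->  0.0   orthogonal, no similarity
 *   dot2(q, (-1.0, 0.0)) -> -1.0   opposite direction, negative score
 */
```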