Q, K, V Projections
How one vector is transformed into query, key, and value versions used by attention.
Attention Is All You Need · 2026 A1
Required concept
The score computation used to compare one token with another.
A dot product is “multiply matching components and add the results”.
If:
\[\vec{\mathbf{a}} = \begin{pmatrix}1 & 2\end{pmatrix}, \qquad \vec{\mathbf{b}} = \begin{pmatrix}3 & 4\end{pmatrix}\]
then:
\[\vec{\mathbf{a}} \bullet \vec{\mathbf{b}} = 1 \cdot 3 + 2 \cdot 4 = 11\]
In attention, the dot product compares the query vector at position i with the key vector at position j. The assignment uses the scaled dot product:
\[\text{score}_{i,j} = \frac{1}{\sqrt{d}} \sum_{t=0}^{d-1} Q_{i,t} K_{j,t}\]
The scale factor \(1/\sqrt{d}\) keeps the numbers in a more manageable range before softmax.
This is why Stage 3 is the “score matrix” stage: every allowed pair (i, j) gets one score, and masked pairs are stored as -INFINITY instead.
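A minimal sketch of that Stage 3 fill loop in C, assuming row-major `Q` and `K` matrices and a causal mask (the sizes `N` and `D` and the function name `compute_scores` are hypothetical, chosen just for this example):

```c
#include <math.h>

#define N 4  /* sequence length (hypothetical) */
#define D 2  /* vector dimension (hypothetical) */

/* Fill scores[i][j] with the scaled dot product of Q row i and K row j,
 * writing -INFINITY for masked (future) positions under a causal mask. */
void compute_scores(const float Q[N][D], const float K[N][D],
                    float scores[N][N]) {
    float scale = 1.0f / sqrtf((float)D);
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            if (j > i) {                 /* masked pair: j is in the future */
                scores[i][j] = -INFINITY;
                continue;
            }
            float dot = 0.0f;
            for (int t = 0; t < D; t++)  /* compare component by component */
                dot += Q[i][t] * K[j][t];
            scores[i][j] = dot * scale;  /* one score per allowed (i, j) */
        }
    }
}
```

Storing `-INFINITY` rather than skipping masked entries means the later softmax sends those weights to exactly zero.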
The longer spec also unpacked the indices:
- i is the current query position
- j is the key position being compared against
- t is the vector component index

So:
\[\sum_{t=0}^{d-1} Q_{i,t}K_{j,t}\]
just means “loop over the components, multiply matching entries, and add them”.
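Written out as code, that sum is a single loop; here is a sketch in C (the names `dot_qk`, `Q_i`, and `K_j` are made up for illustration):

```c
/* The sum over t: one query row dotted with one key row. */
float dot_qk(const float *Q_i, const float *K_j, int d) {
    float acc = 0.0f;
    for (int t = 0; t < d; t++)   /* loop over the components */
        acc += Q_i[t] * K_j[t];   /* multiply matching entries, add */
    return acc;
}
```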
The dot product acts like a similarity score:
- a large positive value means the two vectors point in a similar direction
- a value near zero means they are roughly unrelated (orthogonal)
- a negative value means they point in opposing directions

That is the intuition behind why Stage 3 scores later become attention weights.
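That intuition can be checked numerically; here is a tiny C sketch with made-up 2-D vectors (the helper `dot2` and all values are hypothetical, chosen only to show the three cases):

```c
/* Dot product of two 2-D vectors, used as a toy similarity score. */
float dot2(const float a[2], const float b[2]) {
    return a[0] * b[0] + a[1] * b[1];
}

/* With query q = (1, 0):
 *   dot2(q, ( 0.9, 0.1)) ->  0.9   similar direction, large score
 *   dot2(q, ( 0.0, 1.0)) ->  0.0   orthogonal, no similarity
 *   dot2(q, (-1.0, 0.0)) -> -1.0   opposite direction, negative score
 */
```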