
Q, K, and V Projections

Why the same embedding is remixed into query, key, and value vectors.

Also called: projection, QKV

Each token embedding row is turned into three related vectors: a query, a key, and a value.

The assignment calls this a projection because each new vector is produced by multiplying the original embedding row by a different learned weight matrix:

\[\mathbf{Q} = \mathbf{X}\mathbf{W_q},\qquad \mathbf{K} = \mathbf{X}\mathbf{W_k},\qquad \mathbf{V} = \mathbf{X}\mathbf{W_v}\]
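The three matrix products above can be sketched in a few lines of numpy. This is a minimal illustration, not a full attention layer: the sizes are arbitrary and the weight matrices are random here, whereas in a real model `W_q`, `W_k`, and `W_v` are learned.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, chosen only for illustration: 4 tokens, embedding dim 8.
n_tokens, d_model = 4, 8

X = rng.normal(size=(n_tokens, d_model))   # one embedding row per token
W_q = rng.normal(size=(d_model, d_model))  # learned in a real model
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

Q = X @ W_q  # each embedding row is remixed into a query row
K = X @ W_k  # ... into a key row
V = X @ W_v  # ... into a value row

print(Q.shape, K.shape, V.shape)  # (4, 8) (4, 8) (4, 8)
```

Note that all three products share the same input `X`; only the weight matrix differs, which is exactly the "remixing" the section describes.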
```mermaid
flowchart LR
    X[One embedding row x] --> WQ[Multiply by Wq]
    X --> WK[Multiply by Wk]
    X --> WV[Multiply by Wv]
    WQ --> Q[Query row]
    WK --> K[Key row]
    WV --> V[Value row]
```

One embedding row becomes Q, K, and V.

Here is one small worked example.

Input vector x (one token embedding row):

\[\mathbf{x} = (2, 3)\]

Weight matrix W (each output component uses one column of W):

\[\mathbf{W} = \begin{pmatrix} 1 & 10 \\ 0 & -1 \end{pmatrix}\]

Projected vector y:

\[\mathbf{y} = \mathbf{x}\mathbf{W} = (2, 17)\]

First output component: y_0 = 2·1 + 3·0 = 2. Take the first column of W, multiply it entry-by-entry against the input row, then add the results.

Second output component: y_1 = 2·10 + 3·(-1) = 17. Take the second column of W, multiply it entry-by-entry against the same input row, then add the results.

Figure: the input vector x = (2, 3) and the projected vector y = (2, 17), drawn from the origin on the same axes.

If you like geometric intuition, the matrix is taking weighted combinations of the original coordinates.
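The worked example can be checked directly with numpy; the row-vector-times-matrix convention matches the \(\mathbf{y} = \mathbf{x}\mathbf{W}\) form used above.

```python
import numpy as np

x = np.array([2.0, 3.0])          # input row (one token embedding)
W = np.array([[1.0, 10.0],
              [0.0, -1.0]])       # each column produces one output component

y = x @ W
print(y)  # [ 2. 17.]

# Each output component is one column of W dotted with x:
assert y[0] == 2 * 1 + 3 * 0       # = 2
assert y[1] == 2 * 10 + 3 * (-1)   # = 17
```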

Why three versions of the same token? Because attention needs one representation for matching (the query), one for being matched against (the key), and one for the information actually carried forward (the value).
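To see those three roles in action, here is a hedged sketch of how the projected Q, K, and V rows feed scaled dot-product attention. The function name, sizes, and random weights are illustrative assumptions; only the Q/K/V recipe itself comes from the section above.

```python
import numpy as np

def attention(X, W_q, W_k, W_v):
    """Scaled dot-product attention sketch: Q matches, K is matched
    against, and V carries the information forward."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # query-key match scores
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted mix of value rows

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 8))                          # 4 tokens, dim 8 (assumed)
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = attention(X, W_q, W_k, W_v)
print(out.shape)  # (4, 8)
```

Each output row is a softmax-weighted blend of value rows, where the weights come from how well each query row matches each key row.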