
Q, K, and V Projections

Why the same embedding is remixed into query, key, and value vectors.

Also called: projection, QKV

Each token embedding row is turned into three related vectors: a query, a key, and a value.

The assignment calls this a projection because each new vector is produced by multiplying the original embedding row by a different learned weight matrix:

\[\mathbf{Q} = \mathbf{X}\mathbf{W_q},\qquad \mathbf{K} = \mathbf{X}\mathbf{W_k},\qquad \mathbf{V} = \mathbf{X}\mathbf{W_v}\]
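The three matrix products above can be sketched in a few lines of numpy. This is a minimal illustration, not a full attention layer: the sizes are arbitrary and the weight matrices are random here, whereas in a real model `W_q`, `W_k`, and `W_v` are learned.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, chosen only for illustration: 4 tokens, embedding dim 8.
n_tokens, d_model = 4, 8

X = rng.normal(size=(n_tokens, d_model))   # one embedding row per token
W_q = rng.normal(size=(d_model, d_model))  # learned in a real model
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

Q = X @ W_q  # each embedding row is remixed into a query row
K = X @ W_k  # ... into a key row
V = X @ W_v  # ... into a value row

print(Q.shape, K.shape, V.shape)  # (4, 8) (4, 8) (4, 8)
```

Note that all three products share the same input `X`; only the weight matrix differs, which is exactly the "remixing" the section describes.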
```mermaid
flowchart LR
    X[One embedding row x] --> WQ[Multiply by Wq]
    X --> WK[Multiply by Wk]
    X --> WV[Multiply by Wv]
    WQ --> Q[Query row]
    WK --> K[Key row]
    WV --> V[Value row]
```

One embedding row becomes Q, K, and V.

Here is one small worked example.

Input vector x (one token embedding row):

\[\mathbf{x} = (2, 3)\]

Weight matrix W (each output component uses one column of W):

\[\mathbf{W} = \begin{pmatrix} 1 & 10 \\ 0 & -1 \end{pmatrix}\]

Projected vector y:

\[\mathbf{y} = \mathbf{x}\mathbf{W} = (2, 17)\]

First output component: y_0 = 2·1 + 3·0 = 2. Take the first column of W, multiply it entry-by-entry against the input row, then add the results.

Second output component: y_1 = 2·10 + 3·(-1) = 17. Take the second column of W, multiply it entry-by-entry against the same input row, then add the results.

Figure: the input vector x = (2, 3) and the projected vector y = (2, 17), drawn from the origin on the same axes.

If you like geometric intuition, the matrix is taking weighted combinations of the original coordinates.
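The worked example can be checked directly with numpy; the row-vector-times-matrix convention matches the \(\mathbf{y} = \mathbf{x}\mathbf{W}\) form used above.

```python
import numpy as np

x = np.array([2.0, 3.0])          # input row (one token embedding)
W = np.array([[1.0, 10.0],
              [0.0, -1.0]])       # each column produces one output component

y = x @ W
print(y)  # [ 2. 17.]

# Each output component is one column of W dotted with x:
assert y[0] == 2 * 1 + 3 * 0       # = 2
assert y[1] == 2 * 10 + 3 * (-1)   # = 17
```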

Why three versions of the same token? Because attention needs one representation for matching (the query), one for being matched against (the key), and one for the information actually carried forward (the value).
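To see those three roles in action, here is a hedged sketch of how the projected Q, K, and V rows feed scaled dot-product attention. The function name, sizes, and random weights are illustrative assumptions; only the Q/K/V recipe itself comes from the section above.

```python
import numpy as np

def attention(X, W_q, W_k, W_v):
    """Scaled dot-product attention sketch: Q matches, K is matched
    against, and V carries the information forward."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # query-key match scores
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted mix of value rows

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 8))                          # 4 tokens, dim 8 (assumed)
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = attention(X, W_q, W_k, W_v)
print(out.shape)  # (4, 8)
```

Each output row is a softmax-weighted blend of value rows, where the weights come from how well each query row matches each key row.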