Attention Is All You Need · 2026 A1
Required concept
Why the same embedding is remixed into query, key, and value vectors.
Each token embedding row is turned into three related vectors:
The assignment calls this a projection because each new vector is produced by multiplying the original embedding row by a different matrix:
\[\mathbf{Q} = \mathbf{X}\mathbf{W_q},\qquad \mathbf{K} = \mathbf{X}\mathbf{W_k},\qquad \mathbf{V} = \mathbf{X}\mathbf{W_v}\]
flowchart LR
X[One embedding row x] --> WQ[Multiply by Wq]
X --> WK[Multiply by Wk]
X --> WV[Multiply by Wv]
WQ --> Q[Query row]
WK --> K[Key row]
WV --> V[Value row]
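The three projections can be sketched in a few lines of numpy. This is a toy illustration, not the paper's trained model: the matrices below are small random placeholders standing in for the learned weights W_q, W_k, W_v.

```python
import numpy as np

# One token embedding row (toy values, 2 dimensions).
x = np.array([[2.0, 3.0]])          # shape (1, 2)

# Three separate projection matrices -- illustrative random values,
# NOT trained weights.
rng = np.random.default_rng(0)
W_q = rng.standard_normal((2, 2))
W_k = rng.standard_normal((2, 2))
W_v = rng.standard_normal((2, 2))

# The same row x is multiplied by each matrix, giving three vectors.
Q = x @ W_q
K = x @ W_k
V = x @ W_v

print(Q.shape, K.shape, V.shape)    # (1, 2) (1, 2) (1, 2)
```

The point of the sketch: nothing about x changes between the three lines — only the matrix differs, so Q, K, and V are three learned "views" of the same token.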
Here is one small worked example.
x = (2, 3)            one token embedding row
W = [[1, 10],
     [0, -1]]         each output component uses one column of W
y = xW = (2, 17)      the same input row produces both output components
y_0 = 2·1 + 3·0 = 2
Use the first column of W, multiply it entry-by-entry against the input row, then add the results.
y_1 = 2·10 + 3·(-1) = 17
Use the second column of W, multiply it entry-by-entry against the same input row, then add the results.
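The arithmetic above can be checked directly; this sketch uses the same x, W, and y values as the worked example.

```python
import numpy as np

x = np.array([2.0, 3.0])            # one token embedding row
W = np.array([[1.0, 10.0],
              [0.0, -1.0]])

# y_0 uses column 0 of W (2*1 + 3*0), y_1 uses column 1 (2*10 + 3*(-1)),
# exactly as computed above.
y = x @ W
print(y)                            # [ 2. 17.]
```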
Why three versions of the same token? Because attention needs one representation for matching (the query), one for being matched against (the key), and one for the information actually carried forward (the value).
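To make that split concrete, here is a hedged sketch of single-head scaled dot-product attention, softmax(QKᵀ/√d)·V, over three toy token rows. The Q, K, V values are assumed illustrative inputs, as if already projected by W_q, W_k, W_v.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # queries matched against keys
    # Numerically stable row-wise softmax over the scores.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                   # values carried forward

# Three toy token rows, assumed already projected.
Q = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
K = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
V = np.array([[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]])

out = attention(Q, K, V)
print(out.shape)                         # (3, 2)
```

Each output row is a softmax-weighted mix of the value rows: the query decides *where* to look, the keys decide *how well* each position matches, and only the values contribute to what comes out.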