Dot Products and Scores
A way to measure how strongly two vectors line up, often used to build attention scores.
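As a minimal illustration of that idea (NumPy, with made-up vectors): aligned vectors yield a large dot product, orthogonal vectors yield zero.

```python
import numpy as np

# Hypothetical vectors, just to show what the dot product measures.
a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 1.0, 0.0])   # points in roughly the same direction as a
c = np.array([-2.0, 1.0, 0.0])  # orthogonal to a

print(a @ b)  # 4.0 -> strong alignment, high score
print(a @ c)  # 0.0 -> no alignment, zero score
```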
Attention Is All You Need · 2026 A1
Required concept
How each token decides which earlier tokens matter most.
Attention is the mechanism that lets one token position “look back” at other positions and decide which ones matter for the next computation.
At the simplest level, attention does three things: it scores how well the current position's query matches each earlier position's key, it normalizes those scores into weights, and it combines the value vectors using those weights.

In this assignment, you implement that pipeline directly:
\[
\text{scores} \rightarrow \text{masks} \rightarrow \text{softmax weights} \rightarrow \text{weighted sums}
\]

Using the assignment names, one position supplies a query vector $\vec{\mathbf{Q}}_i$, the earlier positions supply key and value vectors $\vec{\mathbf{K}}_j$ and $\vec{\mathbf{V}}_j$, and attention decides how much of each value vector should contribute to the output.
```mermaid
flowchart LR
    Q[Query vector at i] --> S[Scores against earlier j positions]
    K[Key vectors at j] --> S
    S --> M[Apply masks]
    M --> W[Stable softmax]
    W --> O[Weighted sum of value vectors]
    V[Value vectors at j] --> O
```
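The pipeline above can be sketched in NumPy for a single query position. This is a sketch, not the assignment's exact API: `attention_row` is our own name and the example vectors are made up, but the steps follow the scores → masks → softmax → weighted-sum order.

```python
import numpy as np

def attention_row(Q_i, K, V, mask=None):
    """One query attending over earlier positions (illustrative helper,
    not the assignment's API). K and V hold one row per earlier position."""
    # 1. Scores: dot product of the query against every key.
    scores = K @ Q_i                          # shape (j,)
    # 2. Masks: blocked positions get -inf so softmax gives them weight 0.
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)
    # 3. Stable softmax: subtract the max so exp() cannot overflow.
    shifted = scores - scores.max()
    weights = np.exp(shifted)
    weights = weights / weights.sum()         # non-negative, sums to 1
    # 4. Weighted sum of the value vectors.
    return weights @ V                        # shape (d,)

K = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # keys for 3 positions
V = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # matching values
Q_i = np.array([10.0, 0.0])                          # aligns with keys 0 and 2
out = attention_row(Q_i, K, V)  # weight splits between positions 0 and 2
```

Masking out a position (passing `False` for it in `mask`) drives its weight to zero, which is exactly the "why do some positions disappear?" rule discussed under masking.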
Modern chatbots generate text one token at a time. At each step the model has to decide which earlier tokens matter most. That “look back and focus” behaviour is attention.
For example:
Alice gave Bob the book because ___ was late.
The model needs a way to look back over earlier positions and decide which ones are relevant when forming the next internal representation. Attention is the mechanism that does that.
If you want the ingredients that feed attention, start with Q, K, and V projections. If you want the “why do some positions disappear?” rule, read masking.