
Required concept

Attention

How each token decides which earlier tokens matter most.

Also called: self-attention, single-head attention

Attention is the mechanism that lets one token position “look back” at other positions and decide which ones matter for the next computation.

At the simplest level, attention does three things:

  1. compare one token with earlier tokens
  2. turn those relevance scores into weights
  3. use those weights to combine information from the earlier tokens

In this assignment, you implement that pipeline directly:

\[\text{scores} \rightarrow \text{masks} \rightarrow \text{softmax weights} \rightarrow \text{weighted sums}\]

Using the assignment names, one position supplies a query vector $\vec{\mathbf{Q}}_i$, the earlier positions supply key and value vectors $\vec{\mathbf{K}}_j$ and $\vec{\mathbf{V}}_j$, and attention decides how much of each value vector should contribute to the output.
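The three steps can be sketched for a single query position. This is a minimal NumPy sketch, not the assignment's actual code: the toy dimensions, the random vectors, and the $\sqrt{d}$ scaling of the scores are assumptions for illustration.

```python
import numpy as np

def softmax(x):
    # Subtract the max before exponentiating for numerical stability.
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()

# Toy sizes (assumed): 3 earlier positions j, vector dimension 4.
rng = np.random.default_rng(0)
q = rng.normal(size=4)        # query vector at position i
K = rng.normal(size=(3, 4))   # key vectors at positions j
V = rng.normal(size=(3, 4))   # value vectors at positions j

scores = K @ q / np.sqrt(4)   # step 1: compare q with each key
weights = softmax(scores)     # step 2: relevance scores -> weights (sum to 1)
output = weights @ V          # step 3: weighted sum of the value vectors
```

Each earlier position contributes its value vector in proportion to its weight, so `output` blends information from the positions the query "attends to" most.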

```mermaid
flowchart LR
    Q[Query vector at i] --> S[Scores against earlier j positions]
    K[Key vectors at j] --> S
    S --> M[Apply masks]
    M --> W[Stable softmax]
    W --> O[Weighted sum of value vectors]
    V[Value vectors at j] --> O
```
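The masking and stable-softmax stages in the diagram can be sketched over a full score matrix. This is an illustrative sketch, assuming a causal mask (position $i$ may only see positions $j \le i$) and the common $\sqrt{d}$ scaling; the assignment's exact masks may differ.

```python
import numpy as np

def stable_softmax(x, axis=-1):
    # Subtract each row's max so exp never overflows, then normalize.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

T, d = 4, 8  # assumed toy sizes: 4 positions, dimension 8
rng = np.random.default_rng(1)
Q = rng.normal(size=(T, d))
K = rng.normal(size=(T, d))
V = rng.normal(size=(T, d))

scores = Q @ K.T / np.sqrt(d)             # scores for every (i, j) pair
future = np.triu(np.ones((T, T), dtype=bool), k=1)
scores = np.where(future, -np.inf, scores)  # mask: hide positions j > i
weights = stable_softmax(scores, axis=-1)   # each row sums to 1
output = weights @ V                        # weighted sums of value vectors
```

Masked entries are set to `-inf` *before* the softmax, so they become exactly zero weight afterwards; that is why some positions "disappear" from the weighted sum.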

The important simplification here is that you work with a single attention head, one query position at a time.

Modern chatbots generate text one token at a time. At each step the model has to decide which earlier tokens matter most. That “look back and focus” behaviour is attention.

For example:

Alice gave Bob the book because ___ was late.

The model needs a way to look back over earlier positions and decide which ones are relevant when forming the next internal representation. Attention is the mechanism that does that.

If you want the ingredients that feed attention, start with Q, K, and V projections. If you want the “why do some positions disappear?” rule, read masking.