
Required concept

Attention

How each token decides which earlier tokens matter most.

Also called: self-attention, single-head attention

Attention is the mechanism that lets one token position “look back” at other positions and decide which ones matter for the next computation.

At the simplest level, attention does three things:

  1. compare one token with earlier tokens
  2. turn those relevance scores into weights
  3. use those weights to combine information from the earlier tokens

In this assignment, you implement that pipeline directly:

\[\text{scores} \rightarrow \text{masks} \rightarrow \text{softmax weights} \rightarrow \text{weighted sums}\]

Using the assignment names, one position supplies a query vector $\vec{\mathbf{Q}}_i$, the earlier positions supply key and value vectors $\vec{\mathbf{K}}_j$ and $\vec{\mathbf{V}}_j$, and attention decides how much of each value vector should contribute to the output.
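The three steps can be sketched for a single query position. This is a minimal NumPy sketch, not the assignment's actual code: the toy dimensions, the random vectors, and the $\sqrt{d}$ scaling of the scores are assumptions for illustration.

```python
import numpy as np

def softmax(x):
    # Subtract the max before exponentiating for numerical stability.
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()

# Toy sizes (assumed): 3 earlier positions j, vector dimension 4.
rng = np.random.default_rng(0)
q = rng.normal(size=4)        # query vector at position i
K = rng.normal(size=(3, 4))   # key vectors at positions j
V = rng.normal(size=(3, 4))   # value vectors at positions j

scores = K @ q / np.sqrt(4)   # step 1: compare q with each key
weights = softmax(scores)     # step 2: relevance scores -> weights (sum to 1)
output = weights @ V          # step 3: weighted sum of the value vectors
```

Each earlier position contributes its value vector in proportion to its weight, so `output` blends information from the positions the query "attends to" most.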

```mermaid
flowchart LR
    Q[Query vector at i] --> S[Scores against earlier j positions]
    K[Key vectors at j] --> S
    S --> M[Apply masks]
    M --> W[Stable softmax]
    W --> O[Weighted sum of value vectors]
    V[Value vectors at j] --> O
```
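The masking and stable-softmax stages in the diagram can be sketched over a full score matrix. This is an illustrative sketch, assuming a causal mask (position $i$ may only see positions $j \le i$) and the common $\sqrt{d}$ scaling; the assignment's exact masks may differ.

```python
import numpy as np

def stable_softmax(x, axis=-1):
    # Subtract each row's max so exp never overflows, then normalize.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

T, d = 4, 8  # assumed toy sizes: 4 positions, dimension 8
rng = np.random.default_rng(1)
Q = rng.normal(size=(T, d))
K = rng.normal(size=(T, d))
V = rng.normal(size=(T, d))

scores = Q @ K.T / np.sqrt(d)             # scores for every (i, j) pair
future = np.triu(np.ones((T, T), dtype=bool), k=1)
scores = np.where(future, -np.inf, scores)  # mask: hide positions j > i
weights = stable_softmax(scores, axis=-1)   # each row sums to 1
output = weights @ V                        # weighted sums of value vectors
```

Masked entries are set to `-inf` *before* the softmax, so they become exactly zero weight afterwards; that is why some positions "disappear" from the weighted sum.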

The important simplification here is that you work with a single attention head, one query position at a time.

Modern chatbots generate text one token at a time. At each step the model has to decide which earlier tokens matter most. That “look back and focus” behaviour is attention.

For example:

Alice gave Bob the book because ___ was late.

The model needs a way to look back over earlier positions and decide which ones are relevant when forming the next internal representation. Attention is the mechanism that does that.

If you want the ingredients that feed attention, start with Q, K, and V projections. If you want the “why do some positions disappear?” rule, read masking.