COMP10002 Foundations of Algorithms
Transformer Explainer
A plain-English explanation of what a Transformer is, why attention sits at its core, and where this assignment fits into the larger model.
Short answer
A Transformer is a neural-network architecture for processing sequences such as text. It works by repeatedly taking a sequence of token representations, letting each position look at other positions through attention, then transforming those representations again.
The important idea is that each token does not get processed in isolation. Each position can build a context-aware representation by looking back across the sequence.
For this assignment, the main consequence is simple: the numbers for one token are allowed to depend on earlier tokens, and attention is the mechanism that decides how much each earlier token matters.
One layer of a Transformer
One Transformer layer is usually described as having two main parts:
- an attention block
- a feed-forward block
wrapped with residual connections and normalisation.
Input sequence
Token + position embeddings
Turn each token into an embedding and preserve its position in the sequence.
Earlier keys and values are kept in the KV cache.
Repeated Transformer blocks
Masked self-attention (your assignment)
Q, K, and V; scores; mask; softmax; weighted values.
Add + norm
Residual connection plus normalisation.
Feed-forward block
A small neural-network block applied separately to each position after attention.
Add + norm
The same residual-and-normalisation pattern happens again.
Prediction
Vocabulary scoring
Produce logits for all possible next tokens.
Next token
Pick or sample one output token, then repeat.
Want to read paper-style block diagrams?
If you want help reading the style of architecture figure that often appears in papers, see Reading LLM architecture diagrams.
Inside the attention block, each token representation is turned into query, key, and value vectors. The model then:
- compares queries against keys to get scores
- applies masking
- turns the scores into weights
- uses those weights to blend the value vectors into new outputs
That is the exact slice of the architecture that your code implements.
Why attention matters
Attention matters because language depends on context. The right interpretation of a word or token often depends on what came earlier.
For example, in a sentence like:
The animal did not cross the road because it was tired.
the model needs a way to connect “it” back to earlier context. Attention gives each position a structured way to decide which earlier tokens are most relevant.
Without attention, the model would have a much harder time making those long-range connections. Transformers became important because attention made it easier to model those dependencies in parallel and at scale.
Where this assignment fits
This assignment isolates the attention part of one simplified Transformer block.
You are given:
- token embeddings
- projection matrices Wq, Wk, and Wv
- the prompt mask
- generated token embeddings for Stage 6
You implement:
- projection into Q, K, and V
- prompt attention scores
- prompt attention weights
- prompt attention outputs
- generated outputs using the KV cache
You do not implement:
- tokenisation
- training
- multiple attention heads
- feed-forward networks
- residual connections
- layer normalisation
- vocabulary scoring and next-token selection
That is why the assignment feels smaller than a “full Transformer” while still being a real part of one.
One-line summary
The shortest accurate summary is: this assignment asks you to implement the part of a Transformer that decides which earlier positions matter, then mixes information from those positions into a new output vector.
Generation and the KV cache
When a Transformer generates text, it does not want to recompute all previous keys and values from scratch at every step. Instead, it stores them in a KV cache.
That is what Stage 6 is modelling.
At generation step t, the model:
- computes the new token’s query, key, and value
- appends the new key and value to the cache
- scores the new query against the cached keys
- forms the new output from the cached values
This is still attention. The only difference is that the earlier keys and values are reused instead of recomputed.
If you want the assignment-specific version of that story, continue to the KV cache explainer.