COMP10002 Foundations of Algorithms
Transformer Explainer
A plain-English explanation of what a Transformer is, why attention sits at its core, and where this assignment fits into the larger model.
Short answer
A Transformer is a neural-network architecture for processing sequences such as text. It works by repeatedly taking a sequence of token representations, letting each position look at other positions through attention, then transforming those representations again.
The important idea is that each token does not get processed in isolation. Each position can build a context-aware representation by looking back across the sequence.
For this assignment, the main consequence is simple: the numbers for one token are allowed to depend on earlier tokens, and attention is the mechanism that decides how much each earlier token matters.
One layer of a Transformer
One Transformer layer is usually described as having two main parts:
- an attention block
- a feed-forward block
wrapped with residual connections and normalisation.
Input sequence
Token + position embeddings
Turn each token into an embedding and preserve its position in the sequence.
Earlier keys and values are kept in the KV cache.
Repeated Transformer blocks
Masked self-attention (your assignment)
Q, K, and V; scores; mask; softmax; weighted values.
Add + norm
Residual connection plus normalisation.
Feed-forward block
A small neural-network block applied separately to each position after attention.
Add + norm
The same residual-and-normalisation pattern happens again.
Prediction
Vocabulary scoring
Produce logits for all possible next tokens.
Next token
Pick or sample one output token, then repeat.
Want to read paper-style block diagrams?
If you want help reading the style of architecture figure that often appears in papers, see Reading LLM architecture diagrams.
Inside the attention block, each token representation is turned into query, key, and value vectors. The model then:
- compares queries against keys to get scores
- applies masking
- turns the scores into weights
- uses those weights to blend the value vectors into new outputs
That is the exact slice of the architecture that your code implements.
Why attention matters
Attention matters because language depends on context. The right interpretation of a word or token often depends on what came earlier.
For example, in a sentence like:
The animal did not cross the road because it was tired.
the model needs a way to connect “it” back to earlier context. Attention gives each position a structured way to decide which earlier tokens are most relevant.
Without attention, the model would have a much harder time making those long-range connections. Transformers became important because attention made it easier to model those dependencies in parallel and at scale.
Where this assignment fits
This assignment isolates the attention part of one simplified Transformer block.
You are given:
- token embeddings
- projection matrices Wq, Wk, and Wv
- the prompt mask
- generated token embeddings for Stage 6
You implement:
- projection into Q, K, and V
- prompt attention scores
- prompt attention weights
- prompt attention outputs
- generated outputs using the KV cache
You do not implement:
- tokenisation
- training
- multiple attention heads
- feed-forward networks
- residual connections
- layer normalisation
- vocabulary scoring and next-token selection
That is why the assignment feels smaller than a “full Transformer” while still being a real part of one.
One-line summary
The shortest accurate summary is: this assignment asks you to implement the part of a Transformer that decides which earlier positions matter, then mixes information from those positions into a new output vector.
Generation and the KV cache
When a Transformer generates text, it does not want to recompute all previous keys and values from scratch at every step. Instead, it stores them in a KV cache.
That is what Stage 6 is modelling.
At generation step t, the model:
- computes the new token’s query, key, and value
- appends the new key and value to the cache
- scores the new query against the cached keys
- forms the new output from the cached values
This is still attention. The only difference is that the earlier keys and values are reused instead of recomputed.
If you want the assignment-specific version of that story, continue to the KV cache explainer.