COMP10002 Foundations of Algorithms

Transformer Explainer

A plain-English explanation of what a Transformer is, why attention sits at its core, and where this assignment fits into the larger model.

Short answer

A Transformer is a neural-network architecture for processing sequences such as text. It works by repeatedly taking a sequence of token representations, letting each position look at other positions through attention, then transforming those representations again.

The important idea is that each token does not get processed in isolation. Each position can build a context-aware representation by looking back across the sequence.

For this assignment, the main consequence is simple: the numbers for one token are allowed to depend on earlier tokens, and attention is the mechanism that decides how much each earlier token matters.

One layer of a Transformer

One Transformer layer is usually described as having two main parts:

  1. an attention block
  2. a feed-forward block

wrapped with residual connections and normalisation.

The full model repeats Transformer blocks many times. This assignment zooms in on the masked self-attention part inside one block, then reuses cached keys and values during generation.

Want to read paper-style block diagrams?

If you want help reading the style of architecture figure that often appears in papers, read Reading LLM architecture diagrams.

Inside the attention block, each token representation is turned into query, key, and value vectors. The model then:

  1. compares queries against keys to get scores
  2. applies masking
  3. turns the scores into weights
  4. uses those weights to blend the value vectors into new outputs

That is the exact slice of the architecture that your code implements.

Why attention matters

Attention matters because language depends on context. The right interpretation of a word or token often depends on what came earlier.

For example, in a sentence like:

The animal did not cross the road because it was tired.

the model needs a way to connect “it” back to earlier context. Attention gives each position a structured way to decide which earlier tokens are most relevant.

Without attention, the model would have a much harder time making those long-range connections. Transformers became important because attention made it easier to model those dependencies in parallel and at scale.

Where this assignment fits

This assignment isolates the attention part of one simplified Transformer block.

You are given:

  1. the prompt’s token representations
  2. the weights used for the projection into $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$

You implement:

  1. projection into $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$
  2. prompt attention scores
  3. prompt attention weights
  4. prompt attention outputs
  5. generated outputs using the KV cache

You do not implement:

  1. the feed-forward block
  2. the residual connections and normalisation around the block
  3. the stacking of repeated Transformer blocks into a full model

That is why the assignment feels smaller than a “full Transformer” while still being a real part of one.

One-line summary

The shortest accurate summary is: this assignment asks you to implement the part of a Transformer that decides which earlier positions matter, then mixes information from those positions into a new output vector.

Generation and the KV cache

When a Transformer generates text, recomputing every previous key and value from scratch at each step would be wasteful. Instead, the model stores them in a KV cache and reuses them.

That is what Stage 6 is modelling.

At generation step $t$, the model:

  1. computes the new token’s query, key, and value
  2. appends the new key and value to the cache
  3. scores the new query against the cached keys
  4. forms the new output from the cached values

This is still attention. The only difference is that the earlier keys and values are reused instead of recomputed.

If you want the assignment-specific version of that story, continue to the KV cache explainer.