COMP10002 Foundations of Algorithms
Transformer History
Historical context for why attention and Transformers mattered, separated from the main implementation path so it stays readable rather than overwhelming.
Neural networks and AlexNet
Neural networks are programs built from many layers of simple numeric computations. A common layer looks like:
\[\text{output} = f(W\,\text{input} + b)\]
where:
- $W$ is a weight matrix
- $b$ is a bias vector
- $f(\cdot)$ is an activation function applied elementwise
That should already look a little familiar from this assignment. Your projection stages are also matrix multiplications with learned weights. In the full Transformer story, many of the important pieces are still built from the same basic ingredients:
- matrix multiplies to form $Q$, $K$, and $V$
- dot products to compare positions
- row-wise softmax to turn scores into weights
- weighted sums to produce new representations
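Written out as a single formula, those ingredients combine into the scaled dot-product attention from the 2017 paper: $QK^{\top}$ holds the pairwise dot products, the softmax is applied row-wise, and multiplying by $V$ performs the weighted sums ($d_k$ is the key dimension):

\[\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V\]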
So one way to read Transformers historically is: they are not a rejection of neural networks. They are a new neural-network architecture built from the same kind of numeric primitives, but arranged in a way that handles sequence context much better.
What “non-linear” means in plain English
Why bother with a non-linear $f$ at all? Because if every layer were only matrix multiplies and additions, stacking many layers would collapse into one big linear transform. The non-linear step prevents that collapse and lets the model represent much richer behaviour.
Common modern activations include:
- ReLU: $\max(0, x)$
- GELU: a smoother variant used in many Transformers
You do not implement those activations in this assignment, but they are part of the larger context around Transformer models.
In 2012, AlexNet helped convince the field that deep neural networks plus enough data and computation could work extremely well in practice. That wider deep-learning shift is part of the story that made later architectures like Transformers worth scaling up.
AlexNet itself was not a Transformer, and it was built for images rather than text. But historically it mattered because it pushed the field toward a mindset that also underlies modern large language models:
- large learned weight matrices can capture useful structure
- stacking many simple numeric layers can produce surprisingly strong behaviour
- hardware-friendly linear algebra operations can scale well when the data and compute are available
The 2017 paper
The Transformer architecture became widely known after Attention Is All You Need (Vaswani et al., 2017).
Its practical message was simple but powerful: strong sequence models did not have to process text strictly one token at a time. Attention could compare every position with every other position inside a single layer, which fit modern matrix-heavy hardware much better.
This is exactly where the assignment connects back to the history:
- your Stage 2 builds the projections $Q$, $K$, and $V$
- your Stage 3 computes the scaled attention scores
- your Stage 4 applies row-wise stable softmax
- your Stage 5 forms the weighted output vectors
- your Stage 6 uses the KV cache that real autoregressive generation also relies on
So the page is not just saying “Transformers mattered”. It is saying that the algorithmic pipeline you are coding is a stripped-down version of the mechanism that made Transformers so important.
If you want to read around that moment directly:
- The Illustrated Transformer gives a strong visual overview of the architecture
- Attention Is All You Need is the original paper itself
Before vs after
Before Transformers, many sequence models updated an internal state step by step as they read text. That works, but:
- information must travel through many sequential steps
- long-range dependencies are harder to handle
- parallel hardware is harder to use efficiently
After Transformers:
- a layer can compare many token positions in parallel
- the architecture maps naturally onto large matrix multiplications
- scaling the model and data often keeps paying off
That second point is why so much of the Transformer pipeline looks like matrix algebra. The big practical win was not only that attention could model long-range context better, but also that its core operations were things hardware was already very good at:
- matrix multiplies for projections
- batched score computations
- row-wise softmax normalisation
- weighted sums for the final outputs
Why the impact took time
The 2017 paper was framed around machine translation, not around a general-purpose chatbot. It took time for the community to realise that the same architecture, when scaled with more data and more compute, could become a much more general language model.
Part of that delayed impact is that the individual ingredients do not look magical on their own. Matrix multiplies, dot products, masking, and softmax are all fairly ordinary mathematical operations. What turned out to matter was the way those pieces were combined inside the attention mechanism and then scaled aggressively.
In hindsight, Transformers were one of those ideas whose full consequences were clearer only after years of scaling, experimentation, and productisation.