COMP10002 Foundations of Algorithms
Transformer History
Historical context for why attention and Transformers mattered, separated from the main implementation path so it stays readable rather than overwhelming.
Neural networks and AlexNet
Neural networks are programs built from many layers of simple numeric computations. A common layer looks like:
\[\text{output} = f(W\,\text{input} + b)\]
where:
- $W$ is a weight matrix
- $b$ is a bias vector
- $f(\cdot)$ is an activation function applied elementwise
That should already look a little familiar from this assignment. Your projection stages are also matrix multiplications with learned weights. In the full Transformer story, many of the important pieces are still built from the same basic ingredients:
- matrix multiplies to form $Q$, $K$, and $V$
- dot products to compare positions
- row-wise softmax to turn scores into weights
- weighted sums to produce new representations
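Written out as a single formula, those ingredients combine into the scaled dot-product attention from the 2017 paper: $QK^{\top}$ holds the pairwise dot products, the softmax is applied row-wise, and multiplying by $V$ performs the weighted sums ($d_k$ is the key dimension):

\[\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V\]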
So one way to read Transformers historically is: they are not a rejection of neural networks. They are a new neural-network architecture built from the same kind of numeric primitives, but arranged in a way that handles sequence context much better.
What “non-linear” means in plain English
Why bother with a non-linear $f$ at all? Because if every layer were only matrix multiplies and additions, stacking many layers would collapse into one big linear transform. The non-linear step prevents that collapse and lets the model represent much richer behaviour.
Common modern activations include:
- ReLU: $\max(0, x)$
- GELU: a smoother variant used in many Transformers
You do not implement those activations in this assignment, but they are part of the larger context around Transformer models.
In 2012, AlexNet helped convince the field that deep neural networks plus enough data and computation could work extremely well in practice. That wider deep-learning shift is part of the story that made later architectures like Transformers worth scaling up.
AlexNet itself was not a Transformer, and it was built for images rather than text. But historically it mattered because it pushed the field toward a mindset that also underlies modern large language models:
- large learned weight matrices can capture useful structure
- stacking many simple numeric layers can produce surprisingly strong behaviour
- hardware-friendly linear algebra operations can scale well when the data and compute are available
The 2017 paper
The Transformer architecture became widely known after Attention Is All You Need (Vaswani et al., 2017).
Its practical message was simple but powerful: strong sequence models did not have to process text strictly one token at a time. Attention could compare every position with every other position inside a single layer, which fit modern matrix-heavy hardware much better.
This is exactly where the assignment connects back to the history:
- your Stage 2 builds the projections $Q$, $K$, and $V$
- your Stage 3 computes the scaled attention scores
- your Stage 4 applies row-wise stable softmax
- your Stage 5 forms the weighted output vectors
- your Stage 6 uses the KV cache that real autoregressive generation also relies on
So the page is not just saying “Transformers mattered”. It is saying that the algorithmic pipeline you are coding is a stripped-down version of the mechanism that made Transformers so important.
If you want to read around that moment directly:
- The Illustrated Transformer gives a strong visual overview of the architecture
- Attention Is All You Need is the original paper itself
Before vs after
Before Transformers, many sequence models updated an internal state step by step as they read text. That works, but:
- information must travel through many sequential steps
- long-range dependencies are harder to handle
- parallel hardware is harder to use efficiently
After Transformers:
- a layer can compare many token positions in parallel
- the architecture maps naturally onto large matrix multiplications
- scaling the model and data often keeps paying off
That second point is why so much of the Transformer pipeline looks like matrix algebra. The big practical win was not only that attention could model long-range context better, but also that its core operations were things hardware was already very good at:
- matrix multiplies for projections
- batched score computations
- row-wise softmax normalisation
- weighted sums for the final outputs
Why the impact took time
The 2017 paper was framed around machine translation, not around a general-purpose chatbot. It took time for the community to realise that the same architecture, when scaled with more data and more compute, could become a much more general language model.
Part of that delayed impact is that the individual ingredients do not look magical on their own. Matrix multiplies, dot products, masking, and softmax are all fairly ordinary mathematical operations. What turned out to matter was the way those pieces were combined inside the attention mechanism and then scaled aggressively.
In hindsight, Transformers were one of those ideas whose full consequences were clearer only after years of scaling, experimentation, and productisation.