COMP10002 Foundations of Algorithms

Transformer History

Historical context for why attention and Transformers mattered, kept separate from the main implementation path so that path stays readable rather than overwhelming.

Neural networks and AlexNet

Neural networks are programs built from many layers of simple numeric computations. A common layer looks like:

\[\text{output} = f(W\,\text{input} + b)\]

where:

  • W is a matrix of learned weights,
  • b is a vector of learned biases, and
  • f is a simple non-linear function (an “activation”) applied element-wise.

That should already look a little familiar from this assignment. Your projection stages are also matrix multiplications with learned weights. In the full Transformer story, many of the important pieces are still built from the same basic ingredients:

  • matrix multiplications with learned weights,
  • dot products, and
  • a few simple element-wise functions.

So one way to read Transformers historically is: they are not a rejection of neural networks. They are a new neural-network architecture built from the same kind of numeric primitives, but arranged in a way that handles sequence context much better.

What “non-linear” means in plain English

Why bother with a non-linear f at all? Because if every layer were only matrix multiplies and additions, many layers would collapse into one big linear transform. The non-linear step stops that collapse and lets the model represent much richer behaviour.

Common modern activations include:

  • ReLU: max(0, x)
  • GELU: a smoother variant used in many Transformers

You do not implement those activations in this assignment, but they are part of the larger context around Transformer models.

In 2012, AlexNet helped convince the field that deep neural networks plus enough data and computation could work extremely well in practice. That wider deep-learning shift is part of the story that made later architectures like Transformers worth scaling up.

AlexNet itself was not a Transformer, and it was built for images rather than text. But historically it mattered because it pushed the field toward a mindset that also underlies modern large language models:

  • build deep stacks of simple numeric layers rather than hand-designing rules,
  • train them on large amounts of data, and
  • scale up the computation.

The 2017 paper

The Transformer architecture became widely known after Attention Is All You Need (Vaswani et al., 2017).

Its practical message was simple but powerful: strong sequence models did not have to process text strictly one token at a time. Attention could compare many positions with many other positions inside a layer, which fit modern matrix-heavy hardware much better.

This is exactly where the assignment connects back to the history: the pipeline you are coding, with its learned projections, dot products between positions, masking, and softmax, is a small version of that position-comparing attention mechanism.

So the page is not just saying “Transformers mattered”. It is saying that the algorithmic pipeline you are coding is a stripped-down version of the mechanism that made Transformers so important.

If you want to read around that moment directly, the original paper, Attention Is All You Need (Vaswani et al., 2017), is the natural starting point.

Before vs after

Before Transformers, many sequence models updated an internal state step by step as they read text. That works, but:

  • each step depends on the previous one, so the computation is hard to parallelise, and
  • context from early tokens tends to fade by the time the model reaches later ones.

After Transformers:

  • attention lets many positions be compared with many other positions inside a single layer, and
  • most of the computation becomes matrix algebra that parallel hardware handles very well.

That second point is why so much of the Transformer pipeline looks like matrix algebra. The big practical win was not only that attention could model long-range context better, but also that its core operations were things hardware was already very good at:

  • large matrix multiplications,
  • dot products between vectors, and
  • simple element-wise operations such as masking and softmax.

Why the impact took time

The 2017 paper was framed around machine translation, not around a general-purpose chatbot. It took time for the community to realise that the same architecture, when scaled with more data and more compute, could become a much more general language model.

Part of that delayed impact is that the individual ingredients do not look magical on their own. Matrix multiplies, dot products, masking, and softmax are all fairly ordinary mathematical operations. What turned out to matter was the way those pieces were combined inside the attention mechanism and then scaled aggressively.

In hindsight, Transformers were one of those ideas whose full consequences were clearer only after years of scaling, experimentation, and productisation.