COMP10002 Foundations of Algorithms

Reading LLM Architecture Diagrams

How to read the block diagrams that often appear in Transformer and LLM papers, without getting lost in all the boxes and arrows.

Short answer

When you see an LLM architecture figure in a paper, do not try to understand every box at once.

Read it as a data-flow diagram:

what goes in
what gets repeated
what the main sub-blocks are
what comes out

Most architecture figures are trying to show the shape of the computation, not the full implementation detail.

How to read them

Start with the arrows.

They usually tell you the highest-level story:

tokens go in
tokens become embeddings
those embeddings pass through repeated Transformer blocks
the final representation is used to produce a score for each possible next token
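The four steps above can be sketched as a tiny data-flow program. This is only an illustration of the shape of the computation: the vocabulary, the embedding values, and the helper names (`embed`, `toy_block`, `next_token_scores`, `forward`) are all made up here, and `toy_block` is a stand-in that does no real attention.

```python
# Toy data-flow sketch: every number and name below is invented for illustration.
VOCAB = ["the", "cat", "sat"]                  # hypothetical 3-word vocabulary
EMBED = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # one 2-d embedding per token id


def embed(token_ids):
    """Tokens become embeddings: look up one vector per token id."""
    return [EMBED[t][:] for t in token_ids]


def toy_block(xs):
    """Stand-in for a Transformer block: same shape in, same shape out."""
    return [[0.5 * v for v in x] for x in xs]


def next_token_scores(x):
    """Score every vocabulary entry against the final vector (dot product)."""
    return [sum(a * b for a, b in zip(x, e)) for e in EMBED]


def forward(token_ids, n_layers=2):
    xs = embed(token_ids)               # tokens -> embeddings
    for _ in range(n_layers):           # embeddings -> repeated blocks
        xs = toy_block(xs)
    return next_token_scores(xs[-1])    # last position -> one score per vocab entry


scores = forward([0, 1])                # token ids for "the", "cat"
```

Here `scores` has one entry per vocabulary word, which is the "scores over the next token" that the rightmost or topmost box in a figure usually stands for.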

Then identify what kind of diagram it is.

Most paper figures are doing one of these jobs:

showing the overall pipeline from input text to output text
zooming into one Transformer block
zooming further into one attention sub-block
comparing two architectures side by side

If you know which of those jobs the figure is doing, it becomes much easier to read.


Figure: Transformer full architecture diagram with an encoder stack on the left and a decoder stack on the right, including self-attention, cross-attention, feed-forward blocks, norms, and embedding or projection stages.

A typical Transformer-style paper figure compresses a large computation into a few labelled boxes and repeated stacks. Read it first as a pipeline, then zoom into the sub-block you care about. Figure from “Transformer, full architecture” by dvgodoy, licensed CC BY 4.0, via Wikimedia Commons.

Look for the input and output first

Find the leftmost or bottommost input, then trace where the arrows eventually lead. That gives you the big picture before you worry about details.

Notice what is repeated

If a diagram says something like “×12”, “×24”, or “N layers”, it usually means “this same block is stacked many times”.

Read box labels as roles

Labels such as “self-attention”, “feed-forward”, “add & norm”, or “MLP” usually name the role of a sub-computation. They do not spell out the loops, array indexing, or other implementation details behind it.
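As an example of how much a short label can hide, here is one plausible reading of an “add & norm” box. It is a sketch under assumptions: the function names are hypothetical, the learned gain and bias of layer normalisation are omitted, and some models apply the normalisation before the sub-layer rather than after it.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalise one vector to zero mean and roughly unit variance.

    Learned gain and bias parameters are left out of this sketch.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def add_and_norm(x, sublayer_out):
    """The 'add & norm' box: residual add, then layer normalisation."""
    summed = [a + b for a, b in zip(x, sublayer_out)]   # the "add" (skip connection)
    return layer_norm(summed)                           # the "norm"


y = add_and_norm([1.0, 2.0], [1.0, 2.0])
```

The label is three words in the figure, but the code still has to decide what is averaged, over which axis, and with what epsilon.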

Repeated blocks

A common source of confusion is that one box in a paper figure may really stand for a large repeated stack.

For example, if you see:

one Transformer block with a note like × 12
a bracket saying L layers
a tall repeated stack in the middle of the figure

that usually means the model applies the same kind of block many times in sequence, typically with separate learned weights for each layer.

The figure is not claiming there is literally only one attention computation. It is compressing many similar layers into one readable visual unit.
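In code, a “× N” annotation usually turns into a loop over a list of blocks. The sketch below is a made-up illustration: `make_block` and `run_stack` are hypothetical names, and the single `scale` number stands in for whatever parameters a real layer would have.

```python
def make_block(scale):
    """Return a hypothetical block; `scale` stands in for that layer's weights."""
    def block(xs):
        # Residual-style update, so the output keeps the input's shape.
        return [[v + scale * v for v in x] for x in xs]
    return block


def run_stack(xs, blocks):
    """What "x N" in a figure means: feed each block's output into the next."""
    for block in blocks:
        xs = block(xs)
    return xs


# "x 3" in a figure: three blocks of the same kind, each with its own parameter.
blocks = [make_block(s) for s in (1.0, 0.5, 0.0)]
out = run_stack([[1.0, 2.0]], blocks)
```

The one box drawn in the figure corresponds to `make_block`; the “× 3” corresponds to the length of `blocks`.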

That is the same reason papers often draw one attention block, one feed-forward block, and one output head even though the real implementation may involve:

many layers
multiple attention heads
batch dimensions
training-only components not shown in the simplified figure
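For instance, a single “multi-head attention” box hides a split of each projected vector into per-head chunks. A minimal sketch, assuming the common convention that the vector length is divided evenly among the heads (`split_heads` is a hypothetical helper name):

```python
def split_heads(x, n_heads):
    """Split one projected vector into n_heads equal chunks, one per head."""
    if len(x) % n_heads != 0:
        raise ValueError("vector length must be divisible by n_heads")
    d_head = len(x) // n_heads
    return [x[h * d_head:(h + 1) * d_head] for h in range(n_heads)]


heads = split_heads([1, 2, 3, 4, 5, 6], 3)   # three heads, two numbers each
```

Each chunk then goes through its own attention computation, and the results are joined back together, none of which is visible when the figure draws just one box.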

What diagrams often leave out

Paper diagrams are helpful, but they often omit the details you would need to code the model directly.

What commonly gets suppressed:

exact array shapes
whether vectors are rows or columns
where masking happens
whether softmax is row-wise
how many heads exist inside “multi-head attention”
what is different at training time versus generation time
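Several of these suppressed details only become concrete once you write the code. The sketch below makes three of them explicit, under stated assumptions: `scores` is an n-by-n list of lists with one row per query position and one column per key position, masking is causal (a position may not look at later positions), and softmax is applied to each row separately. The function names are chosen for this example.

```python
import math


def stable_softmax(row):
    """Softmax over one row; subtracting the max avoids overflow in exp."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention_weights(scores):
    """scores[i][j] is the score of query position i against key position j.

    Entries with j > i (future positions) are masked to -inf before the
    softmax, so they receive a weight of exactly zero.
    """
    n = len(scores)
    weights = []
    for i in range(n):
        masked = [scores[i][j] if j <= i else -math.inf for j in range(n)]
        weights.append(stable_softmax(masked))   # row-wise: one softmax per query
    return weights


w = causal_attention_weights([[0.0, 5.0],
                              [1.0, 1.0]])
```

In this example the large score 5.0 sits in a masked position, so the first row puts all of its weight on position 0, and every row sums to 1.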

Why figures feel simpler than code

This is why a paper figure can feel easy to read but hard to turn directly into code. The figure tells you the broad structure; the equations and prose carry the operational detail.

If you want the equation-side version of the same problem, read the page “Reading Transformer paper notation”.

How this maps to the assignment

The architecture diagrams in papers usually contain much more than this assignment asks you to implement.

In a typical LLM figure, your assignment corresponds to only a small middle slice:

building Q, K, and V
computing attention scores
applying masking
applying stable softmax
using the weights to produce an attention output
reusing earlier keys and values through the KV cache
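To show how those pieces fit together, here is a rough single-head sketch that processes one new token at a time. It is not the assignment's specification: the function names, the list-of-lists matrix layout, and the scaling by the square root of the vector length (the usual scaled dot-product convention) are assumptions made for this example.

```python
import math


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def stable_softmax(row):
    """Softmax with the max subtracted first, to avoid overflow in exp."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend_step(x, Wq, Wk, Wv, k_cache, v_cache):
    """Process one new token vector x, reusing cached keys and values."""
    q = matvec(Wq, x)                  # build Q, K, V for the new token only
    k_cache.append(matvec(Wk, x))      # earlier keys are reused, not recomputed
    v_cache.append(matvec(Wv, x))
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in k_cache]
    # The cache only ever holds current and earlier tokens, so in this
    # one-token-at-a-time setting the causal mask is implicit.
    weights = stable_softmax(scores)
    return [sum(w * v[j] for w, v in zip(weights, v_cache))
            for j in range(len(v_cache[0]))]


I2 = [[1.0, 0.0], [0.0, 1.0]]          # identity weights, purely for illustration
k_cache, v_cache = [], []
first = attend_step([1.0, 0.0], I2, I2, I2, k_cache, v_cache)
second = attend_step([0.0, 1.0], I2, I2, I2, k_cache, v_cache)
```

After the two calls the caches hold two keys and two values. The second call computes only one new key and one new value, then attends over everything stored so far.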

That is why the Transformer explainer is useful before or after this page: it shows the larger block structure, while this page is about how to read the style of figure that papers tend to use.