COMP10002 Foundations of Algorithms

Parameters and Model Scale

What “billions of parameters” means, where those numbers come from, and what they imply for memory.

What parameters are

When people say an LLM has “7 billion parameters”, they mean the model stores roughly 7 billion learned numbers.

These are not prompt-specific inputs. They are the model’s learned weights after training.

A compact mental model of parameters

An LLM is essentially a big bundle of matrices full of numbers. At runtime you feed vectors in; the model multiplies them by those matrices and applies simple functions. Training is the process of finding good values for those stored numbers.

Where parameters are used

In a Transformer-style model, parameters include:

  • the token embedding table (vocabulary size × model dimension)
  • the attention projection matrices (query, key, value, output)
  • the feed-forward (MLP) weight matrices in every layer
  • smaller pieces such as layer-norm scales and biases

Size examples

Toy-sized example

Suppose:

Then rough parameter counts look like:

That already gets you to roughly 24 million parameters even before counting every detail.

Rough 7B-style example

Suppose:

Then the model quickly becomes huge:

That is how you arrive at a “7B-ish” model size even with rough counting.

Real model examples

These examples are useful mainly for scale intuition. They are not a leaderboard, and parameter count is only one part of what makes a model strong.

GPT-3

OpenAI, 2020. GPT-3 is a good historical anchor because its paper made the scale very visible: 175 billion parameters.

It is older than current frontier models, but it is still one of the clearest public examples of what “hundreds of billions of parameters” looks like in practice.

Llama 3.1 405B

Meta, 2024. This is a modern open-weight example: 405 billion parameters, with a 128K context window.

It is useful here because it shows that open models now also exist at very large scales, not just small classroom or hobby sizes.

DeepSeek-V3

DeepSeek, 2024. This is a useful modern public example because its technical report gives concrete scale numbers: 671 billion total parameters, with about 37 billion activated per token.

It is also a good reminder that “model size” is no longer always one simple dense parameter count. Modern large models are often sparse or mixture-of-experts systems, so total parameters and active parameters can differ a lot.

So when people talk about model scale today, they may be referring to any mix of:

  • total stored parameters
  • parameters activated per token
  • the memory footprint those parameters imply

Memory estimates

Memory for stored parameters is roughly:

\[\text{parameter count} \times \text{bytes per number}\]

Common storage choices:

  • 32-bit floats (fp32): 4 bytes per parameter
  • 16-bit floats (fp16 or bf16): 2 bytes per parameter
  • 8-bit integers (int8): 1 byte per parameter

So a 6.5B-parameter model is roughly:

  • 26 GB at 4 bytes per parameter
  • 13 GB at 2 bytes per parameter
  • 6.5 GB at 1 byte per parameter

What those memory numbers leave out

Two caveats matter:

  • that estimate only counts stored weights, not activations or caches
  • training usually needs much more memory than inference because optimisation algorithms keep extra arrays around

For this assignment, the scale is tiny by comparison: d ≤ 64, the matrices are given to you as input, and you do not train anything.