COMP10002 Foundations of Algorithms

Parameters and Model Scale

What “billions of parameters” means, where those numbers come from, and what they imply for memory.

What parameters are

When people say an LLM has “7 billion parameters”, they mean the model stores roughly 7 billion learned numbers.

These are not prompt-specific inputs. They are the model’s learned weights after training.

A compact mental model of parameters

An LLM is essentially a big bundle of matrices full of numbers. At runtime you feed vectors in; the model multiplies them by those matrices and applies simple functions. Training is the process of finding good values for those stored numbers.

Where parameters are used

In a Transformer-style model, parameters include:

  • the token embedding table (vocabulary size × model dimension)
  • the attention projection matrices (query, key, value, output)
  • the feed-forward (MLP) weight matrices in every layer
  • smaller pieces such as layer-norm scales and biases

Size examples

Toy-sized example

Suppose:

Then rough parameter counts look like:

That already gets you to roughly 24 million parameters even before counting every detail.

Rough 7B-style example

Suppose:

Then the model quickly becomes huge:

That is how you arrive at a “7B-ish” model size even with rough counting.

Real model examples

These examples are useful mainly for scale intuition. They are not a leaderboard, and parameter count is only one part of what makes a model strong.

GPT-3

OpenAI, 2020. GPT-3 is a good historical anchor because its paper made the scale very visible: 175 billion parameters.

It is older than current frontier models, but it is still one of the clearest public examples of what “hundreds of billions of parameters” looks like in practice.

Llama 3.1 405B

Meta, 2024. This is a modern open-weight example: 405 billion parameters, with a 128K context window.

It is useful here because it shows that open models now also exist at very large scales, not just small classroom or hobby sizes.

DeepSeek-V3

DeepSeek, 2024. This is a useful modern public example because its technical report gives concrete scale numbers: 671 billion total parameters, with about 37 billion activated per token.

It is also a good reminder that “model size” is no longer always one simple dense parameter count. Modern large models are often sparse or mixture-of-experts systems, so total parameters and active parameters can differ a lot.

So when people talk about model scale today, they may be referring to any mix of:

  • total stored parameters
  • parameters activated per token
  • the memory footprint those parameters imply

Memory estimates

Memory for stored parameters is roughly:

\[\text{parameter count} \times \text{bytes per number}\]

Common storage choices:

  • 32-bit floats (fp32): 4 bytes per parameter
  • 16-bit floats (fp16 or bf16): 2 bytes per parameter
  • 8-bit integers (int8): 1 byte per parameter

So a 6.5B-parameter model is roughly:

  • 26 GB at 4 bytes per parameter
  • 13 GB at 2 bytes per parameter
  • 6.5 GB at 1 byte per parameter

What those memory numbers leave out

Two caveats matter:

  • that estimate only counts stored weights, not activations or caches
  • training usually needs much more memory than inference because optimisation algorithms keep extra arrays around

For this assignment, the scale is tiny by comparison: d ≤ 64, the matrices are given to you as input, and you do not train anything.