KV Cache

Why generation can reuse past keys and values instead of recomputing them.

Also called: generation cache

The KV cache is a saved list of the key and value vectors already computed for earlier tokens.

Without it, generation would keep recomputing K and V for every earlier token at every step. That would repeat work unnecessarily.

With a cache, each generated step only needs to:

  1. compute the new token’s $\vec{\mathbf{q}}_{n+t}$, $\vec{\mathbf{k}}_{n+t}$, and $\vec{\mathbf{v}}_{n+t}$
  2. append the new key and value vectors to the cache
  3. score the new query vector against the cached keys
  4. build the weighted sum from the cached values
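The four steps above can be sketched as a minimal NumPy loop. Everything here is illustrative: the dimension `d`, the random projection matrices `W_Q`/`W_K`/`W_V`, and the random hidden states stand in for a real model's learned weights and activations.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # illustrative head dimension, not from the text

# Hypothetical projection matrices; in a real model these are learned.
W_Q, W_K, W_V = (rng.standard_normal((d, d)) for _ in range(3))

def step(h_new, k_cache, v_cache):
    """One cached generation step: fresh q/k/v, append, score, mix."""
    q = h_new @ W_Q                        # 1. fresh query for the new token
    k_cache.append(h_new @ W_K)            # 2. append the new key ...
    v_cache.append(h_new @ W_V)            #    ... and the new value
    scores = np.array([q @ k for k in k_cache]) / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()                           # 3. softmax scores over cached keys
    return sum(wi * vi for wi, vi in zip(w, v_cache))  # 4. mix cached values

k_cache, v_cache = [], []
for t in range(3):                         # generate three tokens
    out = step(rng.standard_normal(d), k_cache, v_cache)
# the cache grew by exactly one key and one value per token
```

Note that older cache entries are only read, never rewritten, which is exactly the work the cache saves.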

THE FOLLOWING DEMO IS UNDER CONSTRUCTION.

It may or may not fully work, or have inconsistent math notation!

Generation step walkthrough

Move one token through the cache, one stage at a time


Step 1: the new generated token only needs one fresh query, key, and value.

  1. Fresh q/k/v
  2. Append k/v
  3. Score with cache
  4. Mix cached values
  5. Reuse on next token

Cached keys

$\vec{\mathbf{k}}_0$
$\vec{\mathbf{k}}_1$
$\vec{\mathbf{k}}_2$

New token vector

$\vec{\mathbf{h}}_2$
$\vec{\mathbf{q}}_2$ $\vec{\mathbf{k}}_2$ $\vec{\mathbf{v}}_2$

Cached values

$\vec{\mathbf{v}}_0$
$\vec{\mathbf{v}}_1$
$\vec{\mathbf{v}}_2$

Compute $\vec{\mathbf{q}}$, $\vec{\mathbf{k}}$, and $\vec{\mathbf{v}}$ for the new token

Fresh work this step

$\vec{\mathbf{h}}_2 \cdot W_Q \to \vec{\mathbf{q}}_2$, $\quad \vec{\mathbf{h}}_2 \cdot W_K \to \vec{\mathbf{k}}_2$, $\quad \vec{\mathbf{h}}_2 \cdot W_V \to \vec{\mathbf{v}}_2$

Cost per generated token: 3 matrix-vector projections, so this scales as O(d²) in the vector size d.
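That per-token cost can be made concrete with a toy calculation. The sizes and placeholder matrices below are illustrative, not from the text:

```python
import numpy as np

d = 16
h = np.ones(d)                  # hypothetical hidden state for the new token
W_Q = W_K = W_V = np.eye(d)     # placeholder projection matrices

# Each projection is one matrix-vector product: about d*d multiply-adds.
q, k, v = h @ W_Q, h @ W_K, h @ W_V

flops_per_token = 3 * d * d     # three projections -> O(d^2) per generated token
```

The point is that this fresh-projection cost is fixed per token; only the scoring step grows with the length of the cached history.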

Score the new query against cached keys

$\begin{bmatrix}\vec{\mathbf{q}}_2\end{bmatrix} \begin{bmatrix} \vec{\mathbf{k}}_0 & \vec{\mathbf{k}}_1 & \vec{\mathbf{k}}_2 \end{bmatrix} \implies \begin{bmatrix} \overrightarrow{\textbf{score}}_2 \end{bmatrix} $

Use the weights on cached values

A softmax over the scores turns them into scalar weights $w_0, w_1, w_2$, which mix the cached values:

$\begin{bmatrix} \vec{\mathbf{v}}_0 & \vec{\mathbf{v}}_1 & \vec{\mathbf{v}}_2 \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \\ w_2 \end{bmatrix} \implies \overrightarrow{\textbf{output}}_2 $
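The score-and-mix stages can be sketched in NumPy with the cached keys and values stacked as rows. Sizes and random data are illustrative, and the softmax that turns raw scores into weights is written out explicitly:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
q2 = rng.standard_normal(d)
K = rng.standard_normal((3, d))   # cached keys k0, k1, k2 stacked as rows
V = rng.standard_normal((3, d))   # cached values v0, v1, v2 stacked as rows

scores = K @ q2 / np.sqrt(d)      # one score per cached key
w = np.exp(scores - scores.max())
w /= w.sum()                      # softmax: scores -> weights summing to 1
output2 = w @ V                   # weighted sum of the cached value rows
```

Stacking the cache as a matrix means both stages are single matrix-vector products, which is how real implementations batch this work.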

A later query can reuse the cache

Current focus

Only one new query, key, and value are computed for the token being generated now.

Why this matters

Earlier prompt and generated cache entries stay reusable instead of being recomputed from scratch.

The cache grows by one key and one value per generated token. The model only forms one fresh query, then reuses the stored history for scoring and value lookup.

KV cache worked example
1. Fresh work for the new token
h₂ · WQ → q₂
h₂ · WK → k₂
h₂ · WV → v₂

For each generated token, you still compute one fresh query, key, and value.

2. Append only the new key and value
keys:   [k₀, k₁]   →   [k₀, k₁, k₂]
values: [v₀, v₁]   →   [v₀, v₁, v₂]

Earlier cache entries stay exactly as they were.

3. Score the new query against cached keys
q₂ · [k₀, k₁, k₂] → [s₀, s₁, s₂]

The new query compares against the whole stored history, including the just-appended key.

4. Use those weights on cached values
[w₀, w₁, w₂] × [v₀, v₁, v₂] → output₂

The output is built from cached values, not from recomputing older token states.

5. The next token reuses the same cache
h₃ · WQ → q₃
q₃ · [k₀, k₁, k₂] → scores
weights × [v₀, v₁, v₂] → output₃

The cache does not make q₃ free. It saves recomputing the older k and v entries.

That is why cached generation is “one new query against an ever-growing history” rather than “recompute everything from scratch”.
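The whole worked example can be checked end to end: a cached generation loop produces exactly the same attention outputs as recomputing every key and value from scratch at each step. The dimensions, random weights, and hidden states below are illustrative stand-ins:

```python
import numpy as np

rng = np.random.default_rng(2)
d, T = 4, 5
H = rng.standard_normal((T, d))  # hidden states h0 .. h4
W_Q, W_K, W_V = (rng.standard_normal((d, d)) for _ in range(3))

def attn_last(K, V, q):
    """Attention output for one query against stacked keys/values."""
    s = K @ q / np.sqrt(d)
    w = np.exp(s - s.max())
    w /= w.sum()
    return w @ V

# No cache: reproject the whole prefix's K and V at every step.
no_cache = []
for t in range(T):
    K = H[: t + 1] @ W_K
    V = H[: t + 1] @ W_V
    no_cache.append(attn_last(K, V, H[t] @ W_Q))

# Cache: append one new key and value per step, reuse the rest.
cached, Ks, Vs = [], [], []
for t in range(T):
    Ks.append(H[t] @ W_K)
    Vs.append(H[t] @ W_V)
    cached.append(attn_last(np.array(Ks), np.array(Vs), H[t] @ W_Q))

same = all(np.allclose(a, b) for a, b in zip(no_cache, cached))
```

The two loops agree because the cached keys and values are numerically identical to the recomputed ones; the cache changes the cost, not the result.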

Why no-cache generation recomputes old K and V

Without a cache, the model does not get to say “just compute $\vec{\mathbf{q}}_3$”.

At the next generation step, the input prefix is now the whole sequence so far. A plain Transformer forward pass processes that whole prefix again, which means it rebuilds the older positions’ activations too. So old k and v vectors such as $\vec{\mathbf{k}}_2$ and $\vec{\mathbf{v}}_2$ would be recomputed along with the new token’s work.

With a KV cache, those older $\vec{\mathbf{k}}/\vec{\mathbf{v}}$ vectors are already stored. The model still computes the new token’s fresh projections, but it reuses the saved history instead of rebuilding it.
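A quick back-of-the-envelope count shows how much reprojection work the cache avoids. The sequence length below is an arbitrary example:

```python
# Key/value projections performed while generating T tokens.
T = 100
# No cache: step t reprojects k and v for all t+1 positions in the prefix.
no_cache_projs = sum(2 * (t + 1) for t in range(T))
# Cache: each step projects only the one new key and one new value.
cached_projs = 2 * T
```

Without the cache the projection count grows quadratically with sequence length (here 10,100 vs 200), which is why long generations are impractical without it.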

Why the cache exists at all

The cache is best understood as “saving work you already did”.

Prompt keys and values do not change while generation continues. So once they exist, you can keep them in a cache and only compute the new generated token’s $\vec{\mathbf{q}}$, $\vec{\mathbf{k}}$, and $\vec{\mathbf{v}}$.
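This "compute once, reuse forever" pattern is what systems call prefill: the prompt's keys and values are projected one time, and generation only appends. A minimal sketch, with illustrative sizes and placeholder weights:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 4
W_K, W_V = (rng.standard_normal((d, d)) for _ in range(2))

prompt = rng.standard_normal((6, d))  # hidden states for a 6-token prompt
K_cache = list(prompt @ W_K)          # prefill: project prompt keys once
V_cache = list(prompt @ W_V)          # ... and prompt values once

h_new = rng.standard_normal(d)        # first generated token's hidden state
K_cache.append(h_new @ W_K)           # only the new token's k and v are fresh
V_cache.append(h_new @ W_V)
```

After prefill, every later step touches the prompt entries read-only; they are computed exactly once no matter how long generation runs.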