Attention Is All You Need · 2026 A1
Required concept
Why generation can reuse past keys and values instead of recomputing them.
The KV cache is a saved list of past key and value vectors.
Without it, generation would keep recomputing K and V for every earlier token at every step. That would repeat work unnecessarily.
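The saving is easy to count. Here is a minimal sketch (the helper names are hypothetical, not from the text) that tallies only the key/value projections in each regime:

```python
# Hypothetical helpers counting fresh key/value projections per regime.

def kv_projections_without_cache(num_steps):
    # At step t the prefix holds t earlier tokens; a plain forward pass
    # rebuilds k and v for all of them plus the new token: 2 * (t + 1).
    return sum(2 * (t + 1) for t in range(num_steps))

def kv_projections_with_cache(num_steps):
    # Only the new token's k and v are computed; the rest are reused.
    return 2 * num_steps

print(kv_projections_without_cache(10))  # 110 -- grows quadratically
print(kv_projections_with_cache(10))     # 20  -- grows linearly
```

The gap widens with sequence length: without a cache the total projection work is quadratic in the number of generated tokens, with a cache it is linear.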
With a cache, each generation step only needs to compute the new token’s query, key, and value, append the new key and value to the cache, and attend over the stored history.
Generation step walkthrough
Move one token through the cache, one stage at a time
Step 1 of 5
Step 1: the new generated token only needs one fresh query, key, and value.
(Diagram: the new token vector alongside the cached keys and cached values.)
Compute q, k, and v for the new token
Fresh work this step
Cost per generated token: three matrix-vector projections, each O(d²) in the model width d.
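As a concrete sketch of that fresh work (NumPy, toy sizes; the weight matrices W_Q, W_K, W_V are random stand-ins for learned parameters, and all names here are assumptions, not from the text):

```python
import numpy as np

d = 4  # model width (toy size; real models use hundreds to thousands)
rng = np.random.default_rng(0)
W_Q, W_K, W_V = rng.normal(size=(3, d, d))  # random stand-ins for learned weights

h2 = rng.normal(size=d)  # hidden state of the token being generated now

# Three matrix-vector products, each O(d^2): the only fresh work this step.
q2 = h2 @ W_Q
k2 = h2 @ W_K
v2 = h2 @ W_V
```

Each product multiplies a length-d vector by a d×d matrix, which is where the O(d²) per projection comes from.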
Use the weights on cached values
A later query can reuse the cache
Current focus
Only one new query, key, and value are computed for the token being generated now.
Why this matters
Earlier prompt and generated cache entries stay reusable instead of being recomputed from scratch.
h₂ · WQ → q₂
h₂ · WK → k₂
h₂ · WV → v₂
For each generated token, you still compute one fresh query, key, and value.
keys: [k₀, k₁] → [k₀, k₁, k₂]
values: [v₀, v₁] → [v₀, v₁, v₂]
Earlier cache entries stay exactly as they were.
q₂ · [k₀, k₁, k₂] → [s₀, s₁, s₂]
The new query compares against the whole stored history, including the just-appended key.
softmax([s₀, s₁, s₂]) → [w₀, w₁, w₂]
[w₀, w₁, w₂] × [v₀, v₁, v₂] → output₂
The output is built from cached values, not from recomputing older token states.
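The whole cached step can be sketched in NumPy. This is a single-head toy (the function name and variable names are assumptions), and the usual 1/√d score scaling is omitted to match the equations above:

```python
import numpy as np

def cached_attention_step(h_new, W_Q, W_K, W_V, k_cache, v_cache):
    """One generation step: fresh q/k/v for the new token, reuse the rest."""
    q = h_new @ W_Q
    k_cache.append(h_new @ W_K)   # keys:   [k0, k1] -> [k0, k1, k2]
    v_cache.append(h_new @ W_V)   # values: [v0, v1] -> [v0, v1, v2]
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = K @ q                # q2 . [k0, k1, k2] -> [s0, s1, s2]
    w = np.exp(scores - scores.max())
    w /= w.sum()                  # softmax turns scores into weights
    return w @ V                  # [w0, w1, w2] x [v0, v1, v2] -> output2

d = 4
rng = np.random.default_rng(0)
W_Q, W_K, W_V = rng.normal(size=(3, d, d))  # random stand-in weights
k_cache, v_cache = [], []
for h in rng.normal(size=(3, d)):  # three tokens, one cached step each
    out = cached_attention_step(h, W_Q, W_K, W_V, k_cache, v_cache)
print(len(k_cache))  # 3 -- the cache grew by one key per step
```

Note that the cache lists are only ever appended to; nothing already stored is touched again.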
h₃ · WQ → q₃
h₃ · WK → k₃
h₃ · WV → v₃
q₃ · [k₀, k₁, k₂, k₃] → scores
weights × [v₀, v₁, v₂, v₃] → output₃
The cache does not make q₃ free. It saves recomputing the older k and v entries.
That is why cached generation is “one new query against an ever-growing history” rather than “recompute everything from scratch”.
Without a cache, the model does not get to say “just compute q₃”.
At the next generation step, the input prefix is now the whole sequence so far. A plain Transformer forward pass processes that whole prefix again, which means it rebuilds the older positions’ activations too. So old k and v vectors such as k₃ and v₂ would be recomputed along with the new token’s work.
With a KV cache, those older k and v vectors are already stored. The model still computes the new token’s fresh projections, but it reuses the saved history instead of rebuilding it.
The cache is best understood as “saving work you already did”.
Prompt keys and values do not change while generation continues. So once they exist, you can keep them in a cache and only compute the new generated token’s k and v.
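A small check makes the point concrete: the cache changes the cost, not the math. The sketch below (NumPy, hypothetical names, random stand-in weights, no 1/√d scaling, matching the equations above) computes the last token’s attention output both ways:

```python
import numpy as np

d = 4
rng = np.random.default_rng(1)
W_Q, W_K, W_V = rng.normal(size=(3, d, d))  # random stand-in weights
H = rng.normal(size=(5, d))                 # hidden states: prompt + generated

def attend(q, K, V):
    s = K @ q                               # score scaling omitted, as in the text
    w = np.exp(s - s.max())
    return (w / w.sum()) @ V

# Plain forward pass: rebuild every key and value from scratch.
out_full = attend(H[-1] @ W_Q, H @ W_K, H @ W_V)

# Cached: older keys/values were saved on earlier steps; only the
# last token's projections are fresh work.
k_cache = [h @ W_K for h in H[:-1]] + [H[-1] @ W_K]
v_cache = [h @ W_V for h in H[:-1]] + [H[-1] @ W_V]
out_cached = attend(H[-1] @ W_Q, np.stack(k_cache), np.stack(v_cache))

print(np.allclose(out_full, out_cached))  # True: same output, far less repeated work
```

The two paths agree because the cached keys and values are exactly the vectors a full recompute would have rebuilt.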