Attention Is All You Need · 2026 A1
Required concept
The rules that decide which positions are ignored and why.
Masking means “treat this position as if it is not available”.
The prompt stages use two masks at the same time.
When an LLM is generating text left-to-right, it must not use information from future tokens for a given query token.
So when computing the attention score for the query vector of prompt token $i$, the key vectors of $\mathbf{K}$ corresponding to prompt tokens $j$ with $j > i$ are masked out (forced to weight 0.0 in stage 4).
This is called causal (or autoregressive) self-attention. For prompt position i, positions with j > i are masked. This stops the model from “looking into the future”.
That is why prompt attention only looks backward or at the current token.
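The causal rule can be sketched with NumPy. This is a minimal illustration, not the text's implementation; the function names `causal_mask` and `masked_scores` are made up here:

```python
import numpy as np

def causal_mask(n):
    """Boolean mask: True where key position j is visible to query i (j <= i)."""
    return np.tril(np.ones((n, n), dtype=bool))

def masked_scores(scores, mask):
    """Set blocked positions to -inf so softmax later gives them weight 0.0."""
    return np.where(mask, scores, -np.inf)

scores = np.zeros((3, 3))  # stand-in for the raw Q·K^T scores
out = masked_scores(scores, causal_mask(3))
# row i = 1 keeps j = 0 and j = 1; j = 2 is -inf because it is in the future
```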
Usually we store a prompt in a fixed-size array, but not every slot is actually used (e.g. sentences have different lengths).
The unused slots are called padding.
A padding mask tells you which prompt positions are real (1) vs padding (0).
If we did not have the padding mask, the padding would be treated as if it carried real meaning,
and our LLM would output stray padding between other tokens. Not ideal!
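The padding rule can be sketched the same way. This is an illustrative NumPy snippet, using the demo's mask [1, 1, 0]; the helper name `padding_key_mask` is made up here:

```python
import numpy as np

prompt_mask = np.array([1, 1, 0])  # the demo's mask: position 2 is padding

def padding_key_mask(prompt_mask):
    """True where key position j is a real token, broadcast over all query rows."""
    n = prompt_mask.shape[0]
    return np.broadcast_to(prompt_mask.astype(bool), (n, n))

scores = np.zeros((3, 3))  # stand-in for the raw Q·K^T scores
blocked = np.where(padding_key_mask(prompt_mask), scores, -np.inf)
# every row now has -inf in column j = 2, so padding gets weight 0.0
```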
Prompt masking demo
Pick one query position and watch the rules remove cells
Step 1 of 3
Step 1: start with i = 1 as the current query position before either masking rule removes anything.
| query \ key | j = 0 | j = 1 | j = 2 |
|---|---|---|---|
| i = 0 | \(\text{score}_{0,0}\) | \(\text{score}_{0,1}\) | \(\text{score}_{0,2}\) |
| i = 1 | \(\text{score}_{1,0}\) | \(\text{score}_{1,1}\) | \(\text{score}_{1,2}\) |
| i = 2 | \(\text{score}_{2,0}\) | \(\text{score}_{2,1}\) | \(\text{score}_{2,2}\) |
For each query position i, the model first removes future positions, then removes padded positions. If mask[i] = 0, the whole score row is masked out. Prompt mask: [1, 1, 0]
Inspect position: i = 1
| j | 0 | 1 | 2 |
|---|---|---|---|
| i = 1 | \(\text{score}_{1,0}\) | \(\text{score}_{1,1}\) | \(\text{score}_{1,2}\) |

All three positions are still visible before masking.
| j | 0 | 1 | 2 |
|---|---|---|---|
| i = 1 | \(\text{score}_{1,0}\) | \(\text{score}_{1,1}\) | future |

j = 2 is blocked because 2 > 1.
| j | 0 | 1 | 2 |
|---|---|---|---|
| i = 1 | \(\text{score}_{1,0}\) | \(\text{score}_{1,1}\) | padding |

Final usable positions for i = 1: j = 0 and j = 1.
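The demo's steps for i = 1 can be run end to end with NumPy. The score values below are made-up numbers and `softmax` is a plain implementation, just to show the two rules composing:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

prompt_mask = np.array([1, 1, 0])              # position 2 is padding
i = 1
scores_i = np.array([0.2, 0.5, 0.9])           # hypothetical scores for query i = 1

# causal rule (j <= i) AND padding rule (mask[j] == 1)
visible = (np.arange(3) <= i) & prompt_mask.astype(bool)
weights = softmax(np.where(visible, scores_i, -np.inf))
# weights[2] is exactly 0.0; weights[0] and weights[1] sum to 1.0
```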
If mask[2] = 0, then \(\text{score}_{2,j} \to -\infty\) for all j.
At the score stage, a padded prompt position is fully masked. In the later weight/output stages, that row then becomes a vector of zeros.
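One numerical wrinkle worth showing: a naive softmax of an all \(-\infty\) row produces NaN, so an implementation has to zero such rows explicitly. A minimal NumPy sketch (the helper name `safe_softmax` is made up here):

```python
import numpy as np

prompt_mask = np.array([1, 1, 0], dtype=bool)
scores = np.zeros((3, 3))
scores = np.where(prompt_mask[None, :], scores, -np.inf)  # key-side padding
scores = np.where(prompt_mask[:, None], scores, -np.inf)  # query-side: whole row i = 2

def safe_softmax(s):
    """Row softmax that turns a fully masked (all -inf) row into zeros."""
    m = s.max(axis=-1, keepdims=True)
    m = np.where(np.isinf(m), 0.0, m)          # avoid -inf - (-inf) = NaN
    z = np.exp(s - m)
    denom = z.sum(axis=-1, keepdims=True)
    # where the row summed to 0 (fully masked), emit zeros instead of dividing
    return np.divide(z, denom, out=np.zeros_like(z), where=denom != 0.0)

weights = safe_softmax(scores)
# row 2 is all zeros: the padded position contributes nothing to the output
```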