Attention Is All You Need · 2026 A1
Required concept
The rules that decide which positions are ignored and why.
Masking means “treat this position as if it is not available”.
The prompt stages use two masks at the same time.
When an LLM is generating text left-to-right, it must not use information from future tokens for a given query token.
So when computing the attention score for the query vector of prompt token $i$, the key vectors of $\mathbf{K}$ corresponding to prompt tokens $j$ with $j > i$ are masked out (forced to weight 0.0 in stage 4).
This is called causal (or autoregressive) self-attention. For prompt position i, positions with j > i are masked. This stops the model from “looking into the future”.
That is why prompt attention only looks backward or at the current token.
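The causal rule can be sketched with NumPy. This is a minimal illustration, not the text's implementation; the function names `causal_mask` and `masked_scores` are made up here:

```python
import numpy as np

def causal_mask(n):
    """Boolean mask: True where key position j is visible to query i (j <= i)."""
    return np.tril(np.ones((n, n), dtype=bool))

def masked_scores(scores, mask):
    """Set blocked positions to -inf so softmax later gives them weight 0.0."""
    return np.where(mask, scores, -np.inf)

scores = np.zeros((3, 3))  # stand-in for the raw Q·K^T scores
out = masked_scores(scores, causal_mask(3))
# row i = 1 keeps j = 0 and j = 1; j = 2 is -inf because it is in the future
```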
Usually we store a prompt in a fixed-size array, but not every slot is actually used (e.g. sentences have different lengths).
The unused slots are called padding.
A padding mask tells you which prompt positions are real (1) vs padding (0).
If we did not have the padding mask, the padding would be treated as if it carried real meaning,
and our LLM would output stray padding between other tokens. Not ideal!
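The padding rule can be sketched the same way. This is an illustrative NumPy snippet, using the demo's mask [1, 1, 0]; the helper name `padding_key_mask` is made up here:

```python
import numpy as np

prompt_mask = np.array([1, 1, 0])  # the demo's mask: position 2 is padding

def padding_key_mask(prompt_mask):
    """True where key position j is a real token, broadcast over all query rows."""
    n = prompt_mask.shape[0]
    return np.broadcast_to(prompt_mask.astype(bool), (n, n))

scores = np.zeros((3, 3))  # stand-in for the raw Q·K^T scores
blocked = np.where(padding_key_mask(prompt_mask), scores, -np.inf)
# every row now has -inf in column j = 2, so padding gets weight 0.0
```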
Prompt masking demo
Pick one query position and watch the rules remove cells
Step 1 of 3
Step 1: start with i = 1 as the current query position before either masking rule removes anything.
| query \ key | j = 0 | j = 1 | j = 2 |
|---|---|---|---|
| i = 0 | \(\text{score}_{0,0}\) | \(\text{score}_{0,1}\) | \(\text{score}_{0,2}\) |
| i = 1 | \(\text{score}_{1,0}\) | \(\text{score}_{1,1}\) | \(\text{score}_{1,2}\) |
| i = 2 | \(\text{score}_{2,0}\) | \(\text{score}_{2,1}\) | \(\text{score}_{2,2}\) |
For each query position i, the model first removes future positions, then removes padded positions. If mask[i] = 0, the whole score row is masked out. Prompt mask: [1, 1, 0]
Inspect position: i = 1
| j | 0 | 1 | 2 |
|---|---|---|---|
| i = 1 | \(\text{score}_{1,0}\) | \(\text{score}_{1,1}\) | \(\text{score}_{1,2}\) |

All three positions are still visible before masking.
| j | 0 | 1 | 2 |
|---|---|---|---|
| i = 1 | \(\text{score}_{1,0}\) | \(\text{score}_{1,1}\) | future |

j = 2 is blocked because 2 > 1.
| j | 0 | 1 | 2 |
|---|---|---|---|
| i = 1 | \(\text{score}_{1,0}\) | \(\text{score}_{1,1}\) | padding |

Final usable positions for i = 1: j = 0 and j = 1.
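The demo's steps for i = 1 can be run end to end with NumPy. The score values below are made-up numbers and `softmax` is a plain implementation, just to show the two rules composing:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

prompt_mask = np.array([1, 1, 0])              # position 2 is padding
i = 1
scores_i = np.array([0.2, 0.5, 0.9])           # hypothetical scores for query i = 1

# causal rule (j <= i) AND padding rule (mask[j] == 1)
visible = (np.arange(3) <= i) & prompt_mask.astype(bool)
weights = softmax(np.where(visible, scores_i, -np.inf))
# weights[2] is exactly 0.0; weights[0] and weights[1] sum to 1.0
```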
If mask[2] = 0, then \(\text{score}_{2,j} \to -\infty\) for all j.
At the score stage, a padded prompt position is fully masked. In the later weight/output stages, that row then becomes a vector of zeros.
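One numerical wrinkle worth showing: a naive softmax of an all \(-\infty\) row produces NaN, so an implementation has to zero such rows explicitly. A minimal NumPy sketch (the helper name `safe_softmax` is made up here):

```python
import numpy as np

prompt_mask = np.array([1, 1, 0], dtype=bool)
scores = np.zeros((3, 3))
scores = np.where(prompt_mask[None, :], scores, -np.inf)  # key-side padding
scores = np.where(prompt_mask[:, None], scores, -np.inf)  # query-side: whole row i = 2

def safe_softmax(s):
    """Row softmax that turns a fully masked (all -inf) row into zeros."""
    m = s.max(axis=-1, keepdims=True)
    m = np.where(np.isinf(m), 0.0, m)          # avoid -inf - (-inf) = NaN
    z = np.exp(s - m)
    denom = z.sum(axis=-1, keepdims=True)
    # where the row summed to 0 (fully masked), emit zeros instead of dividing
    return np.divide(z, denom, out=np.zeros_like(z), where=denom != 0.0)

weights = safe_softmax(scores)
# row 2 is all zeros: the padded position contributes nothing to the output
```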