Causal and Padding Masks

The rules that decide which positions are ignored and why.

Also called: causal mask, padding mask

Masking means “treat this position as if it is not available”.

The prompt stages use two masks at the same time.

Causal mask

When an LLM generates text left to right, a query token must not use information from tokens that come after it. So when computing the attention scores for the query vector of prompt token $i$, the key vectors in $\mathbf{K}$ that belong to any prompt token $j$ with $j > i$ are masked out (forced to have weight 0.0 in stage 4).

This is called causal (or autoregressive) self-attention. It stops the model from “looking into the future”.

That is why prompt attention only looks backward or at the current token.
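The causal rule can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used by any particular model; the function name `causal_mask` and the 1/0 encoding (1 = visible, 0 = masked) are assumptions chosen to match the padding mask convention used on this page.

```python
def causal_mask(n):
    """Build an n x n grid: rows are queries i, columns are keys j.

    Entry [i][j] is 1 when query i may look at key j (j <= i),
    and 0 when key j lies in the future (j > i).
    """
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


# Hypothetical 3-token prompt.
for row in causal_mask(3):
    print(row)
# [1, 0, 0]
# [1, 1, 0]
# [1, 1, 1]
```

Each row only has 1s at or to the left of the diagonal, which is the "backward or current token" pattern described above.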

Padding mask

Usually we store a prompt in a fixed-size array, but not every slot is actually used (e.g., sentences have different lengths). The unused slots are called padding. A padding mask tells you which prompt positions are real (1) vs. padding (0). Without the padding mask, the model would treat the padding as if it carried meaning, and our LLM would output a bunch of padding between other tokens. Not ideal!
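As a rough sketch, a padding mask can be derived from the number of real tokens. The helper name `padding_mask` is hypothetical, and the code assumes the real tokens come first and the padding fills the slots at the end (right-padding); a left-padded layout would need the comparison flipped.

```python
def padding_mask(length, max_len):
    """Return a list of max_len flags: 1 = real token, 0 = padding.

    Assumes right-padding: the first `length` slots hold real tokens.
    """
    if not 0 <= length <= max_len:
        raise ValueError("length must be between 0 and max_len")
    return [1 if pos < length else 0 for pos in range(max_len)]


# A 2-token prompt stored in a 3-slot array.
print(padding_mask(2, 3))  # [1, 1, 0]
```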

THE FOLLOWING DEMO IS UNDER CONSTRUCTION.

It may or may not fully work, or have inconsistent math notation!

Prompt masking demo

Pick one query position and watch the rules remove cells

Step 1: start with i = 1 as the current query position before either masking rule removes anything.

Prompt padding mask

mask[0] = 1, mask[1] = 1, mask[2] = 0

Current result

Query 1 (row) may attend to keys 0 and 1 (columns). Column 2 is padding, so it stays masked.

Legend: current query position; still allowed after masking; removed by the causal rule; removed because the prompt mask is 0.

Rows are queries and columns are keys. For prompt position i, the model first removes future positions, then removes padded positions. If mask[i] = 0, the whole score row is masked out.

Prompt masking worked example

Prompt mask: [1, 1, 0]

Inspect position: i = 1

1. Raw score row
| j | 0 | 1 | 2 |
|---|---|---|---|
| i = 1 | $\text{score}_{1,0}$ | $\text{score}_{1,1}$ | $\text{score}_{1,2}$ |

All three positions are still visible before masking.

2. After causal mask
| j | 0 | 1 | 2 |
|---|---|---|---|
| i = 1 | $\text{score}_{1,0}$ | $\text{score}_{1,1}$ | future |

j = 2 is blocked because 2 > 1.

3. After padding mask
| j | 0 | 1 | 2 |
|---|---|---|---|
| i = 1 | $\text{score}_{1,0}$ | $\text{score}_{1,1}$ | padding |

Final usable positions for i = 1: j = 0 and j = 1.
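The worked example can be reproduced with a small sketch that applies both rules to one query position. The function name `allowed_keys` is hypothetical; it follows the rules stated on this page (drop future keys, drop padded keys, and mask the whole row if the query itself is padding).

```python
def allowed_keys(i, mask):
    """Return the key positions j that query i may still use.

    mask[j] is 1 for a real token and 0 for padding.
    """
    if mask[i] == 0:
        # The query itself is padding: the whole score row is masked.
        return []
    return [
        j
        for j in range(len(mask))
        if j <= i            # causal rule: no future positions
        and mask[j] == 1     # padding rule: no padded positions
    ]


prompt_mask = [1, 1, 0]
print(allowed_keys(1, prompt_mask))  # [0, 1]
print(allowed_keys(2, prompt_mask))  # []
```

For i = 1 the result matches the table: only j = 0 and j = 1 survive.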

If the query position itself is padding
If mask[2] = 0, then $\text{score}_{2,j} \to -\infty$ for all $j$.

At the score stage, a padded prompt position has its whole score row masked. In the later weight and output stages, that row then becomes a vector of zeros.
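One way to get from masked scores to weights is sketched below. It assumes the weights come from a softmax over each score row, and that a fully masked row is set to zeros explicitly, since a plain softmax over a row of all $-\infty$ would divide zero by zero. The helper names `mask_scores` and `masked_softmax` and the example score values are made up for illustration.

```python
import math

NEG_INF = float("-inf")


def mask_scores(scores, i, mask):
    """Replace every disallowed score in query i's row with -infinity."""
    return [
        s if (mask[i] == 1 and j <= i and mask[j] == 1) else NEG_INF
        for j, s in enumerate(scores)
    ]


def masked_softmax(row):
    """Softmax that gives weight 0.0 to masked entries.

    A fully masked row returns all zeros instead of dividing by zero.
    """
    finite = [s for s in row if s != NEG_INF]
    if not finite:
        return [0.0] * len(row)
    peak = max(finite)  # subtract the max for numerical stability
    exps = [0.0 if s == NEG_INF else math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


prompt_mask = [1, 1, 0]
raw_row = [2.0, 2.0, 5.0]  # made-up scores, reused for both queries

print(masked_softmax(mask_scores(raw_row, 1, prompt_mask)))  # [0.5, 0.5, 0.0]
print(masked_softmax(mask_scores(raw_row, 2, prompt_mask)))  # [0.0, 0.0, 0.0]
```

The large score at j = 2 has no effect on query 1 because it is masked before the softmax, and the padded query at i = 2 ends up with a zero weight vector.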