Softmax
How raw scores become row-wise weights that sum to one.
Also called: row-wise softmax

Computation
Scores can be negative or positive, and they do not automatically add up to anything useful. We want weights that:
are not negative, and
sum to 1 (so we can interpret them like percentages).
Softmax is the function that does this to create our weights. You can think of softmax as "turn scores into percentages": the percentages in any given row of the attention weight matrix sum to 100%. Softmax therefore operates on each row vector of our $\mathbf{score}$ matrix, converting it into a new weight vector.
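As a sketch, this row-wise conversion might look like the following (the function name and example scores are illustrative, not from the text):

```python
import math

def softmax_row(scores):
    """Naive softmax: exponentiate each score, then divide by the row total."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# An illustrative row of raw scores: mixed signs, no useful sum on their own
row = [2.0, 1.0, -1.0]
weights = softmax_row(row)
# Every weight is positive and the row sums to 1, so each entry
# can be read as that position's share of the total.
```

Note that a larger score always yields a larger weight, so the ordering of the scores is preserved in the weights.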
For numerical stability, we first subtract the largest score in the row from every score in that row ("stable softmax"). This ensures that each exponential ranges over $(0,1]$, because every exponent is less than or equal to zero.
Hence our $\mathbf{weight}$ matrix using stable softmax is

$$\mathbf{weight}_{ij} = \frac{\exp\left(\mathbf{score}_{ij} - m_i\right)}{\sum_{k}\exp\left(\mathbf{score}_{ik} - m_i\right)}, \qquad m_i = \max_{k}\,\mathbf{score}_{ik}.$$
Add the Step 3 values to get the denominator, then divide each Step 3 value by that same denominator. The final row is non-negative and sums to 1.
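The procedure above can be sketched in code. The text only names Step 3 explicitly, so the earlier steps (find the row maximum, subtract it, exponentiate) are filled in here as assumptions:

```python
import math

def stable_softmax_row(scores):
    m = max(scores)                        # find the row maximum (assumed Step 1)
    shifted = [s - m for s in scores]      # subtract it from every score (assumed Step 2)
    exps = [math.exp(x) for x in shifted]  # exponentiate: these are the "Step 3 values"
    denom = sum(exps)                      # add the Step 3 values to get the denominator
    return [e / denom for e in exps]       # divide each Step 3 value by that denominator

weights = stable_softmax_row([5.0, 2.0, 0.0])
# The final row is non-negative and sums to 1, and the largest
# score receives the largest weight.
```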
What stable softmax changes
Only the intermediate exponentials change. The normalised weights stay the same.
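A quick numerical check of this claim, using simple list-based helpers (the function names are illustrative):

```python
import math

def naive_softmax_row(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def stable_softmax_row(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

row = [3.0, 1.0, 0.5]
naive = naive_softmax_row(row)
stable = stable_softmax_row(row)
# The intermediate exponentials differ by the constant factor exp(-max),
# which appears in both numerator and denominator and so cancels:
# the normalised weights agree.
```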
Masking rule still matters
Only unmasked positions belong in the maximum search and the denominator. If nothing is unmasked, the whole row is zeros.
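A minimal sketch of this masking rule, assuming a boolean list where `True` marks an unmasked position (that convention is an assumption, not from the text):

```python
import math

def masked_softmax_row(scores, unmasked):
    """unmasked[j] is True where position j may receive weight (assumed convention)."""
    if not any(unmasked):
        return [0.0] * len(scores)  # nothing unmasked: the whole row is zeros
    # The maximum search covers unmasked positions only.
    m = max(s for s, keep in zip(scores, unmasked) if keep)
    # Masked positions contribute 0 to the denominator.
    exps = [math.exp(s - m) if keep else 0.0
            for s, keep in zip(scores, unmasked)]
    denom = sum(exps)
    return [e / denom for e in exps]

row = masked_softmax_row([4.0, 1.0, 2.0], [True, False, True])
empty = masked_softmax_row([4.0, 1.0], [False, False])
```

The masked position ends up with exactly zero weight, and the unmasked positions still sum to 1; a fully masked row comes back as all zeros rather than dividing by zero.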
Softmax uses the exponential function because it has helpful properties: it is always positive, it grows smoothly with the score (so bigger scores get bigger weights), and its derivatives are simple, which matters during training (though we do no training here).