Softmax
How raw scores become row-wise weights that sum to one.
Also called: row-wise softmax

Computation
Scores can be negative or positive, and they do not automatically add up to anything useful. We want weights that:
are not negative, and
sum to 1 (so we can interpret them like percentages).
Softmax is the function that does this to create our weights. You can think of softmax as "turn scores into percentages": the percentages in any given row of the attention weight matrix sum to 100%. Softmax therefore operates on each row vector of our $\mathbf{score}$ matrix, converting it into a new weight vector.
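As a sketch, this row-wise conversion might look like the following (the function name and example scores are illustrative, not from the text):

```python
import math

def softmax_row(scores):
    """Naive softmax: exponentiate each score, then divide by the row total."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# An illustrative row of raw scores: mixed signs, no useful sum on their own
row = [2.0, 1.0, -1.0]
weights = softmax_row(row)
# Every weight is positive and the row sums to 1, so each entry
# can be read as that position's share of the total.
```

Note that a larger score always yields a larger weight, so the ordering of the scores is preserved in the weights.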
For numerical stability, we first subtract the largest score in the row from every score in that row ("stable softmax"). This ensures that each exponential ranges over $(0,1]$, because every exponent is less than or equal to zero.
Hence our $\mathbf{weight}$ matrix using stable softmax is

$$\mathbf{weight}_{ij} = \frac{\exp\left(\mathbf{score}_{ij} - m_i\right)}{\sum_{k}\exp\left(\mathbf{score}_{ik} - m_i\right)}, \qquad m_i = \max_{k}\,\mathbf{score}_{ik}.$$
Add the Step 3 values to get the denominator, then divide each Step 3 value by that same denominator. The final row is non-negative and sums to 1.
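The procedure above can be sketched in code. The text only names Step 3 explicitly, so the earlier steps (find the row maximum, subtract it, exponentiate) are filled in here as assumptions:

```python
import math

def stable_softmax_row(scores):
    m = max(scores)                        # find the row maximum (assumed Step 1)
    shifted = [s - m for s in scores]      # subtract it from every score (assumed Step 2)
    exps = [math.exp(x) for x in shifted]  # exponentiate: these are the "Step 3 values"
    denom = sum(exps)                      # add the Step 3 values to get the denominator
    return [e / denom for e in exps]       # divide each Step 3 value by that denominator

weights = stable_softmax_row([5.0, 2.0, 0.0])
# The final row is non-negative and sums to 1, and the largest
# score receives the largest weight.
```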
What stable softmax changes
Only the intermediate exponentials change. The normalised weights stay the same.
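A quick numerical check of this claim, using simple list-based helpers (the function names are illustrative):

```python
import math

def naive_softmax_row(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def stable_softmax_row(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

row = [3.0, 1.0, 0.5]
naive = naive_softmax_row(row)
stable = stable_softmax_row(row)
# The intermediate exponentials differ by the constant factor exp(-max),
# which appears in both numerator and denominator and so cancels:
# the normalised weights agree.
```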
Masking rule still matters
Only unmasked positions belong in the maximum search and the denominator. If nothing is unmasked, the whole row is zeros.
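A minimal sketch of this masking rule, assuming a boolean list where `True` marks an unmasked position (that convention is an assumption, not from the text):

```python
import math

def masked_softmax_row(scores, unmasked):
    """unmasked[j] is True where position j may receive weight (assumed convention)."""
    if not any(unmasked):
        return [0.0] * len(scores)  # nothing unmasked: the whole row is zeros
    # The maximum search covers unmasked positions only.
    m = max(s for s, keep in zip(scores, unmasked) if keep)
    # Masked positions contribute 0 to the denominator.
    exps = [math.exp(s - m) if keep else 0.0
            for s, keep in zip(scores, unmasked)]
    denom = sum(exps)
    return [e / denom for e in exps]

row = masked_softmax_row([4.0, 1.0, 2.0], [True, False, True])
empty = masked_softmax_row([4.0, 1.0], [False, False])
```

The masked position ends up with exactly zero weight, and the unmasked positions still sum to 1; a fully masked row comes back as all zeros rather than dividing by zero.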
Softmax uses the exponential function because it has helpful properties: it is always positive, it grows smoothly with the score (so bigger scores get bigger weights), and its derivatives are simple, which matters during training (though we do no training here).