The Softmax Function Derivative (Part 1)

Introduction

This post works through the calculations behind the derivative of the Softmax function, using Python. It is based on Eli Bendersky's excellent article, The Softmax function and its derivative.

The Softmax Function

The softmax function takes a vector of N dimensions and returns a probability distribution, also of N dimensions. Each element of the output is in the range (0,1) and the N elements sum to 1.0.

Each element of the output is given by the formula:

\sigma(z_{i})=\frac{e^{z_{i}}} {\sum_{j=1}^N e^{z_{j}}}

See https://en.wikipedia.org/wiki/Softmax_function for more details.

import numpy as np

# a random input vector of 5 values in [0, 1)
x = np.random.random([5])

def softmax_basic(z):
    # exponentiate each element and normalise by the sum of the exponentials
    exps = np.exp(z)
    sums = np.sum(exps)
    return np.divide(exps, sums)

softmax_basic(x)

This should generate an output that looks something like this:

array([0.16205871, 0.15308274, 0.22661096, 0.22634251, 0.23190508])

We expect that the sum of those will be (close to) 1.0:

np.sum(softmax_basic(x))

And so it is:

1.0
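
As an aside, when the inputs are large, np.exp can overflow. A common, numerically safer variant (just a sketch here, with the helper name softmax_stable chosen for this post) subtracts the maximum of the inputs before exponentiating; the shift cancels out in the ratio, so the result is unchanged:

def softmax_stable(z):
    # subtracting the max does not change the output because the common
    # factor e^(-max(z)) cancels in the numerator and denominator
    exps = np.exp(z - np.max(z))
    return exps / np.sum(exps)

np.sum(softmax_stable(x))

For the small random inputs used here it gives the same result as softmax_basic.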

Calculating the derivative

We need to calculate the partial derivative of the probability outputs S_{i} with respect to each of the inputs x_{j}. For example:

\frac{\delta{S_{i}}}{\delta{x_{j}}}=\frac{\delta{\frac{e^{x_{i}}} {\sum_{k=1}^N e^{x_{k}}}}}{\delta{x_{j}}}

Following Bendersky’s derivation, we need to use the quotient rule for derivatives:

\text{If }f(x)=\frac{g(x)}{h(x)} \text{ then:}

f'(x)=\frac{g'(x)h(x)-h'(x)g(x)}{[h(x)]^{2}}

From the Softmax function:

\begin{array}{rcl} g_{i} & = & e^{x_{i}}\\ h_{i} & = & \sum_{k=1}^N e^{x_{k}} \end{array}

The derivatives of these functions with respect to x_{j} are:

\frac{\delta{g_{i}}}{\delta{x_{j}}} = \left \{ \begin{aligned} &e^{x_{j}}, && \text{if}\ i=j \\ &0, && \text{otherwise} \end{aligned} \right.

and

\frac{\delta{h_{i}}}{\delta{x_{j}}}=\frac{\delta(e^{x_{1}}+e^{x_{2}}+...+e^{x_{N}})}{\delta{x_{j}}}=e^{x_{j}}

Now we have to evaluate the quotient rule for the two separate cases, where i=j and where i\neq{j}:

Starting with i=j:

\frac{\delta{\frac{e^{x_{i}}} {\sum_{k=1}^N e^{x_{k}}}}}{\delta{x_{j}}}=\frac{e^{x_{j}}\sum_{k=1}^N e^{x_{k}}-e^{x_{j}}e^{x_{i}}}{\left[\sum_{k=1}^N e^{x_{k}}\right]^{2}}

Now we’ll just simplify this a bit:

\begin{array}{rcl} \frac{\delta{\frac{e^{x_{i}}} {\sum_{k=1}^N e^{x_{k}}}}}{\delta{x_{j}}} & = & \frac{e^{x_{j}}\left(\sum_{k=1}^N e^{x_{k}}-e^{x_{i}}\right)}{\left[\sum_{k=1}^N e^{x_{k}}\right]^{2}}\\ &=&\frac{e^{x_{j}}}{\sum_{k=1}^N e^{x_{k}}} \cdot \frac{\sum_{k=1}^N e^{x_{k}}-e^{x_{i}}}{\sum_{k=1}^N e^{x_{k}}}\\ &=&\frac{e^{x_{j}}}{\sum_{k=1}^N e^{x_{k}}} \left(\frac{\sum_{k=1}^N e^{x_{k}}}{\sum_{k=1}^N e^{x_{k}}}-\frac{e^{x_{i}}}{\sum_{k=1}^N e^{x_{k}}}\right)\\ &=&\frac{e^{x_{j}}}{\sum_{k=1}^N e^{x_{k}}} \left(1-\frac{e^{x_{i}}}{\sum_{k=1}^N e^{x_{k}}}\right)\\ &=&\sigma(x_{j})(1-\sigma(x_{i})) \end{array}

For the case where i\neq j:

\begin{array}{rcl} \frac{\delta{\frac{e^{x_{i}}} {\sum_{k=1}^N e^{x_{k}}}}}{\delta{x_{j}}}&=&\frac{0-e^{x_{j}}e^{x_{i}}}{\left[\sum_{k=1}^N e^{x_{k}}\right]^{2}}\\ &=&0-\frac{e^{x_{j}}}{\sum_{k=1}^N e^{x_{k}}}\cdot\frac{e^{x_{i}}}{\sum_{k=1}^N e^{x_{k}}}\\ &=&0-\sigma(x_{j})\sigma(x_{i}) \end{array}

Here are some examples of those formulas in code:

S = softmax_basic(x)

# where i = j
div = S[0] * (1 - S[0])
print("S[0] :", S[0], "div :", div)

div = S[1] * (1 - S[1])
print("S[1] :", S[1], "div :", div)

div = S[2] * (1 - S[2])
print("S[2] :", S[2], "div :", div)

# where i != j
div = 0 - (S[0] * S[1])
print("S[0] :", S[0], "S[1] :", S[1], "div :", div)

div = 0 - (S[1] * S[2])
print("S[1] :", S[1], "S[2] :", S[2], "div :", div)

div = 0 - (S[2] * S[3])
print("S[2] :", S[2], "S[3] :", S[3], "div :", div)

Here’s the output (yours may have different numbers):

S[0] : 0.16205870737014744 div : 0.13579568273566436
S[1] : 0.1530827392322831 div : 0.12964841418142392
S[2] : 0.22661095862407418 div : 0.1752584320555523
S[0] : 0.16205870737014744 S[1] : 0.1530827392322831 div : -0.024808390840665155
S[1] : 0.1530827392322831 S[2] : 0.22661095862407418 div : -0.034690226286226845
S[2] : 0.22661095862407418 S[3] : 0.226342511074585 div : -0.051291693411991836
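
We can also sanity-check one of those values numerically with a finite-difference approximation (a quick sketch, reusing x, S and softmax_basic from above; eps and the variable names are just chosen for this check). Nudging x[1] by a small amount and watching how S[0] changes should give roughly the same number as the analytic formula for the i \neq j case:

eps = 1e-6

x_plus = x.copy()
x_plus[1] += eps                      # perturb the second input slightly
S_plus = softmax_basic(x_plus)

numeric = (S_plus[0] - S[0]) / eps    # approximates dS[0]/dx[1]
analytic = 0 - (S[0] * S[1])          # the i != j formula from above
print("numeric :", numeric, "analytic :", analytic)

The two numbers should agree to several decimal places.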

Let’s tidy up the formula a little so we can come up with a decent implementation in the code.

Looking at the case where i=j:

\frac{\delta{S_{i}}}{\delta{x_{j}}}=\sigma(x_{j})(1-\sigma(x_{i}))\\ =\sigma(x_{j}) - \sigma(x_{j})\sigma(x_{i})

And where i\neq j

\frac{\delta{S_{i}}}{\delta{x_{j}}}=0-\sigma(x_{j})\sigma(x_{i})
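
As a side note, both cases can be written as a single expression using the Kronecker delta \delta_{ij} (equal to 1 when i=j and 0 otherwise); this is just a compact way of stating the two formulas above:

\frac{\delta{S_{i}}}{\delta{x_{j}}}=\sigma(x_{j})\left(\delta_{ij}-\sigma(x_{i})\right)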

We can generalise that formula by calculating the Jacobian matrix. This matrix will look like this:

\frac{\delta{S}}{\delta{x}} = \left [ \begin{array}{ccc} \frac{\delta{S_{1}}}{\delta{x_{1}}} & \ldots & \frac{\delta{S_{1}}}{\delta{x_{N}}} \\ \ldots & \frac{\delta{S_{i}}}{\delta{x_{j}}} & \ldots \\ \frac{\delta{S_{N}}}{\delta{x_{1}}} & \ldots & \frac{\delta{S_{N}}}{\delta{x_{N}}} \end{array} \right ]

\frac{\delta{S}}{\delta{x}} = \left [ \begin{array}{ccc} \sigma(x_{1}) - \sigma(x_{1})\sigma(x_{1}) & \ldots & 0-\sigma(x_{1})\sigma(x_{N}) \\ \ldots & \sigma(x_{j}) - \sigma(x_{j})\sigma(x_{i}) & \ldots \\ 0-\sigma(x_{N})\sigma(x_{1}) & \ldots & \sigma(x_{N}) - \sigma(x_{N})\sigma(x_{N}) \end{array} \right ]

\frac{\delta{S}}{\delta{x}} = \left [ \begin{array}{ccc} \sigma(x_{1}) & \ldots & 0 \\ \ldots & \sigma(x_{j}) & \ldots \\ 0 & \ldots & \sigma(x_{N})\end{array} \right ] - \left[ \begin{array}{ccc} \sigma(x_{1})\sigma(x_{1}) & \ldots & \sigma(x_{1})\sigma(x_{N}) \\ \ldots & \sigma(x_{j})\sigma(x_{i}) & \ldots \\ \sigma(x_{N})\sigma(x_{1}) & \ldots & \sigma(x_{N})\sigma(x_{N}) \end{array} \right ]

The matrix on the left is simply the vector S laid out along the diagonal. NumPy provides a diag function to do this for us:

np.diag(S)

And that generates a matrix for us:

array([[0.16205871, 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.15308274, 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.22661096, 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.22634251, 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.23190508]])

The matrix on the right consists of every element in S multiplied by every element in S. So we can create the first matrix by repeating S in rows, create the second matrix by repeating S in columns (i.e. transposing the first matrix) and, finally, multiply them together element-wise. NumPy comes to our aid again by providing a tile function:

# lay S out as a column vector, then tile it to form an N x N matrix
S_vector = S.reshape(S.shape[0],1)
S_matrix = np.tile(S_vector,S.shape[0])
print(S_matrix, '\n')
print(np.transpose(S_matrix))

And the output of this (again, yours will have different numbers due to the random initialisation):

[[0.16205871 0.16205871 0.16205871 0.16205871 0.16205871]
 [0.15308274 0.15308274 0.15308274 0.15308274 0.15308274]
 [0.22661096 0.22661096 0.22661096 0.22661096 0.22661096]
 [0.22634251 0.22634251 0.22634251 0.22634251 0.22634251]
 [0.23190508 0.23190508 0.23190508 0.23190508 0.23190508]] 

[[0.16205871 0.15308274 0.22661096 0.22634251 0.23190508]
 [0.16205871 0.15308274 0.22661096 0.22634251 0.23190508]
 [0.16205871 0.15308274 0.22661096 0.22634251 0.23190508]
 [0.16205871 0.15308274 0.22661096 0.22634251 0.23190508]
 [0.16205871 0.15308274 0.22661096 0.22634251 0.23190508]]
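
As a side note, NumPy's outer function builds the same matrix of pairwise products in a single call, so it should give exactly the same result as S_matrix * np.transpose(S_matrix):

np.outer(S, S)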

Finally, let’s bring all of that together into a single expression that calculates the Jacobian of the Softmax function:

np.diag(S) - (S_matrix * np.transpose(S_matrix))

Example output:

array([[ 0.13579568, -0.02480839, -0.03672428, -0.03668077, -0.03758224],
       [-0.02480839,  0.12964841, -0.03469023, -0.03464913, -0.03550067],
       [-0.03672428, -0.03469023,  0.17525843, -0.05129169, -0.05255223],
       [-0.03668077, -0.03464913, -0.05129169,  0.17511158, -0.05248998],
       [-0.03758224, -0.03550067, -0.05255223, -0.05248998,  0.17812512]])
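
To tidy this up, here is a small sketch that wraps those steps into a helper function (the name softmax_jacobian is just chosen for this post). A useful sanity check is that each row of the Jacobian sums to (approximately) zero, because the softmax outputs always sum to 1:

def softmax_jacobian(S):
    # S is the softmax output vector; the Jacobian is diag(S) minus
    # the matrix of all pairwise products S[i] * S[j]
    return np.diag(S) - np.outer(S, S)

J = softmax_jacobian(S)
print(np.sum(J, axis=1))   # each row should sum to ~0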

So that’s the Softmax function and its derivative.

In the next part, we’ll look at how to combine this derivative with the summation function, common in artificial neural networks.