### Introduction

This post demonstrates the calculations behind the evaluation of the Softmax Derivative using Python. It is based on the excellent article by Eli Bendersky which can be found here.

### The Softmax Function

The softmax function simply takes a vector of N dimensions and returns a probability distribution also of N dimensions. Each element of the output is in the range (0,1) and the sum of the elements of N is 1.0.

Each element of the output is given by the formula:

See https://en.wikipedia.org/wiki/Softmax_function for more details.

import numpy as np x = np.random.random([5]) def softmax_basic(z): exps = np.exp(z) sums = np.sum(exps) return np.divide(exps, sums) softmax_basic(x)

This should generate an output that looks something like this:

array([0.97337094, 0.85251098, 0.62495691, 0.63957056, 0.6969253 ])

We expect that the sum of those will be (close to) 1.0:

np.sum(softmax_basic(x))

And so it is:

1.0

### Calculating the derivative

We need to calculate the partial derivative of the probability outputs with respect to each of the inputs . For example:

Following Bendersky’s derivation, we need to use the quotient rule for derivatives:

From the Softmax function:

The derivatives of these functions with respect to are:

and

Now we have to evalutate the quotient rule for the two seperate cases where and where :

Starting with :

Now we’ll just simplify this a bit:

For the case where :

Here are some examples of those function:

S = softmax_basic(x) # where i = j div = S[0] * (1 - S[0]) print("S[0] :", S[0], "div :", div) div = S[1] * (1 - S[1]) print("S[1] :", S[1], "div :", div) div = S[2] * (1 - S[2]) print("S[2] :", S[2], "div :", div) # where i != j div = 0 - (S[0] * S[1]) print("S[0] :", S[0], "S[1] :", S[1], "div :", div) div = 0 - (S[1] * S[2]) print("S[1] :", S[1], "S[2] :", S[2], "div :", div) div = 0 - (S[2] * S[3]) print("S[2] :", S[2], "S[3] :", S[3], "div :", div)

Here’s the output (yours may have different numbers):

S[0] : 0.16205870737014744 div : 0.13579568273566436 S[1] : 0.1530827392322831 div : 0.12964841418142392 S[2] : 0.22661095862407418 div : 0.1752584320555523 S[0] : 0.16205870737014744 S[1] : 0.1530827392322831 div : -0.024808390840665155 S[1] : 0.1530827392322831 S[2] : 0.22661095862407418 div : -0.034690226286226845 S[2] : 0.22661095862407418 S[3] : 0.226342511074585 div : -0.051291693411991836

Let’s tidy up the formula a little so we can come up with a decent implementation in the code.

Looking at the case where :

And where

We can generalise that formula by calculating the Jacobian matrix. This matrix will look like this:

The matrix on the left is simply the vector S laid out along a diagonal. Numpy provides a diag function to do this for us:

np.diag(S)

And that generates a matrix for us:

array([[0.16205871, 0. , 0. , 0. , 0. ], [0. , 0.15308274, 0. , 0. , 0. ], [0. , 0. , 0.22661096, 0. , 0. ], [0. , 0. , 0. , 0.22634251, 0. ], [0. , 0. , 0. , 0. , 0.23190508]])

The matrix on the right is comprised of every element in S multiplied by all the elements in S. So we can create the first matrix by repeating S in rows, create the second matrix by repeating S in columns (i.e. transposing the first matrix) and finally, multiplying them together element-wise. Numpy comes to our help again by providing a tile function:

S_vector = S.reshape(S.shape[0],1) S_matrix = np.tile(S_vector,S.shape[0]) print(S_matrix, '\n') print(np.transpose(S_matrix))

And the output of this (again, yours will have different numbers due to the random initialisation):

[[0.16205871 0.16205871 0.16205871 0.16205871 0.16205871] [0.15308274 0.15308274 0.15308274 0.15308274 0.15308274] [0.22661096 0.22661096 0.22661096 0.22661096 0.22661096] [0.22634251 0.22634251 0.22634251 0.22634251 0.22634251] [0.23190508 0.23190508 0.23190508 0.23190508 0.23190508]] [[0.16205871 0.15308274 0.22661096 0.22634251 0.23190508] [0.16205871 0.15308274 0.22661096 0.22634251 0.23190508] [0.16205871 0.15308274 0.22661096 0.22634251 0.23190508] [0.16205871 0.15308274 0.22661096 0.22634251 0.23190508] [0.16205871 0.15308274 0.22661096 0.22634251 0.23190508]]

Finally, let’s bring that all together into the single formula to calculate the Jacobian derivative of the Softmax function is:

np.diag(S) - (S_matrix * np.transpose(S_matrix))

Example output:

array([[ 0.13579568, -0.02480839, -0.03672428, -0.03668077, -0.03758224], [-0.02480839, 0.12964841, -0.03469023, -0.03464913, -0.03550067], [-0.03672428, -0.03469023, 0.17525843, -0.05129169, -0.05255223], [-0.03668077, -0.03464913, -0.05129169, 0.17511158, -0.05248998], [-0.03758224, -0.03550067, -0.05255223, -0.05248998, 0.17812512]])

So that’s the Softmax function and it’s derivative.

In the next part, we’ll look at how to combine this derivative with the summation function, common in artificial neural networks.