In a previous post, I showed how to calculate the derivative of the Softmax function. This function is widely used in Artificial Neural Networks, typically in the final layer in order to estimate the probability that the network's input belongs to one of a number of classes.
In this post, I’ll show how to calculate the derivative of the whole Softmax Layer rather than just the function itself.
The Python code is based on the excellent article by Eli Bendersky which can be found here.
The Softmax Layer

For a given output z_k, the calculation is very straightforward: we simply multiply each input to the node by its corresponding weight and add up the results,

z_k = w_k1·x_1 + w_k2·x_2 + w_k3·x_3

Expressing this in vector notation gives us the familiar Z = Wᵀ·x. The weights are two dimensional, so w is actually a matrix; in our example it holds one column of three weights for each of the five output nodes, exactly as it is laid out in the code below.
I've already covered the Softmax Function itself in the previous post, so I'll just repeat it here for completeness:

S(Z)_i = e^(z_i) / Σ_j e^(z_j)

Here's the Python code for that:
import numpy as np
# input vector
x = np.array([0.1,0.5,0.4])
# using some hard coded values for the weights
# rather than random numbers to illustrate how
# it works
W = np.array([[0.1, 0.2, 0.3, 0.4, 0.5],
              [0.6, 0.7, 0.8, 0.9, 0.1],
              [0.11, 0.12, 0.13, 0.14, 0.15]])
# Softmax function
def softmax(Z):
    eZ = np.exp(Z)
    sm = eZ / np.sum(eZ)
    return sm
Z = np.dot(np.transpose(W), x)
h = softmax(Z)
print(h)
Which should give us the output h (the hypothesis):
[0.19091352 0.20353145 0.21698333 0.23132428 0.15724743]
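As an aside (not something this small example needs), np.exp can overflow when the inputs to the layer are large. A common variant, sketched below, subtracts the maximum element first; the extra factor cancels in the ratio, so the output is unchanged:

# Numerically stable variant of the same function (a sketch, not part of the
# original article): subtracting max(Z) leaves the result unchanged because
# the common factor e^(-max(Z)) cancels in the ratio.
def softmax_stable(Z):
    eZ = np.exp(Z - np.max(Z))
    return eZ / np.sum(eZ)

print(softmax_stable(Z))  # same values as h above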
Calculating the Derivative
The Softmax layer is a combination of two functions: the summation, followed by the Softmax function itself. Mathematically, this is usually written as:

h = S(Z(x, W)) = S(Wᵀ·x)

The next thing to note is that we will be trying to calculate the change in the hypothesis h with respect to changes in the weights, not the inputs. The overall derivative of the layer that we are looking for is:

dh/dW

We can use the chain rule to calculate the derivative of the layer as follows:

dh/dW = dS/dZ · dZ/dW
In the previous post, I showed how to work out dS/dZ and just for completeness, here is a short Python function to carry out the calculation:
# derivative dS/dZ of the Softmax function, computed from its output S:
# diag(S) - S·Sᵀ
def sm_dir(S):
    S_vector = S.reshape(S.shape[0], 1)
    S_matrix = np.tile(S_vector, S.shape[0])
    S_dir = np.diag(S) - (S_matrix * np.transpose(S_matrix))
    return S_dir
DS = sm_dir(h)
print(DS)
The output of that function is a matrix as follows:
[[ 0.154465 -0.038856 -0.041425 -0.044162 -0.030020]
[-0.038856 0.162106 -0.044162 -0.047081 -0.032004]
[-0.041425 -0.044162 0.1699015 -0.050193 -0.034120]
[-0.044162 -0.047081 -0.050193 0.177813 -0.036375]
[-0.030020 -0.032004 -0.034120 -0.036375 0.132520]]
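As a quick sanity check (not part of the original article), the same matrix can be approximated numerically by nudging each element of Z and watching how the Softmax output changes:

# Finite-difference check of dS/dZ (a sketch): perturb each z_j by a small
# epsilon and compare the resulting change in softmax(Z) with column j of DS.
eps = 1e-6
DS_num = np.zeros_like(DS)
for col in range(Z.shape[0]):
    Z_plus = Z.copy()
    Z_plus[col] += eps
    DS_num[:, col] = (softmax(Z_plus) - softmax(Z)) / eps
print(np.allclose(DS, DS_num, atol=1e-5))  # True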
Derivative of Z
Let's next look at the derivative of the function Z() with respect to W, dZ/dW. We are trying to find the change in each of the elements of Z(), z_k, when each of the weights w_ij is changed.
So right away, we are going to need a matrix to hold all of those values. Let's assume that the output vector of Z() has K elements and that there are (i × j) individual weights in W. Our matrix of derivatives is therefore going to be of dimensions (K, (i × j)). Each of the elements of the matrix will be a partial derivative of the output z_k with respect to a particular weight w_ij:

∂z_k/∂w_ij
Taking one of those elements, using our example above, we can see how to work out the derivative. For the first output,

z1 = w11·x1 + w12·x2 + w13·x3
None of the other weights are used in z1. The partial derivative of z1 with respect to w11 is x1. Likewise, the partial derivative of z1 with respect to w12 is x2, and with respect to w13 is x3. The derivative of z1 with respect to the rest of the weights is 0.
This makes the whole matrix rather simple to derive, since it is mostly zeros. Where the elements are not zero (i.e. where i = k), the value is xj. Here is the corresponding Python code to calculate that matrix.
# derivative of the Summation Function Z w.r.t. weight matrix W given inputs x
def z_dir(Z, W, x):
    dir_matrix = np.zeros((W.shape[0] * W.shape[1], Z.shape[0]))
    for k in range(0, Z.shape[0]):
        for i in range(0, W.shape[1]):
            for j in range(0, W.shape[0]):
                if i == k:
                    dir_matrix[(i*W.shape[0]) + j][k] = x[j]
    return dir_matrix
If we use the example above, then the derivative matrix will look like this:
DZ = z_dir(Z, W, x)
print(DZ)
[[0.1 0. 0. 0. 0. ]
[0.5 0. 0. 0. 0. ]
[0.4 0. 0. 0. 0. ]
[0. 0.1 0. 0. 0. ]
[0. 0.5 0. 0. 0. ]
[0. 0.4 0. 0. 0. ]
[0. 0. 0.1 0. 0. ]
[0. 0. 0.5 0. 0. ]
[0. 0. 0.4 0. 0. ]
[0. 0. 0. 0.1 0. ]
[0. 0. 0. 0.5 0. ]
[0. 0. 0. 0.4 0. ]
[0. 0. 0. 0. 0.1]
[0. 0. 0. 0. 0.5]
[0. 0. 0. 0. 0.4]]
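Incidentally, and this is not something the original derivation relies on, the structure of that matrix is exactly a Kronecker product: each column k contains a copy of x in the rows belonging to node k's weights and zeros everywhere else. A minimal check:

# dZ/dW written as a Kronecker product (a sketch, equivalent to z_dir above):
# column k holds x in the rows for node k's weights, zeros elsewhere.
DZ_kron = np.kron(np.eye(Z.shape[0]), x.reshape(-1, 1))
print(np.allclose(DZ, DZ_kron))  # True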
Going back to the formula for the derivative of the Softmax Layer:

dh/dW = dS/dZ · dZ/dW
We now just take the dot product of both of the derivative matrices to get the derivative for the whole layer:
DL = np.dot(DS, np.transpose(DZ))
print(DL)
[[ 0.01544 0.07723 0.06178 -0.00388 -0.01942 -0.01554
-0.00414 -0.02071 -0.01657 -0.00441 -0.02208 -0.01766
-0.00300 -0.01501 -0.01200]
[-0.00388 -0.01942 -0.01554 0.01621 0.08105 0.06484
-0.00441 -0.02208 -0.01766 -0.00470 -0.02354 -0.01883
-0.00320 -0.01600 -0.01280]
[-0.00414 -0.02071 -0.01657 -0.00441 -0.02208 -0.01766
0.01699 0.08495 0.06796 -0.00501 -0.02509 -0.02007
-0.00341 -0.01706 -0.01364]
[-0.00441 -0.02208 -0.01766 -0.00470 -0.02354 -0.01883
-0.00501 -0.02509 -0.02007 0.01778 0.08890 0.07112
-0.00363 -0.01818 -0.01455]
[-0.00300 -0.01501 -0.01200 -0.00320 -0.01600 -0.01280
-0.00341 -0.01706 -0.01364 -0.00363 -0.01818 -0.01455
0.01325 0.06626 0.05300]]
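Again as a sanity check (not in the original article), we can nudge a single weight and confirm that the change in h matches the corresponding column of DL:

# Finite-difference spot check of DL (a sketch): perturb one weight and
# compare the change in the hypothesis with the matching column of DL.
eps = 1e-6
node, inp = 2, 1                     # weight for output node `node`, input `inp`
W_plus = W.copy()
W_plus[inp][node] += eps             # the numpy array stores weights as W[input][node]
h_plus = softmax(np.dot(np.transpose(W_plus), x))
col = (node * W.shape[0]) + inp      # same column indexing as z_dir
print((h_plus - h) / eps)            # ≈ DL[:, col]
print(DL[:, col])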
Shortcut!
While it is instructive to see the matrices being derived explicitly, it is possible to manipulate the formulas to make things easier. Starting with one of the entries in the matrix DL, it looks like this:

∂h_t/∂w_ij = Σ_k (∂S_t/∂z_k)·(∂z_k/∂w_ij)

Since the matrix dZ/dW is mostly zeros, we can simplify this. dZ/dW is non-zero only when i = k, and then it is equal to xj, as we worked out above. So the entries simplify to:

∂h_t/∂w_ij = (∂S_t/∂z_i)·x_j

In the previous post, we established that when the indices are the same (i = t), then:

∂S_t/∂z_i = S_t·(1 - S_t)

So:

∂h_t/∂w_ij = S_t·(1 - S_t)·x_j

When the indices are not the same (i ≠ t), we use:

∂S_t/∂z_i = -S_t·S_i,  giving  ∂h_t/∂w_ij = -S_t·S_i·x_j
What these two formulas show is that it is possible to calculate each of the entries in the derivative matrix by using only the input values X and the Softmax output S, skipping the matrix dot product altogether.
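As a quick check on those formulas, using the numbers from our example (this check isn't in the original article), the top-left entry of DL is the case t = i = 1, j = 1:

∂h_1/∂w_11 = S_1·(1 - S_1)·x_1 ≈ 0.1909 × 0.8091 × 0.1 ≈ 0.01545

which agrees with the 0.01544 in the first row of DL above, up to rounding in the printed output.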
Here is the Python code corresponding to that:
# derivative of the whole layer w.r.t. W, using the shortcut formulas above
def l_dir_shortcut(W, S, x):
    dir_matrix = np.zeros((W.shape[0] * W.shape[1], W.shape[1]))
    for t in range(0, W.shape[1]):
        for i in range(0, W.shape[1]):
            for j in range(0, W.shape[0]):
                dir_matrix[(i*W.shape[0]) + j][t] = S[t] * ((i==t) - S[i]) * x[j]
    return dir_matrix
DL_shortcut = np.transpose(l_dir_shortcut(W, h, x))
To verify that, we can cross-check it with the matrix we derived from first principles:
print(DL_shortcut)
[[ 0.01544 0.07723 0.06178 -0.00388 -0.01942 -0.01554
-0.00414 -0.02071 -0.01657 -0.00441 -0.02208 -0.01766
-0.00300 -0.01501 -0.01200]
[-0.00388 -0.01942 -0.01554 0.01621 0.08105 0.06484
-0.00441 -0.02208 -0.01766 -0.00470 -0.02354 -0.01883
-0.00320 -0.01600 -0.01280]
[-0.00414 -0.02071 -0.01657 -0.00441 -0.02208 -0.01766
0.01699 0.08495 0.06796 -0.00501 -0.02509 -0.02007
-0.00341 -0.01706 -0.01364]
[-0.00441 -0.02208 -0.01766 -0.00470 -0.02354 -0.01883
-0.00501 -0.02509 -0.02007 0.01778 0.08890 0.07112
-0.00363 -0.01818 -0.01455]
[-0.00300 -0.01501 -0.01200 -0.00320 -0.01600 -0.01280
-0.00341 -0.01706 -0.01364 -0.00363 -0.01818 -0.01455
0.01325 0.06626 0.05300]]
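For what it's worth, the three loops above can also be collapsed into a single Kronecker product, since every entry is just DS[t][i] multiplied by x[j] (again, a sketch rather than part of the original article):

# Vectorised form of the shortcut: DL[t, i*len(x) + j] = DS[t][i] * x[j]
DL_kron = np.kron(sm_dir(h), x)
print(np.allclose(DL_shortcut, DL_kron))  # True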
Lastly, it's worth noting that each column of DL corresponds to a single weight, with one row per output node. In order to actually modify a weight, we need to sum up the individual adjustments in its corresponding column.
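A minimal sketch of that last bit of bookkeeping (assuming we simply sum the raw entries; in training, each row would first be scaled by the loss gradient for that output):

# Sum each column of DL to get one adjustment per weight, then reshape the
# flat (node, input) ordering back into the same layout as W.
adjustments = DL.sum(axis=0)                                   # shape (15,)
adjustments_W = adjustments.reshape(W.shape[1], W.shape[0]).T  # shape (3, 5), like W
print(adjustments_W.shape)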