The Softmax Function Derivative (Part 3)

Previously I’ve shown how to work out the derivative of the Softmax Function combined with the summation function, typical in artificial neural networks.

In this final part, we’ll look at how a Loss Function changes with respect to the weights in a Softmax layer. The Loss Function is a measure of how “bad” the estimate from the network is. We can then modify the weights in the network in order to improve the “Loss”, i.e. make it less bad.

The Python code is based on the excellent article by Eli Bendersky which can be found here.

Cross Entropy Loss Function

There are different kinds of Cross Entropy functions, depending on what kind of classification you want your network to estimate. In this example, we’re going to use the Categorical Cross Entropy. This function is typically used when the network is required to estimate which class something belongs to, when there are many classes. The output of the Softmax Function is a vector of probabilities; each element represents the network’s estimate of the probability that the input is in that class. For example:

[0.19091352 0.20353145 0.21698333 0.23132428 0.15724743]

The first element, 0.19091352, represents the network’s estimate that the input is in the first class, and so on.

Usually, the input is in one class, and we can represent the correct class for an input as a one-hot vector. In other words, the class vector is all zeros, except for a 1 in the index corresponding to the class.

[0 0 1 0 0]

In this example, the input is in class 3, represented by a 1 in the third element.

The multi-class Cross Entropy Function is defined as follows:

-\sum_{c=1}^{M} y_{o,c} \textup{ log}(S_{o,c})

where M is the number of classes, y_{o,c} is the element of the one-hot vector for the observation o (i.e. the input), which is 1 if c is the correct class and 0 otherwise, and S_{o,c} is the Softmax output for the class c for the observation o. Here is some code to calculate that (which continues from my previous posts on this topic):

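import numpy as np

def x_entropy(y, S):
    # Categorical Cross Entropy: y is the one-hot target, S is the Softmax output
    return np.sum(-1 * y * np.log(S))

y = np.zeros(5)
y[2] = 1   # picking the third class for example purposes
xe = x_entropy(y, S)   # S is the Softmax output from the previous posts
print(xe)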

1.5279347484961026
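If you don’t have the code from the earlier posts to hand, here is a minimal self-contained sketch of the same calculation. It assumes the example Softmax output shown above, typed in by hand (so rounded to eight decimal places), and it also shows that with a one-hot y the sum collapses to the negative log of the probability given to the correct class.

```python
import numpy as np

def x_entropy(y, S):
    # Sum of -y * log(S) over all the classes
    return np.sum(-1 * y * np.log(S))

# Example Softmax output from above (rounded values, copied by hand)
S = np.array([0.19091352, 0.20353145, 0.21698333, 0.23132428, 0.15724743])

y = np.zeros(5)
y[2] = 1  # the third class is the correct one

xe = x_entropy(y, S)
print(xe)             # approximately 1.52793
print(-np.log(S[2]))  # the same value: only the target class contributes
```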

Cross Entropy Derivative

Just like the other derivatives we’ve looked at before, the Cross-Entropy derivative is a vector of partial derivatives with respect to its input:

\frac{\Delta XE}{\Delta S} = \left[  \frac{\delta XE}{\delta S_{1}} \frac{\delta XE}{\delta S_{2}} \ldots \frac{\delta XE}{\delta S_{t}}  \right]

We can make this a little simpler by observing that since Y (i.e. the ground truth classification vector) is zeros, except for the target class, c, then the Cross Entropy derivative vector is also going to be zeros, except for the class c.

To see why this is the case, let’s examine the Cross Entropy function itself. We calculate it by summing up products, each of which is a value from Y multiplied by the log of the corresponding value from S. Since every element in Y is 0 except for the target class, c, every other product is 0 no matter how much we change the values in S, so the corresponding partial derivatives are also 0.

Therefore:

\frac{\Delta XE}{\Delta S} = \left[ \ldots \frac{\delta XE}{\delta S_{c}} \ldots \right]

We can rewrite this a little, expanding out the XE function:

\frac{\Delta XE}{\Delta S} = \left[ \ldots \frac{\delta -(Y_{c}\textup{log}(S_{c}))}{\delta S_{c}} \ldots \right]

We already know that Y_{c} is 1, so we are left with:

\frac{\Delta XE}{\Delta S} = \left[ \ldots \frac{\delta -\textup{log}(S_{c})}{\delta S_{c}} \ldots \right]

So we are just looking for the derivative of the negative log of S_{c}. Since the derivative of log(x) is 1/x, this gives:

\frac{\Delta XE}{\Delta S} = \left[ \ldots -\frac{1}{S_{c}} \ldots \right]

The rest of the elements in the vector will be 0. Here is the code that works that out:

def xe_dir(y, S):
    return (-1 / S) * y

DXE = xe_dir(y, S)
print(DXE)

[-0.      -0.      -4.60864 -0.      -0.     ]
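One way to gain some confidence in this result is to compare it against a numerical estimate. The sketch below is only an illustration, not part of the original code: it nudges each element of S up and down by a small amount (eps is an arbitrary choice of mine) and measures how the loss changes. Like the derivation above, it treats each element of S as an independent variable. The S values are the rounded example values from earlier.

```python
import numpy as np

def x_entropy(y, S):
    return np.sum(-1 * y * np.log(S))

def xe_dir(y, S):
    return (-1 / S) * y

# Example Softmax output (rounded values, copied by hand)
S = np.array([0.19091352, 0.20353145, 0.21698333, 0.23132428, 0.15724743])
y = np.zeros(5)
y[2] = 1

eps = 1e-6  # step size, chosen arbitrarily for this sketch
numeric = np.zeros_like(S)
for k in range(S.size):
    step = np.zeros_like(S)
    step[k] = eps
    # Central difference: change in loss when only S[k] is nudged
    numeric[k] = (x_entropy(y, S + step) - x_entropy(y, S - step)) / (2 * eps)

print(numeric)        # numerical estimate
print(xe_dir(y, S))   # analytic result, -1/S at the target class, 0 elsewhere
```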

Bringing it all together

When we have a neural network layer, we want to change the weights in order to make the loss as small as possible. So we are trying to calculate:

\frac{\Delta XE}{\Delta W}

for each of the input instances X. Since XE is a function that depends on the Softmax function, which itself depends on the summation function in the neurons, we can use the calculus chain rule as follows:

\frac{\Delta XE}{\Delta W} = \frac{\Delta XE}{\Delta S} \cdot \frac{\Delta S}{\Delta Z} \cdot \frac{\Delta Z}{\Delta W}

In this post, we’ve calculated \frac{\Delta XE}{\Delta S} and in the previous posts, we calculated \frac{\Delta S}{\Delta Z} and \frac{\Delta Z}{\Delta W}. To calculate the overall changes to the weights, we simply carry out a dot product of all those matrices:

print(np.dot(DXE, DL_shortcut).reshape(W.shape))

[[ 0.01909135  0.09545676  0.07636541  0.02035314  0.10176572]
 [ 0.08141258 -0.07830167 -0.39150833 -0.31320667  0.02313243]
 [ 0.11566214  0.09252971  0.01572474  0.07862371  0.06289897]]
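The snippet above relies on DL_shortcut from the previous posts. As a rough, self-contained illustration of the same chain rule, the sketch below builds each of the three factors explicitly and multiplies them. Everything in it is an assumption made for the example: the weights and input are random, W is stored as (classes, inputs) with Z = W @ x, and the weight matrix is flattened row by row. That layout may differ from the one used in the earlier posts.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / np.sum(e)

rng = np.random.default_rng(0)
n_classes, n_inputs = 5, 3
W = rng.normal(size=(n_classes, n_inputs))  # hypothetical weights
x = rng.normal(size=n_inputs)               # hypothetical input
y = np.zeros(n_classes)
y[2] = 1

S = softmax(W @ x)

dXE_dS = -y / S                                       # shape (5,)
dS_dZ = np.diag(S) - np.outer(S, S)                   # shape (5, 5)
dZ_dW = np.kron(np.eye(n_classes), x.reshape(1, -1))  # shape (5, 15)

# Chain rule: multiply the three factors together
chain = dXE_dS @ dS_dZ @ dZ_dW                        # shape (15,)
print(chain.reshape(n_classes, n_inputs))
```

Row i of the printed array holds the derivatives for the weights feeding class i, under the layout assumed here.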

Shortcut

Now that we’ve seen how to calculate the individual parts of the derivative, we can look for a shortcut that avoids all that matrix multiplication, especially since so many of the elements are zeros.

Previously, we had established that the elements in the matrix \frac{\Delta S}{\Delta W} can be calculated using:

\frac{\delta{S_{t}}}{\delta{W_{ij}}} = S_{t}(1-S_{i})x_{j}

where t = i, i.e. the Softmax output and the weight belong to the same class, and

\frac{\delta{S_{t}}}{\delta{W_{ij}}} = S_{t}(0-S_{i})x_{j}

where t and i are different.

Using this result, we can see that an element in the derivative of the Cross Entropy function XE, with respect to the weights W is (swapping c for t):

\frac{\delta{XE_{c}}}{\delta{W_{ij}}} = \frac{\delta{XE_{c}}}{\delta{S_{c}}} \cdot S_{c}(1-S_{i})x_{j}

We’ve shown above that the derivative of XE with respect to S is just -\frac{1}{S_{c}}. So each element in the derivative where i = c becomes:

\frac{\delta{XE_{c}}}{\delta{W_{ij}}} = -\frac{1}{S_{c}} \cdot S_{c}(1-S_{i})x_{j}

This simplifies to:

\frac{\delta{XE_{c}}}{\delta{W_{ij}}} = (S_{i}-1)x_{j}

Similarly, where i and c are different, the factor is S_{c}(0-S_{i})x_{j} and the derivative simplifies to:

\frac{\delta{XE_{c}}}{\delta{W_{ij}}} = (S_{i})x_{j}

Since y_{i} is 1 when i = c and 0 otherwise, both cases can be written as (S_{i}-y_{i})x_{j}. Here is the corresponding Python code for that:

def xe_dir_shortcut(W, S, x, y):
    dir_matrix = np.zeros((W.shape[0] * W.shape[1]))
    
    for i in range(0, W.shape[1]):
        for j in range(0, W.shape[0]):
            dir_matrix[(i*W.shape[0]) + j] = (S[i] - y[i]) * x[j]
                
    return dir_matrix

delta_w = xe_dir_shortcut(W, S, x, y)

Let’s verify that this gives us the same results as the longer matrix multiplication above:

print(delta_w.reshape(W.shape))

[[ 0.01909135  0.09545676  0.07636541  0.02035314  0.10176572]
 [ 0.08141258 -0.07830167 -0.39150833 -0.31320667  0.02313243]
 [ 0.11566214  0.09252971  0.01572474  0.07862371  0.06289897]]
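The double loop can also be written as a single outer product. The sketch below is one possible vectorised version, not the code from the original article. The x values and the all-zeros W are placeholders of mine (only the shape of W is used), and S is the rounded example vector from earlier.

```python
import numpy as np

def xe_dir_shortcut(W, S, x, y):
    # Loop version, as above
    dir_matrix = np.zeros((W.shape[0] * W.shape[1]))
    for i in range(0, W.shape[1]):
        for j in range(0, W.shape[0]):
            dir_matrix[(i * W.shape[0]) + j] = (S[i] - y[i]) * x[j]
    return dir_matrix

S = np.array([0.19091352, 0.20353145, 0.21698333, 0.23132428, 0.15724743])
y = np.zeros(5)
y[2] = 1
x = np.array([0.1, 0.5, 0.4])  # hypothetical input values
W = np.zeros((3, 5))           # placeholder: only W.shape is used

loop_result = xe_dir_shortcut(W, S, x, y)

# One row per class, one column per input: element [i, j] is (S[i] - y[i]) * x[j]
vectorised = np.outer(S - y, x)

print(np.allclose(loop_result, vectorised.ravel()))
```

Note that the loop fills the flat vector class by class, so its natural two-dimensional shape is (classes, inputs), i.e. 5 by 3 here. If your W is stored as (inputs, classes), it is worth checking whether you need the transpose, np.outer(x, S - y), before applying the changes to the weights.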

Now we have a simple function that calculates the changes to the weights for a seemingly complicated single layer of a neural network.
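To close the loop on the goal stated at the start, making the Loss “less bad”, here is a rough sketch of a single gradient descent step using the shortcut. It is an illustration under my own assumptions rather than anything from the earlier posts: random starting weights stored as (classes, inputs) with Z = W @ x, a made-up input, and an arbitrary learning rate of 0.1.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / np.sum(e)

def x_entropy(y, S):
    return np.sum(-1 * y * np.log(S))

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 3))    # hypothetical weights, (classes, inputs)
x = np.array([0.1, 0.5, 0.4])  # hypothetical input
y = np.zeros(5)
y[2] = 1
learning_rate = 0.1            # arbitrary choice for this sketch

S = softmax(W @ x)
loss_before = x_entropy(y, S)

# Shortcut derivative, element [i, j] is (S[i] - y[i]) * x[j]
delta_w = np.outer(S - y, x)

# Move the weights a small step against the derivative
W_new = W - learning_rate * delta_w
loss_after = x_entropy(y, softmax(W_new @ x))

print(loss_before, loss_after)  # the loss after the step is smaller
```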