The Softmax Function Derivative (Part 2)

In a previous post, I showed how to calculate the derivative of the Softmax function. This function is widely used in Artificial Neural Networks, typically in final layer in order to estimate the probability that the network’s input is in one of a number of classes.

In this post, I’ll show how to calculate the derivative of the whole Softmax Layer rather than just the function itself.

The Python code is based on the excellent article by Eli Bendersky which can be found here.

The Softmax Layer

A Softmax Layer in an Artificial Neural Network is typically composed of two functions. The first is the usual sum of all the weighted inputs to the layer. The output of this is then fed into the Softmax function which will output the probability distribution across the classes we are trying to predict. Here’s an example with three inputs and five classes:

For a given output zi, the calculation is very straightforward:


We simply multiply each input to the node by it’s corresponding weight. Expressing this in vector notation gives us the familiar:


The vector w is two dimensional so it’s actually a matrix and we can visualise the formula for our example as follows:

I’ve already covered the Softmax Function itself in the previous post, so I’ll just repeat it here for completeness:

\sigma(z_{i})=\frac{e^{z_{i}}} {\sum_{j=1}^K e^{z_{j}}}

Here’s the python code for that:

import numpy as np

# input vector
x = np.array([0.1,0.5,0.4])

# using some hard coded values for the weights
# rather than random numbers to illustrate how 
# it works
W = np.array([[0.1, 0.2, 0.3, 0.4, 0.5],
             [0.6, 0.7, 0.8, 0.9, 0.1],
             [0.11, 0.12, 0.13, 0.14, 0.15]])

# Softmax function
def softmax(Z):
    eZ = np.exp(Z)
    sm = eZ / np.sum(eZ)
    return sm

Z =, x)
h = softmax(Z)

Which should give us the output h (the hypothesis):

[0.19091352 0.20353145 0.21698333 0.23132428 0.15724743]

Calculating the Derivative

The Softmax layer is a combination of two functions, the summation followed by the Softmax function itself. Mathematically, this is usually written as:

h = S(Z(x))

The next thing to note is that we will be trying to calculate the change in the hypothesis h with respect to changes in the weights, not the inputs. The overall derivative of the layer that we are looking for is:

h' =  \frac{d\textbf{S}}{d\textbf{w}}

We can use the differential chain rule to calculate the derivative of the layer as follows:

\frac{d\textbf{S}}{d\textbf{w}} = \frac{d\textbf{S}}{d\textbf{Z}} \cdot  \frac{d\textbf{Z}}{d\textbf{w}}

In the previous post, I showed how to work out dS/dZ and just for completeness, here is a short Python function to carry out the calculation:

def sm_dir(S):
    S_vector = S.reshape(S.shape[0],1)
    S_matrix = np.tile(S_vector,S.shape[0])
    S_dir = np.diag(S) - (S_matrix * np.transpose(S_matrix))
    return S_dir

DS = sm_dir(h)

The output of that function is a matrix as follows:

[[ 0.154465 -0.038856 -0.041425 -0.044162 -0.030020]
 [-0.038856  0.162106 -0.044162 -0.047081 -0.032004]
 [-0.041425 -0.044162 0.1699015 -0.050193 -0.034120]
 [-0.044162 -0.047081 -0.050193  0.177813 -0.036375]
 [-0.030020 -0.032004 -0.034120 -0.036375  0.132520]]

Derivative of Z

Let’s next look at the derivative of the function Z() with respect to W, dZ/dW. We are trying to find the change in each of the elements of Z(), zk when each of the weights wij are changed.

So right away, we are going to need a matrix to hold all of those values. Let’s assume that the output vector of Z() has K elements. There are (i \cdot j) individual weights in W. Therefore, our matrix of derivatives is going to be of dimensions (K, (i \cdot j)). Each of the elements of the matrix will be a partial derivative of the output zk with respect to the particular weight wij:

\frac{\delta{S}}{\delta{x}} = \left [ \begin{array}{ccc} \frac{\delta{z_{1}}}{\delta{w_{11}}} & \ldots & \frac{\delta{z_{1}}} {\delta{w_{53}}} \\  \ldots & \frac{\delta{z_{k}}}{\delta{w_{ij}}} & \ldots \\ \frac{\delta{z_{K}}}{\delta{w_{11}}} & \ldots & \frac{\delta{z_{K}}} {\delta{w_{53}}} \end{array} \right ]

Taking one of those elements, using our example above, we can see how to work out the derivative:

z_{1} = w_{11}x_{1} + w_{12}x_{2} + w_{13}x_{3}

None of the other weights are used in z1. The partial derivative of z1 with respect to w11 is x1. Likewise, the partial derivative of z1 with respect to w12 is x2, and with respect to w13 is x3. The derivative of z1 with respect to the rest of the weights is 0.

This makes the whole matrix rather simple to derive, since it is mostly zeros. Where the elements are not zero (i.e. where i = k), then the value is xj. Here is the corresponding Python code to calculate that matrix.

# derivative of the Summation Function Z w.r.t weight matrix W given inputs x

def z_dir(Z, W, x):
    dir_matrix = np.zeros((W.shape[0] * W.shape[1], Z.shape[0]))
    for k in range(0, Z.shape[0]):
        for i in range(0, W.shape[1]):
            for j in range(0, W.shape[0]):
                if i == k:
                    dir_matrix[(i*W.shape[0]) + j][k] = x[j]
    return dir_matrix

If we use the example above, then the derivative matrix will look like this:

DZ = z_dir(Z, W, x)

[[0.1 0.  0.  0.  0. ]
 [0.5 0.  0.  0.  0. ]
 [0.4 0.  0.  0.  0. ]
 [0.  0.1 0.  0.  0. ]
 [0.  0.5 0.  0.  0. ]
 [0.  0.4 0.  0.  0. ]
 [0.  0.  0.1 0.  0. ]
 [0.  0.  0.5 0.  0. ]
 [0.  0.  0.4 0.  0. ]
 [0.  0.  0.  0.1 0. ]
 [0.  0.  0.  0.5 0. ]
 [0.  0.  0.  0.4 0. ]
 [0.  0.  0.  0.  0.1]
 [0.  0.  0.  0.  0.5]
 [0.  0.  0.  0.  0.4]]

Going back to the formula for the derivative of the Softmax Layer:

\frac{d\textbf{S}}{d\textbf{W}} = \frac{d\textbf{S}}{d\textbf{Z}} \cdot \frac{d\textbf{Z}}{d\textbf{W}}

We now just take the dot product of both of the derivative matrices to get the derivative for the whole layer:

DL =, np.transpose(DZ))

[[ 0.01544  0.07723  0.06178 -0.00388 -0.01942 -0.01554
  -0.00414 -0.02071 -0.01657 -0.00441 -0.02208 -0.01766
  -0.00300 -0.01501 -0.01200]
 [-0.00388 -0.01942 -0.01554  0.01621  0.0810   0.06484
  -0.00441 -0.02208 -0.01766 -0.00470 -0.02354 -0.01883
  -0.00320 -0.01600 -0.01280]
 [-0.00414 -0.02071 -0.01657 -0.00441 -0.02208 -0.01766
   0.01699  0.08495  0.06796 -0.00501 -0.02509 -0.02007
  -0.00341 -0.01706 -0.01364]
 [-0.00441 -0.02208 -0.01766 -0.00470 -0.02354 -0.01883
  -0.00501 -0.02509 -0.02007  0.01778  0.08890  0.07112
  -0.00363 -0.01818 -0.01455]
 [-0.00300 -0.01501 -0.01200 -0.00320 -0.01600 -0.01280
  -0.00341 -0.01706 -0.01364 -0.00363 -0.01818 -0.01455
   0.01325  0.06626  0.05300]]


While it is instructive to see the matrices being derived explicitly, it is possible to manipulate the formulas to make it easier. Starting with one of the entries in the matrix DL, it looks like this:

\frac{\delta{s_{t}}}{\delta{w_{ij}}} = \sum_{k=1}^K D_{k}S_{t} \cdot D_{ij}Z_{k}

Since the matrix dZ/dW is mostly zeros, then we can try to simplify it. dZ/dW is non-zero when i = k, and then it is equal to xj as we worked out above. So we can simplify the non-zero entries to:

\frac{\delta{s_{t}}}{\delta{w_{ij}}} = D_{i}S_{t}x_{j}

In the previous post, we established that when the indices are the same, then:

D_{i}S_{j} = S_{j}(1-S_{i})


\frac{\delta{s_{t}}}{\delta{w_{ij}}} = S_{t}(1-S_{i})x_{j}

When the indices are not the same, we use:

\frac{\delta{s_{t}}}{\delta{w_{ij}}} = S_{t}(0-S_{i})x_{j}

What these two formulas show is that it is possible to calculate each of the entries in the derivative matrix by using only the input values X and the Softmax output S, skipping the matrix dot product altogether.

Here is the Python code corresponding to that:

def l_dir_shortcut(W, S, x):
    dir_matrix = np.zeros((W.shape[0] * W.shape[1], W.shape[1]))
    for t in range(0, W.shape[1]):
        for i in range(0, W.shape[1]):
            for j in range(0, W.shape[0]):
                dir_matrix[(i*W.shape[0]) + j][t] = S[t] * ((i==t) - S[i]) * x[j]
    return dir_matrix

DL_shortcut = np.transpose(l_dir_shortcut(W, h, x))

To verify that, we can cross check it with the matrix we derived from first principle:


[[ 0.01544  0.07723  0.06178 -0.00388 -0.01942 -0.01554
  -0.00414 -0.02071 -0.01657 -0.00441 -0.02208 -0.01766
  -0.00300 -0.01501 -0.01200]
 [-0.00388 -0.01942 -0.01554  0.01621  0.08105  0.06484
  -0.00441 -0.02208 -0.01766 -0.00470 -0.02354 -0.01883
  -0.00320 -0.01600 -0.01280]
 [-0.00414 -0.02071 -0.01657 -0.00441 -0.02208 -0.01766
   0.01699  0.08495  0.06796 -0.00501 -0.02509 -0.02007
  -0.00341 -0.01706 -0.01364]
 [-0.00441 -0.02208 -0.01766 -0.00470 -0.02354 -0.01883
  -0.00501 -0.02509 -0.02007  0.01778  0.08890  0.07112
  -0.00363 -0.01818 -0.01455]
 [-0.00300 -0.01501 -0.01200 -0.00320 -0.01600 -0.01280
  -0.00341 -0.01706 -0.01364 -0.00363 -0.01818 -0.01455
   0.01325  0.06626  0.05300]]

Lastly, it’s worth noting that in order to actually modify each of the weights, we need to sum up the individual adjustments in each of the corresponding columns.

XOR Revisited: Keras and TensorFlow

A few weeks ago, it was announced that Keras would be getting official Google support and would become part of the TensorFlow machine learning library. Keras is a collection of high-level APIs in Python for creating and training neural networks, using either Theano or TensorFlow as the underlying engine.

Given my previous posts on implementing an XOR-solving neural network in a variety of different languages and tools, I thought it was time to see what it would look like in Keras.

XOR can be expressed as a classification problem that is best illustrated in a diagram. The goal is to create a neural network that will correctly predict the values 0 or 1, depending on the inputs x1 and x2 as shown.

xor graph

The neural network that is capable of being trained to solve that problem looks like this:


If you’d like to understand why this is the case, have a look at the detailed explanation in the posts implementing the solution in Octave.

So how does this look in Keras? Well it’s rather simple. Assuming you’ve already installed Keras, we’ll start with setting up the classification problem and the expected outputs:

import numpy as np

x = np.array([[0,0], [0,1], [1,0], [1,1]])
y = np.array([[0], [1], [1], [0]])

So far, so good. We’re using numpy arrays to store our inputs (x) and outputs (y). Now for the neural network definition:

from keras.models import Sequential
from keras.layers import Dense, Activation

model = Sequential()
model.add(Dense(2, input_shape=(2,)))

The Sequential model is simply a sequence of layers making up the network. Our diagram above has a set of inputs being fed into two processing layers. We’ve already defined the inputs, so all we need to do is add the other two layers.

In Keras, we’ll use Dense layers, which simply means they are is fully connected. The parameters indicate that the first layer has two nodes and the second layer has one node, corresponding to the diagram above.

The first layer also has the shape of the inputs which in this case is a one-dimensional vector with 2 elements. The second layer’s inputs will be inferred from the first layer.

We then add an Activation of type ‘sigmoid’ to each layer, again matching our neural network definition.

Note that Keras looks after the bias input without us having to explicitly code for it. In addition, Keras also looks after the weights (Θ1 and Θ2). This makes our neural network definition really straightforward and shows the benefits of using a high-level abstraction.

Finally, we apply a loss function and learning mode for Keras to be able to adjust the neural network:

model.compile(loss='mean_squared_error', optimizer='sgd', metrics=['accuracy'])

In this example, we’ll use the standard Mean Squared Error loss function and Stochastic Gradient Descent optimiser. And that’s it for the network definition.

If you want to see that the network looks like, use:


The network should look like this:

>> model.summary()
Layer (type)                  Output Shape         Param # 
dense_1 (Dense)               (None, 2)            6 
activation_1 (Activation)     (None, 2)            0 
dense_2 (Dense)               (None, 1)            3 
activation_2 (Activation)     (None, 1)            0 
Total params: 9
Trainable params: 9
Non-trainable params: 0

Now we just need to kick off the training of the network.,y, epochs=100000, batch_size=4)

All going well, the network weights will converge on a solution that can correctly classify the inputs (if not, you may need to up the number of epochs):

>>> model.predict(x, verbose=1)

4/4 [==============================] - 0s

array([[ 0.07856689],

       [ 0.91362464],

       [ 0.92543262],

       [ 0.06886736]], dtype=float32)


Clearly this network is on it’s way to converging on the original expected outputs we defined above (y).

So that’s all there is to a Keras version of the XOR-solving neural network. The fact that it is using TensorFlow as the engine is completely hidden and that makes implementing the network a lot simpler.


In Praise Of Reinventing The Wheel

The original Ferris Wheel at the 1893 World Columbian Exposition in Chicago (Source: Public Domain via Wikipedia)


There is a strong principle in software engineering of reusing code wherever possible. It’s considered so important, that it’s got its own TLA: DRY (“Don’t Repeat Yourself”). In other words, don’t reinvent the wheel.

There are obvious benefits to this. Nobody wants to type the same piece of code over and over. So write it once as a function and reuse it wherever you need to, freeing yourself up to build other stuff. Better still, reuse someone else’s functions.

Indeed, the ready availability of open-source libraries and frameworks for almost every popular requirement under the sun, means that you almost never have to develop your own low level software. Thank goodness. Software development would get really tedious if you had to start out writing basic I/O routines with each new project.

Even new developers can get sophisticated projects up and running fairly quickly. Github and StackOverflow have seen to that.

The principle is taken to an extreme by some web developers, who include the entire jQuery library on their webpage in order to implement some simple form validation. I’m all in favour of simplifying JavaScript development, second only to abolishing it altogether. But including the whole library is a tad overkill.

And this highlights a problem with treating DRY as dogma.

When we rely on existing libraries and frameworks, we don’t get a good idea of what’s really happening within our code base. Yes, we can read the library code to get the gist of what’s going on. But we never really do.

There’s a really good reason to try to implement your own routines, if only in your test projects. You’ll get an appreciation of the lower level algorithms that your applications depend on. You may even discover how difficult it is to get some of these algorithms to run quickly or to use a small amount of memory.

This will lead you to the core principles of algorithmic analysis that are fundamental to computer science. Along the way, you’ll pick up a really good sense of what works well and what doesn’t, what’s fast and what’s slow, what’s memory efficient and what’s a complete hog. Hopefully, you can reflect those lessons in your own production application code.

So the next time you have a chance, try to implement your own I/O routines or other low level algorithms and see how you compare to other published libraries.

Remember, DRY is a principle, not a dogma.

Sometimes there is merit in repeating yourself (or someone else).


Using Xcode with Github

You’ve found a nice open-source project you want to play with on GitHub. You’ve cloned it to your own repository and use Xcode 7 as your development environment. How do you make Xcode and GitHub play nicely with each other?

Turns out that Xcode has some nice features built in so that you can work directly with your GitHub-based code. To get started, open up the Preferences pane under the Xcode menu. Select the “Source Control” tab:

Source Control Preferences

Make sure that the “Enable Source Control” option is checked. Then select the Accounts tab:

Github Accounts

Click on the “+” at the bottom of the pane on the left. Select “Add Repository”. The following pane has several fields that you need to fill in.

  • Address: This is the URL of the repository. You can get this by clicking on the green “Clone or Download” button on the GitHub website.
  • Type: Choose “Git”
  • Authentication: Choose “User Name and Password”
  • User Name: Enter your GitHub user name
  • Password: Enter your GitHub password

I’ve added Google’s Protobuf and my own clone of TensorFlow in the example above.

Close the Preferences pane and select the “Source Control” menu.

Source Control Menu

This menu contains the controls you need to manage branches, commits and Pull Requests as necessary.

Xcode also lets you compare versions of code. In the top right of the main editor window, there is an icon with two arrows. Click on that and select “Comparison”. Xcode will show you the current version of your code against one in a different branch. You can choose which branch to compare against by clicking on the branch symbol below each editor pane.

Xcode Compare

I’ve literally scratched the surface here with using GitHub in Xcode 7. But it looks like it’s a straightforward way to play with the many open-source projects hosted there.



2015 Underhanded C Contest Results Released

I’ve just found out that the results of the 2015 Underhanded C Contest have been published and (YIPPEE!!), I managed to bag a runner-up spot.

This year, the competition was sponsored by the Nuclear Threat Initiative. The goal of the contest was to fake the results of a test of a nuclear warhead to show that a warhead contained fissile material, when in fact it did not.

Congratulations to the winner, Linus Åkesson. You can read about his entry on his blog. It’s also worth reading about the other entrants’ ideas for how to hide malicious code inside normal looking code.

Here is the NTI’s article about the contest:

Lastly, if you’re curious about my entry, I’ve posted the code and an explanation on GitHub: