Solving MNIST with a Neural Network from the ground up

Note: Here’s the Python source code for this project in a Jupyter notebook on GitHub

I’ve written before about the benefits of reinventing the wheel and this is one of those occasions where it was definitely worth the effort. Sometimes, there is just no substitute for trying to implement an algorithm to really understand what’s going on under the hood. This is especially true when learning about artificial neural networks. Sure, there are plenty of frameworks available that you can use which implement any flavour of neural network, complete with a dazzling arrays of optimisations, activations and loss functions. That may solve your problem, but it abstracts away a lot of the details about why it solves it.

MNIST is a great dataset to start with. It’s a collection of images containing 60,000 handwritten digits. It also contains a further 10,000 images that can be used as the test set. It’s been well studied and most frameworks have sample implementations. Here’s an example image:

You can find the full dataset of images on Yann Le Cun’s website.

While it’s useful to reinvent the wheel, we should at least learn from those that have already built wheels before. The first thing I borrowed was the network architecture from TensorFlow. Their example has:

  • 28×28 input
  • a hidden layer with 512 neurons with ReLU activation
  • an output layer with 10 neutrons (representing the 10 possible digits) with Softmax activation
  • Cross-Entropy loss function

The next thing to work on was the feedforward part of the network. This is relatively straightforward as these functions are well documented online and the network itself isn’t complicated.

The tough part was working through the back-propagation algorithm. In a previous post, I detailed how to work out the derivatives of the Softmax function and the Cross Entropy loss. The most obvious way is to use the Chain Rule in Differential Calculus to work out the gradients and propagate them back through the network. The steps are pleasing to my eye and appeal to my sense of order in code. (Tip: Use a spreadsheet on a small example network to see the actual matrices in action.)

But (and it’s a big but), the basic approach uses Jacobian matrices. Each cell in these kind of matrices is a partial derivative; each matrix represents a change in every variable with respect to every output. As a result, they can grow rather large very quickly. We run into several issues multiplying very large matrices together. In the notebook, I’ve left the functions representing this approach in for comparison and if you do run it, you’ll notice immediately the problems with speed and memory.

Luckily there are shortcuts, which mean that we can directly calculate the gradients without resorting to Jacobian matrix multiplication. You can see these in the Short Form section of the notebook. In a sense though, these are abstractions too and it’s difficult to see the back-propagation from the shortcut methods.

Lastly, I’ve implemented some code to gather the standard metrics for evaluating how good a machine learning model is. I’ve run it several times and it usually gets an overall accuracy score of between 92% and 95% on the MNIST test dataset.

One of the main things I learned from this exercise is that the actual coding of a network is relatively simple. The really hard part that took a while was figuring out the calculus and especially the shortcuts. I really appreciate now why those frameworks are popular and make coding neural networks so much easier.

If you fancy a challenge, I can recommend working on a neural network from first principles. You never know what you might learn!


The Softmax Function Derivative (Part 3)

Previously I’ve shown how to work out the derivative of the Softmax Function combined with the summation function, typical in artificial neural networks.

In this final part, we’ll look at how the weights in a Softmax layer change in respect to a Loss Function. The Loss Function is a measure of how “bad” the estimate from the network is. We’ll then be modifying the weights in the network in order to improve the “Loss”, i.e. make it less bad.

The Python code is based on the excellent article by Eli Bendersky which can be found here.

Cross Entropy Loss Function

There are different kinds Cross Entropy functions depending on what kind of classification that you want your network to estimate. In this example, we’re going to use the Categorical Cross Entropy. This function is typically used when the network is required to estimate which class something belongs to, when there are many classes. The output of the Softmax Function is a vector of probabilities, each element represents the network’s estimate that the input is in that class. For example:

[0.19091352 0.20353145 0.21698333 0.23132428 0.15724743]

The first element, 0.19091352, represents the network’s estimate that the input is in the first class, and so on.

Usually, the input is in one class, and we can represent the correct class for an input as a one-hot vector. In other words, the class vector is all zeros, except for a 1 in the index corresponding to the class.

[0 0 1 0 0]

In this example, the input is in class 3, represented by a 1 in the third element.

The multi-class Cross Entropy Function is defined as follows:

-\sum_{c=1}^M=y_{o,c} \textup{ log}(S_{o,c})

where M is the number of classes, y is the one-hot vector representing the correct classification c for the observation o (i.e. the input). S is the Softmax output for the class c for the observation o. Here is some code to calculate that (which continues from my previous posts on this topic):

def x_entropy(y, S):
    return np.sum(-1 * y * np.log(S))

y = np.zeros(5)
y[2] = 1   # picking the third class for example purposes
xe = x_entropy(y, S)


Cross Entropy Derivative

Just like the other derivatives we’ve looked at before, the Cross-Entropy derivative is a vector of partial derivatives with respect to it’s input:

\frac{\Delta XE}{\Delta S} = \left[  \frac{\delta XE}{\delta S_{1}} \frac{\delta XE}{\delta S_{2}} \ldots \frac{\delta XE}{\delta S_{t}}  \right]

We can make this a little simpler by observing that since Y (i.e. the ground truth classification vector) is zeros, except for the target class, c, then the Cross Entropy derivative vector is also going to be zeros, except for the class c.

To see why this is the case, let’s examine the Cross Entropy function itself. We calculate it by summing up a product. Each product is the value from Y multiplied by the log of the corresponding value from S. Since all the elements in Y are actually 0 (except for the target class, c), then the corresponding derivative will also be 0. No matter how much we change the values in S, the result will still be 0.


\frac{\Delta XE}{\Delta S} = \left[ \ldots \frac{\delta XE}{\delta S_{t}} \ldots \right]

We can rewrite this a little, expanding out the XE function:

\frac{\Delta XE}{\Delta S} = \left[ \ldots \frac{\delta -(Y_{c}\textup{log}(S_{c}))}{\delta S_{c}} \ldots \right]

We already know that Y_{c} is 1, so we are left with:

\frac{\Delta XE}{\Delta S} = \left[ \ldots \frac{\delta -\textup{log}(S_{c})}{\delta S_{c}} \ldots \right]

So we are just looking for the derivative of the log of S_{c}:

\frac{\Delta XE}{\Delta S} = \left[ \ldots -\frac{1}{S_{c}} \ldots \right]

The rest of the elements in the vector will be 0. Here is the code that works that out:

def xe_dir(y, S):
    return (-1 / S) * y

DXE = xe_dir(y, S)

[-0.      -0.      -4.60864 -0.      -0.     ]

Bringing it all together

When we have a neural network layer, we want to change the weights in order to make the loss as small as possible. So we are trying to calculate:

\frac{\Delta XE}{\Delta W}

for each of the input instances X. Since XE is a function that depends on the Softmax function, which itself depends on the summation function in the neurons, we can use the calculus chain rule as follows:

\frac{\Delta XE}{\Delta W} = \frac{\Delta XE}{\Delta S} \cdot \frac{\Delta S}{\Delta Z} \cdot \frac{\Delta Z}{\Delta W}

In this post, we’ve calculated \frac{\Delta XE}{\Delta S} and in the previous posts, we calculated \frac{\Delta S}{\Delta Z} and \frac{\Delta Z}{\Delta W}. To calculate the overall changes to the weights, we simply carry out a dot product of all those matrices:

print(, DL_shortcut).reshape(W.shape))

[[ 0.01909135  0.09545676  0.07636541  0.02035314  0.10176572]
 [ 0.08141258 -0.07830167 -0.39150833 -0.31320667  0.02313243]
 [ 0.11566214  0.09252971  0.01572474  0.07862371  0.06289897]]


Now that we’ve seen how to calculate the individual parts of the derivative, we can now look to see if there is a shortcut that avoids all that matrix multiplication, especially since there are lots of zeros in the elements.

Previously, we had established that the elements in the matrix \frac{\Delta S}{\Delta W} can be calculated using:

\frac{\delta{S_{t}}}{\delta{W_{ij}}} = S_{t}(1-S_{i})x_{j}

where the input and output indices are the same, and

\frac{\delta{S_{t}}}{\delta{W_{ij}}} = S_{t}(0-S_{i})x_{j}

where they are different.

Using this result, we can see that an element in the derivative of the Cross Entropy function XE, with respect to the weights W is (swapping c for t):

\frac{\delta{XE_{c}}}{\delta{W_{ij}}} = \frac{\delta{XE_{c}}}{\delta{S_{c}}} \cdot S_{c}(1-S_{i})x_{j}

We’ve shown above that the derivative of XE with respect to S is just -\frac{1}{S_{c}}. So each element in the derivative where i = c becomes:

\frac{\delta{XE_{c}}}{\delta{W_{ij}}} = -\frac{1}{S_{c}} \cdot S_{c}(1-S_{i})x_{j}

This simplifies to:

\frac{\delta{XE_{c}}}{\delta{W_{ij}}} = (S_{i}-1)x_{j}

Similarly, where i <> c:

\frac{\delta{XE_{c}}}{\delta{W_{ij}}} = (S_{i})x_{j}

Here is the corresponding Python code for that:

def xe_dir_shortcut(W, S, x, y):
    dir_matrix = np.zeros((W.shape[0] * W.shape[1]))
    for i in range(0, W.shape[1]):
        for j in range(0, W.shape[0]):
            dir_matrix[(i*W.shape[0]) + j] = (S[i] - y[i]) * x[j]
    return dir_matrix

delta_w = xe_dir_shortcut(W, h, x, y)

Let’s verify that this gives us the same results as the longer matrix multiplication above:


[[ 0.01909135  0.09545676  0.07636541  0.02035314  0.10176572]
 [ 0.08141258 -0.07830167 -0.39150833 -0.31320667  0.02313243]
 [ 0.11566214  0.09252971  0.01572474  0.07862371  0.06289897]]

Now we have a simple function that will calculate the changes to the weights for a seemingly complicated single-layer of a neural network.

The Softmax Function Derivative (Part 1)


This post demonstrates the calculations behind the evaluation of the Softmax Derivative using Python. It is based on the excellent article by Eli Bendersky which can be found here.

The Softmax Function

The softmax function simply takes a vector of N dimensions and returns a probability distribution also of N dimensions. Each element of the output is in the range (0,1) and the sum of the elements of N is 1.0.

Each element of the output is given by the formula:

\sigma(z_{i})=\frac{e^{z_{i}}} {\sum_{j=1}^K e^{z_{j}}}

See for more details.

import numpy as np
x = np.random.random([5])

def softmax_basic(z):
    exps = np.exp(z)
    sums = np.sum(exps)
    return np.divide(exps, sums)


This should generate an output that looks something like this:

array([0.97337094, 0.85251098, 0.62495691, 0.63957056, 0.6969253 ])

We expect that the sum of those will be (close to) 1.0:


And so it is:


Calculating the derivative

We need to calculate the partial derivative of the probability outputs S_{i} with respect to each of the inputs x_{j}. For example:

\frac{\delta{S_{i}}}{\delta{x_{j}}}=\frac{\delta{\frac{e^{x_{i}}} {\sum_{k=1}^N e^{x_{k}}}}}{\delta{x_{j}}}

Following Bendersky’s derivation, we need to use the quotient rule for derivatives:

\text{If }f(x)=\frac{g(x)}{h(x)} \text{ then:}


From the Softmax function:

\begin{array}{rcl} g_{i} & = & e^{x_{i}}\\ h_{i} & = & \sum_{k=1}^N e^{x_{k}} \end{array}

The derivatives of these functions with respect to x_{j} are:

\frac{\delta{g_{i}}}{\delta{x_{j}}} = \left \{ \begin{aligned} &e^{x_{j}}, && \text{if}\ i=j \\ &0, && \text{otherwise} \end{aligned} \right.



Now we have to evalutate the quotient rule for the two seperate cases where i=j and where i\neq{j}:

Starting with i=j:

\frac{\delta{\frac{e^{x_{i}}} {\sum_{k=1}^N e^{x_{k}}}}}{\delta{x_{j}}}=\frac{e^{x_{j}}\sum_{k=1}^N e^{x_{k}}-e^{x_{j}}e^{x_{i}}}{\left[\sum_{k=1}^N e^{x_{k}}\right]^{2}}

Now we’ll just simplify this a bit:

\begin{array}{rcl} \frac{\delta{\frac{e^{x_{i}}} {\sum_{k=1}^N e^{x_{k}}}}}{\delta{x_{j}}} & = & \frac{e^{x_{j}}\left(\sum_{k=1}^N e^{x_{k}}-e^{x_{i}}\right)}{\left[\sum_{k=1}^N e^{x_{k}}\right]^{2}}\\ &=&\frac{e^{x_{j}}}{\sum_{k=1}^N e^{x_{k}}} \dot{}\frac{\sum_{k=1}^N e^{x_{k}}-e^{x_{i}}}{\sum_{k=1}^N e^{x_{k}}}\\ &=&\frac{e^{x_{j}}}{\sum_{k=1}^N e^{x_{k}}} \left(\frac{\sum_{k=1}^N e^{x_{k}}}{\sum_{k=1}^N e^{x_{k}}}-\frac{e^{x_{i}}}{\sum_{k=1}^N e^{x_{k}}}\right)\\ &=&\frac{e^{x_{j}}}{\sum_{k=1}^N e^{x_{k}}} \left(1-\frac{e^{x_{i}}}{\sum_{k=1}^N e^{x_{k}}}\right)\\ &=&\sigma(x_{j})(1-\sigma(x_{i})) \end{array}

For the case where i\neq j:

\begin{array}{rcl} \frac{\delta{\frac{e^{x_{i}}} {\sum_{k=1}^N e^{x_{k}}}}}{\delta{x_{j}}}&=&\frac{0-e^{x_{j}}e^{x_{i}}}{\left[\sum_{k=1}^N e^{x_{k}}\right]^{2}}\\ &=&0-\frac{e^{x_{j}}}{\sum_{k=1}^N e^{x_{k}}}\dot{}\frac{e^{x_{i}}}{\sum_{k=1}^N e^{x_{k}}}\\ &=&0-\sigma(x_{j})\sigma(x_{i}) \end{array}

Here are some examples of those function:

S = softmax_basic(x)

# where i = j
div = S[0] * (1 - S[0])
print("S[0] :", S[0], "div :", div)

div = S[1] * (1 - S[1])
print("S[1] :", S[1], "div :", div)

div = S[2] * (1 - S[2])
print("S[2] :", S[2], "div :", div)

# where i != j
div = 0 - (S[0] * S[1])
print("S[0] :", S[0], "S[1] :", S[1], "div :", div)

div = 0 - (S[1] * S[2])
print("S[1] :", S[1], "S[2] :", S[2], "div :", div)

div = 0 - (S[2] * S[3])
print("S[2] :", S[2], "S[3] :", S[3], "div :", div)

Here’s the output (yours may have different numbers):

S[0] : 0.16205870737014744 div : 0.13579568273566436
S[1] : 0.1530827392322831 div : 0.12964841418142392
S[2] : 0.22661095862407418 div : 0.1752584320555523
S[0] : 0.16205870737014744 S[1] : 0.1530827392322831 div : -0.024808390840665155
S[1] : 0.1530827392322831 S[2] : 0.22661095862407418 div : -0.034690226286226845
S[2] : 0.22661095862407418 S[3] : 0.226342511074585 div : -0.051291693411991836

Let’s tidy up the formula a little so we can come up with a decent implementation in the code.

Looking at the case where i=j:

\frac{\delta{S_{i}}}{\delta{x_{j}}}=\sigma(x_{j})(1-\sigma(x_{i}))\\ =\sigma(x_{j}) - \sigma(x_{j})\sigma(x_{i})

And where i\neq j


We can generalise that formula by calculating the Jacobian matrix. This matrix will look like this:

\frac{\delta{S}}{\delta{x}} = \left [ \begin{array}{ccc} \frac{\delta{S_{1}}}{\delta{x_{1}}} & \ldots & \frac{\delta{S_{1}}}{\delta{x_{N}}} \\ \ldots & \frac{\delta{S_{i}}}{\delta{x_{j}}} & \ldots \\ \frac{\delta{S_{N}}}{\delta{x_{1}}} & \ldots & \frac{\delta{S_{N}}}{\delta{x_{N}}} \end{array} \right ]

\frac{\delta{S}}{\delta{x}} = \left [ \begin{array}{ccc} \sigma(x_{1}) - \sigma(x_{1})\sigma(x_{1}) & \ldots & 0-\sigma(x_{1})\sigma(x_{N}) \\ \ldots & \sigma(x_{j}) - \sigma(x_{j})\sigma(x_{i}) & \ldots \\ 0-\sigma(x_{N})\sigma(x_{1}) & \ldots & \sigma(x_{N}) - \sigma(x_{N})\sigma(x_{N}) \end{array} \right ]

\frac{\delta{S}}{\delta{x}} = \left [ \begin{array}{ccc} \sigma(x_{1}) & \ldots & 0 \\ \ldots & \sigma(x_{j}) & \ldots \\ 0 & \ldots & \sigma(x_{N})\end{array} \right ] - \left[ \begin{array}{ccc} \sigma(x_{1})\sigma(x_{1}) & \ldots & \sigma(x_{1})\sigma(x_{N}) \\ \ldots & \sigma(x_{j})\sigma(x_{i}) & \ldots \\ \sigma(x_{N})\sigma(x_{1}) & \ldots & \sigma(x_{N})\sigma(x_{N}) \end{array} \right ]

The matrix on the left is simply the vector S laid out along a diagonal. Numpy provides a diag function to do this for us:


And that generates a matrix for us:

array([[0.16205871, 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.15308274, 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.22661096, 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.22634251, 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.23190508]])

The matrix on the right is comprised of every element in S multiplied by all the elements in S. So we can create the first matrix by repeating S in rows, create the second matrix by repeating S in columns (i.e. transposing the first matrix) and finally, multiplying them together element-wise. Numpy comes to our help again by providing a tile function:

S_vector = S.reshape(S.shape[0],1)
S_matrix = np.tile(S_vector,S.shape[0])
print(S_matrix, '\n')

And the output of this (again, yours will have different numbers due to the random initialisation):

[[0.16205871 0.16205871 0.16205871 0.16205871 0.16205871]
 [0.15308274 0.15308274 0.15308274 0.15308274 0.15308274]
 [0.22661096 0.22661096 0.22661096 0.22661096 0.22661096]
 [0.22634251 0.22634251 0.22634251 0.22634251 0.22634251]
 [0.23190508 0.23190508 0.23190508 0.23190508 0.23190508]] 

[[0.16205871 0.15308274 0.22661096 0.22634251 0.23190508]
 [0.16205871 0.15308274 0.22661096 0.22634251 0.23190508]
 [0.16205871 0.15308274 0.22661096 0.22634251 0.23190508]
 [0.16205871 0.15308274 0.22661096 0.22634251 0.23190508]
 [0.16205871 0.15308274 0.22661096 0.22634251 0.23190508]]

Finally, let’s bring that all together into the single formula to calculate the Jacobian derivative of the Softmax function is:

np.diag(S) - (S_matrix * np.transpose(S_matrix))

Example output:

array([[ 0.13579568, -0.02480839, -0.03672428, -0.03668077, -0.03758224],
       [-0.02480839,  0.12964841, -0.03469023, -0.03464913, -0.03550067],
       [-0.03672428, -0.03469023,  0.17525843, -0.05129169, -0.05255223],
       [-0.03668077, -0.03464913, -0.05129169,  0.17511158, -0.05248998],
       [-0.03758224, -0.03550067, -0.05255223, -0.05248998,  0.17812512]])

So that’s the Softmax function and it’s derivative.

In the next part, we’ll look at how to combine this derivative with the summation function, common in artificial neural networks.

Neural Networks and the generalisation problem

Over the last few weeks, a robust debate has been taking place online about the prospects that Deep Learning neural networks would lead to advances in the quest for Artificial General Intelligence.

All current AI is what is known as Artificial Narrow Intelligence. This means that the models work well (sometimes extremely well) on specific problems that are well defined. Unfortunately, they are also quite brittle and do not generalise to other problems, or even variants of the problem they are trained on. By contrast, a long-term goal of the field is to get to AIs that can generalise and extrapolate, amongst other things. This is called Artificial General Intelligence.

The debate started back in September when Geoffrey Hinton proposed that researchers should start looking at alternatives to the default back propagation algorithms that are currently quite successful. This was followed up by a more detailed critical review published by Gary Marcus earlier this month outlining many of the problems with neural networks and deep learning. There has been quite a bit of debate about the merits of Marcus’ points on social media, so much so that he published a defence on Medium, responding to the various criticisms raised.

One of the most serious points is that artificial neural networks don’t generalise and cannot extrapolate from what they have been trained on to new instances with different characteristics. This is quite simple to demonstrate using Keras and TensorFlow. In this example, we have a neural network that tries to learn to reverse the digits in a binary number:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.optimizers import SGD

x = np.array([[0,0,0,0], [0,0,0,1], [0,0,1,0],
      [0,1,0,1], [0,1,1,0], [0,1,1,1]])
y = np.array([[0,0,0,0], [1,0,0,0], [0,1,0,0],
      [1,0,1,0], [0,1,1,0], [1,1,1,0]])

model = Sequential()
model.add(Dense(4, input_shape=(4,)))
model.compile(loss='mean_squared_error', optimizer=SGD(lr=0.1)),y, epochs=10000, batch_size=8, verbose=0)

The above model  trains the network on examples where the first digit is always 0. If we test the trained network, it will perform as expected, reversing the order of the digits:


generates the following output:

array([[0., 0., 0., 0.],
   [1., 0., 0., 0.],
   [0., 1., 0., 0.],
   [1., 0., 1., 0.],
   [0., 1., 1., 0.],
   [1., 1., 1., 0.]], dtype=float32)

Which is exactly what we want. However, when we change the inputs to begin with ‘1’ then the network gets confused:

offdim = np.array([[1,0,0,0], [1,0,0,1], [1,0,1,0],
  [1,1,0,1], [1,1,1,0], [1,1,1,1]])

as we can see in the following output:

array([[0., 0., 0., 0.],
   [1., 0., 0., 0.],
   [0., 1., 0., 0.],
   [1., 0., 1., 0.],
   [0., 1., 1., 0.],
   [1., 1., 1., 0.]], dtype=float32)

Clearly the network isn’t learning anything about the concept of “reversing” even though it looks like it initially. It has completely ignored the new number.

This lack of generalising demonstrates clearly that neural networks, as currently architected, don’t operate on any level of abstraction. This is one of the fundamental problems with neural networks that must be solved if we have any hope of them advancing our efforts towards Artificial General Intelligence.

As Marcus points out in his defence article, there may be a need for a blend of techniques, including old-school symbolic AI to help deep learning networks move forward towards solving the generalisation problem.


Deep Learning, A Critical Appraisal – Gary Marcus

In defense of skepticism about deep learning – Gary Marcus

Sample code on Github:

Deep Learning Dead-End?

Deep Learning is at the core of much of modern Artificial Intelligence. It has had some spectacular recent successes, not least being a major part of the system that beat the world champion at Go.

Key to its success is the Back-Propagation algorithm, usually shortened to “Backprop”. I’ve written elsewhere about how this algorithm works, but essentially, it takes an error in the output of a neural network and propagates it backwards through the network, adjusting the network’s configuration as it goes to reduce that output error.

This algorithm has been so successful that deep learning neural networks are finding applications in a broad range of industries. Much of the recent popularisation of artificial intelligence and machine learning is due to this very success.

Now one of the longest proponents of this algorithm, Geoffrey Hinton from the University of Toronto has suggested that if progress is to be made on AI, then focus must shift away from Backprop. His view is based on the observation that the brain doesn’t learn that way and if intelligent machines are to be developed, new techniques are required.

In particular, he suggests that Unsupervised Learning will prove to be a fertile area. The method of learning does not rely on having fully labelled or classified datasets the way Supervised Learning and Backprop need. Instead it tries to understand patterns within the data itself to be able to make predictions and classifications.

Deep Learning does still have a lot to offer. However, given it’s requirements for large amounts of data and computing power, there is an increasing awareness that alternative machine learning techniques may prove to be equally fruitful.

Source: Artificial intelligence pioneer says we need to start over – Axios

XOR Revisited: Keras and TensorFlow

A few weeks ago, it was announced that Keras would be getting official Google support and would become part of the TensorFlow machine learning library. Keras is a collection of high-level APIs in Python for creating and training neural networks, using either Theano or TensorFlow as the underlying engine.

Given my previous posts on implementing an XOR-solving neural network in a variety of different languages and tools, I thought it was time to see what it would look like in Keras.

XOR can be expressed as a classification problem that is best illustrated in a diagram. The goal is to create a neural network that will correctly predict the values 0 or 1, depending on the inputs x1 and x2 as shown.

xor graph

The neural network that is capable of being trained to solve that problem looks like this:


If you’d like to understand why this is the case, have a look at the detailed explanation in the posts implementing the solution in Octave.

So how does this look in Keras? Well it’s rather simple. Assuming you’ve already installed Keras, we’ll start with setting up the classification problem and the expected outputs:

import numpy as np

x = np.array([[0,0], [0,1], [1,0], [1,1]])
y = np.array([[0], [1], [1], [0]])

So far, so good. We’re using numpy arrays to store our inputs (x) and outputs (y). Now for the neural network definition:

from keras.models import Sequential
from keras.layers import Dense, Activation

model = Sequential()
model.add(Dense(2, input_shape=(2,)))

The Sequential model is simply a sequence of layers making up the network. Our diagram above has a set of inputs being fed into two processing layers. We’ve already defined the inputs, so all we need to do is add the other two layers.

In Keras, we’ll use Dense layers, which simply means they are is fully connected. The parameters indicate that the first layer has two nodes and the second layer has one node, corresponding to the diagram above.

The first layer also has the shape of the inputs which in this case is a one-dimensional vector with 2 elements. The second layer’s inputs will be inferred from the first layer.

We then add an Activation of type ‘sigmoid’ to each layer, again matching our neural network definition.

Note that Keras looks after the bias input without us having to explicitly code for it. In addition, Keras also looks after the weights (Θ1 and Θ2). This makes our neural network definition really straightforward and shows the benefits of using a high-level abstraction.

Finally, we apply a loss function and learning mode for Keras to be able to adjust the neural network:

model.compile(loss='mean_squared_error', optimizer='sgd', metrics=['accuracy'])

In this example, we’ll use the standard Mean Squared Error loss function and Stochastic Gradient Descent optimiser. And that’s it for the network definition.

If you want to see that the network looks like, use:


The network should look like this:

>> model.summary()
Layer (type)                  Output Shape         Param # 
dense_1 (Dense)               (None, 2)            6 
activation_1 (Activation)     (None, 2)            0 
dense_2 (Dense)               (None, 1)            3 
activation_2 (Activation)     (None, 1)            0 
Total params: 9
Trainable params: 9
Non-trainable params: 0

Now we just need to kick off the training of the network.,y, epochs=100000, batch_size=4)

All going well, the network weights will converge on a solution that can correctly classify the inputs (if not, you may need to up the number of epochs):

>>> model.predict(x, verbose=1)

4/4 [==============================] - 0s

array([[ 0.07856689],

       [ 0.91362464],

       [ 0.92543262],

       [ 0.06886736]], dtype=float32)


Clearly this network is on it’s way to converging on the original expected outputs we defined above (y).

So that’s all there is to a Keras version of the XOR-solving neural network. The fact that it is using TensorFlow as the engine is completely hidden and that makes implementing the network a lot simpler.


Solving XOR with a Neural Network in Python

In the previous few posts, I detailed a simple neural network to solve the XOR problem in a nice handy package called Octave.

I find Octave quite useful as it is built to do linear algebra and matrix operations, both of which are crucial to standard feed-forward multi-layer neural networks. However, it isn’t that fast and you would not be building any deep-learning models on large datasets using it.

Coding in Python

There is also a numerical operation library available in Python called NumPy. This library has found widespread use in building neural networks, so I wanted to compare a similar network using it to a network in Octave.

The last post showed an Octave function to solve the XOR problem. Recall the problem was that we wanted to have a neural network correctly generate an output of zero when x1 and x2 are the same (the yellow circles) and output of one when x1 and x2 are different (the blue circles):

xor graph

Here is the topology of the network we want to train:


Lastly, here is a function in Python that is equivalent to the Octave xor_nn function. The code also includes a sigmoid function:

import numpy as np
import math

def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

def xor_nn(XOR, THETA1, THETA2, init_w=0, learn=0, alpha=0.01):
        if init_w == 1:
                THETA1 = 2*np.random.random([2,3]) - 1
                THETA2 = 2*np.random.random([1,3]) - 1

        T1_DELTA = np.zeros(THETA1.shape)
        T2_DELTA = np.zeros(THETA2.shape)
        m = 0
        J = 0.0

        for x in XOR:
                A1 = np.vstack(([1], np.transpose(x[0:2][np.newaxis])))
                Z2 =, A1)
                A2 = np.vstack(([1], sigmoid(Z2)))
                Z3 =, A2)
                h = sigmoid(Z3)

                J = J + (x[2] * math.log(h[0])) + ((1 - x[2]) * math.log(1 - h[0]));
                m = m + 1;
                if learn == 1:
                        delta3 = h - x[2]
                        delta2 = (, delta3) * (A2 * (1 - A2)))[1:]
                        T2_DELTA = T2_DELTA +, np.transpose(A2))
                        T1_DELTA = T1_DELTA +, np.transpose(A1))

        J = J / -m

        if learn == 1:
                THETA1 = THETA1 - (alpha * (T1_DELTA / m))
                THETA2 = THETA2 - (alpha * (T2_DELTA / m))

        return (THETA1, THETA2)

(This code is available on Github if you want to download it: Python NN on GitHub)

If you want more detail on how this function works, have a look back at Part 1, Part 2 and Part 3 of the series on the Octave version.

Comparing Python and Octave

To be sure that they both operate identically, I first generated some random numbers. These numbers were used to initialize the parameters (THETA1 and THETA2) in both functions. If you run several epochs (I ran 1000), then you will see that the values of THETA1 and THETA2 remain identical in Octave and Python. This makes sense as the only non-deterministic part of the algorithm is the initialization of the network’s parameters (i.e. the weights).

So we can be sure that both functions are executing the same algorithm.

The next step is to test how fast the networks run a large number of epochs. On my (ageing) MacBook, the Octave code runs 1000 epochs in about 9.5 seconds on average, while the Python code runs the same number in just 5.4 seconds. This is a pretty good performance improvement for what is practically the same code.

So if you are familiar with Python and want to start developing your own neural networks, then NumPy will give you the tools you need.



A Simple Neural Network in Octave – Part 3

This is the final post in a short series looking at implementing a small neural network to solve the XOR problem in Octave.

In the first post, we looked at how to set up the parameters of the network (the weights between the nodes), feed in an example and get the network to predict some output. The second post looked at implementing the back propagation algorithm, which adjusts those parameters in the network to improve the accuracy of it’s outputs.

Now it’s time to pull those pieces together so that we can let the network adjust the parameters to get as close to the right output as we desire.

First, we’ll set up a function to run all the training examples through the network. Presenting all the training examples once (one after the other) is called an epoch. The goal of this function will be to return the updated parameters after an epoch:

function [THETA1_new, THETA2_new] = xor_nn(XOR, THETA1, THETA2, init_w=0, learn=0, alpha=0.01)

Just a quick note on the notation and parameters in this function header. The first part after the word “function” specifies what this function returns. In our case, we want new versions of the parameters in the network (represented by THETA1 and THETA2).

The name of the function is “xor_nn” and the parameters to the function are contained within the brackets:

  • XOR This is the representation of the training set.
  • THETA1 & THETA2 Current values for the parameters in the network.
  • init_w=0 This tells the function to initialise the weights in the network
  • learn=0 This tells the network to learn from the examples (see below)
  • alpha=0.01 This is the learning rate (default value is 0.01)

Note that any parameter that has an “=” sign is optional and will get the default value shown if it is not provided explicitly.

The first step in our function is to check if we need to initialize the weights.

if (init_w == 1)
    THETA1 = 2*rand(2,3) - 1;
    THETA2 = 2*rand(1,3) - 1;

This is simply the same code as before, inside an “if” block.

Now we initialize the cost variable. This is done for every epoch:

J = 0.0;

In the previous post, we looked at updating the weights in the network after calculating the cost after the network has processed a single training example. This is a specific type of learning called “online learning”. There is another type of learning called “batch learning”, and this works by updating the weights once after all the training examples have been processed. Let’s implement that instead in our function.

We need to record the number of training examples and the total delta across all the training examples (rather than just across one as before). So here are the extra variables required:

T1_DELTA = zeros(size(THETA1));
T2_DELTA = zeros(size(THETA2));
m = 0;

Remember that THETA1 and THETA2 are matrices, so we need to initialize every element of the matrix, and make our delta matrices the same size. We’ll also use “m” to record the number of training examples that we present to the network in the epoch.

Now lets set up a loop to present those training examples to the network one by one:

for i = 1:rows(XOR)

This simply says repeat the following block for the same number of rows that exist in our XOR data set.

Now put in the code from Part 1 that processes an input:

A1 = [1; XOR(i,1:2)'];
Z2 = THETA1 * A1;
A2 = [1; sigmoid(Z2)];
Z3 = THETA2 * A2;
h = sigmoid(Z3);
J = J + ( XOR(i,3) * log(h) ) + ( (1 - XOR(i,3)) * log(1 - h) );
m = m + 1;

Note the slight change in moving the bias node from the layer input calculation (Z3) to the output from the previous layer (A2). This just makes the code slightly simpler.

Then we add the code from Part 2, inside a test to see if we are in learning mode. The code has been slightly modified as we are implementing batch learning rather than online learning. In batch mode, we want to calculate the errors across all the examples:

if (learn == 1)
    delta3 = h - XOR(i,3);
    delta2 = ((THETA2' * delta3) .* (A2 .* (1 - A2)))(2:end);
    T2_DELTA = T2_DELTA + (delta3 * A2');
    T1_DELTA = T1_DELTA + (delta2 * A1');
    disp('Hypothesis for '), disp(XOR(i,1:2)), disp('is '), disp(h);

If we’re not learning from this example, then we simply display the cost of this particular example.

That’s the end of the loop, so we close the block in Octave with:


Now we calculate the average cost across all the examples:

J = J / -m;

This gives us a useful guide to see if the cost is reducing each time we run the function (which it should!).

Now we’ll update the weights in the network, remembering to divide by the number of training examples as we are in batch mode:

if (learn==1)
    THETA1 = THETA1 - (alpha * (T1_DELTA / m));
    THETA2 = THETA2 - (alpha * (T2_DELTA / m));
    disp('J: '), disp(J);

Lastly, we’ll set the return values as our new updated weights:

THETA1_new = THETA1;
THETA2_new = THETA2;

And ending the function:


So that’s all that’s required to run all the training examples through the network. If you run the function a few times, you should see the cost of the network reducing as we expect it to:


If you run the function as shown, you’ll see that the cost (J) reduces each time, but not by a lot. It would be a bit boring to have to run the function each time, so let’s set up a short script to run it many times:

XOR = [0,0,0; 0,1,1; 1,0,1; 1,1,0];
THETA1 = 0;
THETA2 = 0;

[THETA1, THETA2] = xor_nn(XOR, THETA1, THETA2, 1, 1, 0.01);
for i = 1:100000
    [THETA1, THETA2] = xor_nn(XOR, THETA1, THETA2, 0, 1, 0.01);
    if (mod(i,1000) == 0)
        disp('Iteration : '), disp(i)
        [THETA1, THETA2] = xor_nn(XOR, THETA1, THETA2);

There’s not a lot here that’s new. The first call to the xor_nn function initialises the weights. The loop calls the function 100,000 times, printing out the cost every 1000 iterations. That may seem like a lot of function calls, but remember that the network weights are being adjusted by a small amount each time.

If you run that script, you should see something like this as the output:


When the loop gets to the end, the output will be close to this (the exact numbers may be different):


As you can see, the network guesses small numbers (close to 0) for the first and last XOR examples and high (close to 1) for the two middle examples. This is close to what we want the network to do, so we’ve successfully trained this particular network to recognise the XOR function.

If you wish to download the code directly, it’s available here on Github:

There are a lot of things that we should do to make sure that the algorithm is optimised, including checking that the default learning rate of 0.01 is actually the right rate. But that is a job for another day.

Now that you’ve seen a simple neural network recognizing patterns and learning from examples, you can implement your own in Octave for other more interesting problems.

A Simple Neural Network in Octave – Part 2

In the last post in this short series, we looked at how to build a small neural network to solve the XOR problem. The network we built can generate an output for any of the inputs that we gave it from the table. It’s most likely that it sometimes gets the wrong answer, because the weights on the links were randomly generated.

Now we’ll look at getting the network to learn from it’s mistakes so that it gets better each time. In the context of neural networks, learning means adjusting those weights so it gets better and better at giving the right answers.

The first step is to find out how wrong the network is. This is called the cost or loss of the network. There are several ways to do this from the simple to the complex. A common approach is called “logistic regression”, which avoids problems of the neural network getting stuck in it’s learning.

Recall from our XOR table that there are two possible outputs, either 1 or 0. There will be two different versions of the cost depending on the output:

lr cost function

Remember that hθ(x) is the hypothesis that the network produces, given the input x=(x1,x2) and the weights θ. “log” is the standard mathematical function that calculates the natural logarithm of the value given (hence the name “logistic regression”).

We can combine these two instances together in a single formula that makes it easier and faster to translate into code:

lr cost function simple

So our code to represent this:

y = 0;
J = ((y * log(h)) + ((1 - y) * log(1 - h))) * -1 ;

This gives us a good idea how the network is performing on this particular input, with y set to the expected output from the first row of the XOR table.

A better network will have a lower cost, so our goal in training the network is to minimize the cost. The algorithm used to do that is called the back propagation algorithm (or backprop for short).

The first step is to evaluate the error at the output node (layer 3):


Note that the superscript 3 refers to the layer and doesn’t mean cubed! The code in Octave is:

delta3 = h - y;

So far, so straightforward. Backprop takes the error at a particular layer within a network and propagates it backward through the network. In general, the formula for that is:

nn node backprop calc

This looks a little complicated so let’s dissect it to make it easier to understand and translate into Octave code.

Firstly, the l superscript on some of the terms in the formula refers to the layer in the network. The next layer we need to work out is layer 2, so l = 2. We already know what δ3 is from the previous code above.

Next, Θ(l-1) refers to the matrix of weights at the layer l-1. The T superscript means to transpose the matrix (basically swap the rows and columns). This makes it easier for us to multiple the two matrices. Octave being what it is makes this very easy for us, so the first part of the formula is easy to translate:

THETA2’ * delta3

Note the apostrophe after THETA2. This is Octave’s way of transposing a matrix.

The second part of the formula represents the derivative of the activation function. Fortunately, there is an easy way to calculate that without getting into the complexity of calculus:


The “.*” means element-wise multiplication of a matrix. So each term in the first matrix is multiplied by the corresponding term in the second matrix. In Octave, it’s almost exactly the same (for layer 2):

Z2 .* (1 - Z2)

Putting those two parts together then:

delta2 = ((THETA2' * delta3) .* (Z2 .* (1 – Z2)))(2:end);

Remember that these are matrix operations, so delta2 will be a matrix too. The last part of the line of code (2:end) means to take the second element of the matrix to the end. The reason this is done is because we are not propagating an error back from the bias node.

At this point, we have the error in the output at layer 3 and layer 2. Since layer 1 is the input layer, there isn’t any error there, so we don’t need to calculate anything.

The last step in the backprop is to adjust the weights now that we know those errors. This is also quite simple:

weight adj

The formula says that the change to the weights Θ(l) is the activation values multiplied by the errors, adjusted by the value α, which represents the learning rate. The learning rate governs how large the adjustments to the weights are. It’s a bit of a misnomer in that it doesn’t relate to how “fast” the network learns, although that can be the effect of having a relatively larger value. For solving the XOR problem, a learning rate of 0.01 is sufficient.

So in Octave:

THETA2 = THETA2 - (0.01 * (delta3 * A2'));
THETA1 = THETA1 - (0.01 * (delta2 * A1'));

Once we’ve adjusted the weights in the matrices, the network will give us a slightly different hypothesis for each of our inputs. We can recalculate the cost function above to see how much improvement there is.

If we repeat the calculations, the cost should start reducing and the network will get better and better at producing the desired output.


In the next post, we’ll pull all the code together and see what happens when we start training the network across all the examples.

The last part of this series is here.

A Simple Neural Network In Octave – Part 1

Getting started with neural networks can seem to be a daunting prospect, even if you have some programming experience. The many examples on the Internet dive straight into the mathematics of what the neural network is doing or are full of jargon that can make it a little difficult to understand what’s going on, not to mention how to implement it in actual code.

So I decided to write a post to help myself understand the mechanics and it turns out that it will require a few parts to get through it!

Anyway, to start with, there is a great free numerical computation package called Octave that you can use to play around with Machine Learning concepts. Octave itself does not know about neural networks, but it does know how to do fast matrix multiplication. This important feature of Octave will be made clear later.

You can find out more and download the Octave software here:

A nice toy problem to start with is the XOR problem. XOR means “exclusive OR’ and it is best explained in a table:

Given this input

Produce this output


x2 y










1 1


What the table shows is that there are two inputs (labelled x1 and x2) and one output (labelled y). When x1 and x2 are both set to 0, the output we expect is also 0. Similarly, when x1 and x2 are both set to 1, the output is also 0. However, when x1 and x2 are set to different inputs, then the output will be 1.

The challenge is to build a neural network that can successfully learn to produce the correct output given the four different inputs in the table.

Let’s have a quick look at a graphical representation of the problem:

xor graph

The graph shows the two inputs x1 and x2 on their respective axes. Where x1 and x2 have the same value, the graph shows yellow circles and similarly where x1 and x2 are different, the graph shows blue circles.

There’s an important constraint that the graph shows clearly. It isn’t possible to draw a single straight line across the graph so that the yellow circles are on one side and the blue circles are on the other side. This is called “linear-separability”. So the XOR problem is not linearly separable, which means that we are going to need a multi-layer neural network to solve it.

The diagram below shows a typical configuration for a neural network that can be trained to solve the XOR problem.


There are a number of things to note about about this particular network. Firstly, the inputs in the table above (x1 and x2), are mapped directly onto the nodes represented by a1 and a2. Secondly, this first layer of nodes also contains a bias node, that has its output always set to +1. Thirdly, the nodes in the middle of the diagram are collectively called the hidden nodes or hidden layer. It also contains a bias node set to +1. The outputs of the other nodes are labelled with a small greek letter sigma, which will become clearer below. Lastly, the output of the network is labelled h.

It’s useful to represent the inputs as a vector (a one-dimensional matrix) that looks like this:


This can be translated directly into Octave as:

A1 = [1; 0; 0];

In the above example, 1 represents the bias node, and the two zeros represent the first row of the variables from our table above replacing the a1 and a2 in the vector. You can replace the two zeros with values from other rows on the table to see what happens to the output after we’ve built up the network.

The links from the nodes in the first layer to the nodes in the second layer have weights associated with them, denoted by the letter theta (along with a superscript 1) in the diagram of our network. Similarly, the weights in layer 2 are also shown with a superscript 2.

The subscript numbers identify the nodes at either end of the link, in the form (i,j), where i is the node receiving the signal and j is the node sending the signal. This is slightly counter-intuitive, where the normal expectation is that the signal moves from i to j. The reason for this is that it is possible to represent all the weights at a given layer in a single matrix that looks like this:


Typically, the initial values of the weights in a network are set to random values between -1 and +1, since we have no idea what they actually should be. We can do this in Octave as follows:

THETA1 = 2*rand(2,3) - 1;

And for Θ2, which has 3 nodes linked to the one final node:

THETA2 = 2*rand(1,3) - 1;

So now we can simply multiply those matrices together to work out what the input to the second layer is:


where Z2 represents the input to the second layer of nodes. This multiplication will result in another vector (1 dimensional matrix). Our Octave code for this is simply:

Z2 = THETA1 * A1;

This is a lot better (and faster) than having to calculate each of these inputs separately. In fact, most machine learning libraries will provide fast matrix multiplication (and other matrix operations), precisely because it is an efficient way to model machine learning strategies.

To calculate the output of layer 2, we must apply a function to the input. The typical function used is called a sigmoid function (represented by the sigma in the network diagram) and it looks like this in Octave:

function [result] = sigmoid(x)
    result = 1.0 ./ (1.0 + exp(-x));

So the output of layer 2 is the sigmoid of the input, or


which in Octave is:

A2 = [1; sigmoid(Z2)];

Note that we add the extra 1 as the first element in the vector to represent the bias that will be needed as an input into layer 3.

We repeat the process for layer 3, multiplying the output of layer 2 by the matrix of weights for layer 2 to get the input for layer 3 and then getting the sigmoid of the result:



The output from the network is then a single value, called our hypothesis (h). This is the network’s guess at the output given it’s input. The Octave code for this is:

Z3 = THETA2 * A2;
h = sigmoid(Z3);

That’s the network fully constructed. We can put any values from the table in the front (putting them in the A1 vector) and see what the output from the network is (Hypothesis).

Here is an example of the above code running:


It is almost certain that the network will get the wrong answer (outputting a 1 when it should be outputting a 0 and vice versa). The example above shows h to be 0.31328 (you may get a completely different value), which is clearly wrong, as for a (0,0) input, we should get an output of 0.

In the next post in this series, we’ll look at how to get the network to learn from its mistakes and how it can get much better at outputting correct values.

Here is the next part of this series