Solving XOR with a Neural Network in Python

In the previous few posts, I detailed a simple neural network to solve the XOR problem in a nice handy package called Octave.

I find Octave quite useful as it is built to do linear algebra and matrix operations, both of which are crucial to standard feed-forward multi-layer neural networks. However, it isn’t that fast and you would not be building any deep-learning models on large datasets using it.

Coding in Python

There is also a numerical operation library available in Python called NumPy. This library has found widespread use in building neural networks, so I wanted to compare a similar network using it to a network in Octave.

The last post showed an Octave function to solve the XOR problem. Recall that we want the neural network to output zero when x1 and x2 are the same (the yellow circles) and one when x1 and x2 are different (the blue circles):

xor graph

Here is the topology of the network we want to train:


Lastly, here is a function in Python that is equivalent to the Octave xor_nn function. The code also includes a sigmoid function:

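import numpy as np
import math

def sigmoid(x):
        # Element-wise logistic function
        return 1.0 / (1.0 + np.exp(-x))

def xor_nn(XOR, THETA1, THETA2, init_w=0, learn=0, alpha=0.01):
        # Optionally start from random weights in the range [-1, 1)
        if init_w == 1:
                THETA1 = 2*np.random.random([2,3]) - 1
                THETA2 = 2*np.random.random([1,3]) - 1

        # Accumulators for the batch update, the example count and the cost
        T1_DELTA = np.zeros(THETA1.shape)
        T2_DELTA = np.zeros(THETA2.shape)
        m = 0
        J = 0.0

        for x in XOR:
                # Forward pass: prepend the bias node to each layer's output
                A1 = np.vstack(([1], np.transpose(x[0:2][np.newaxis])))
                Z2 = np.dot(THETA1, A1)
                A2 = np.vstack(([1], sigmoid(Z2)))
                Z3 = np.dot(THETA2, A2)
                h = sigmoid(Z3)

                # h is a 1x1 array, so take its single element for math.log
                J = J + (x[2] * math.log(h[0, 0])) + ((1 - x[2]) * math.log(1 - h[0, 0]))
                m = m + 1
                if learn == 1:
                        # Backpropagate the error, dropping the bias row from delta2
                        delta3 = h - x[2]
                        delta2 = (np.dot(np.transpose(THETA2), delta3) * (A2 * (1 - A2)))[1:]
                        T2_DELTA = T2_DELTA + np.dot(delta3, np.transpose(A2))
                        T1_DELTA = T1_DELTA + np.dot(delta2, np.transpose(A1))

        # Average cost over the epoch (calculated for inspection, not returned)
        J = J / -m

        if learn == 1:
                THETA1 = THETA1 - (alpha * (T1_DELTA / m))
                THETA2 = THETA2 - (alpha * (T2_DELTA / m))

        return (THETA1, THETA2)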

(This code is available on GitHub if you want to download it: Python NN on GitHub)

If you want more detail on how this function works, have a look back at Part 1, Part 2 and Part 3 of the series on the Octave version.

Comparing Python and Octave

To be sure that they both operate identically, I first generated some random numbers. These numbers were used to initialize the parameters (THETA1 and THETA2) in both functions. If you run several epochs (I ran 1000), then you will see that the values of THETA1 and THETA2 remain identical in Octave and Python. This makes sense as the only non-deterministic part of the algorithm is the initialization of the network’s parameters (i.e. the weights).

So we can be sure that both functions are executing the same algorithm.
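One possible way to share the starting weights between the two environments is to write them to plain text files from Python. This is only a sketch of the idea and not necessarily how the comparison above was set up: the file names, the seed and the use of a temporary directory are all assumptions made for illustration. Whitespace-delimited numeric text files like these can normally be read back in Octave with its load function.

```python
import os
import tempfile

import numpy as np

# Hypothetical starting weights, using the same shapes as the network above
rng = np.random.default_rng(5)   # arbitrary seed
THETA1 = 2 * rng.random((2, 3)) - 1
THETA2 = 2 * rng.random((1, 3)) - 1

# Write each matrix as whitespace-delimited text (file names are assumptions)
out_dir = tempfile.mkdtemp()
path1 = os.path.join(out_dir, "theta1.txt")
path2 = os.path.join(out_dir, "theta2.txt")
np.savetxt(path1, THETA1)
np.savetxt(path2, THETA2)

# Read them back to confirm that nothing was lost on the way
loaded1 = np.loadtxt(path1, ndmin=2)
loaded2 = np.loadtxt(path2, ndmin=2)
print(np.allclose(loaded1, THETA1), np.allclose(loaded2, THETA2))
```

With both implementations starting from the same numbers, any difference in the weights after training would point to a difference in the code rather than in the initialisation.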

The next step is to test how fast the networks run a large number of epochs. On my (ageing) MacBook, the Octave code runs 1000 epochs in about 9.5 seconds on average, while the Python code runs the same number in just 5.4 seconds. This is a pretty good performance improvement for what is practically the same code.
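If you want to take a similar measurement on your own machine, a rough timing harness might look like the sketch below. It is not the benchmark behind the figures above: run_epoch is a hypothetical helper that processes all four examples in one go instead of looping over them, and the seed and epoch count are arbitrary. The absolute numbers will depend entirely on your hardware.

```python
import time

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def run_epoch(X, y, theta1, theta2, alpha=0.01):
    """One batch epoch over every row of X (hypothetical vectorised helper)."""
    m = X.shape[0]
    a1 = np.hstack([np.ones((m, 1)), X])                        # (m, 3) bias + inputs
    a2 = np.hstack([np.ones((m, 1)), sigmoid(a1 @ theta1.T)])   # (m, 3) bias + hidden
    h = sigmoid(a2 @ theta2.T)                                  # (m, 1) hypotheses
    delta3 = h - y                                              # output error
    delta2 = ((delta3 @ theta2) * a2 * (1 - a2))[:, 1:]         # drop the bias column
    theta2 = theta2 - alpha * (delta3.T @ a2) / m
    theta1 = theta1 - alpha * (delta2.T @ a1) / m
    return theta1, theta2

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)   # fixed seed so that runs are repeatable
theta1 = 2 * rng.random((2, 3)) - 1
theta2 = 2 * rng.random((1, 3)) - 1

epochs = 1000
start = time.perf_counter()
for _ in range(epochs):
    theta1, theta2 = run_epoch(X, y, theta1, theta2)
elapsed = time.perf_counter() - start
print(f"{epochs} epochs took {elapsed:.4f} seconds")
```

Because this version is vectorised, it is not a like-for-like comparison with the looping xor_nn function, so treat the result as indicative only.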

So if you are familiar with Python and want to start developing your own neural networks, then NumPy will give you the tools you need.



A Simple Neural Network in Octave – Part 3

This is the final post in a short series looking at implementing a small neural network to solve the XOR problem in Octave.

In the first post, we looked at how to set up the parameters of the network (the weights between the nodes), feed in an example and get the network to predict some output. The second post looked at implementing the back propagation algorithm, which adjusts those parameters in the network to improve the accuracy of its outputs.

Now it’s time to pull those pieces together so that we can let the network adjust the parameters to get as close to the right output as we desire.

First, we’ll set up a function to run all the training examples through the network. Presenting all the training examples once (one after the other) is called an epoch. The goal of this function will be to return the updated parameters after an epoch:

function [THETA1_new, THETA2_new] = xor_nn(XOR, THETA1, THETA2, init_w=0, learn=0, alpha=0.01)

Just a quick note on the notation and parameters in this function header. The first part after the word “function” specifies what this function returns. In our case, we want new versions of the parameters in the network (represented by THETA1 and THETA2).

The name of the function is “xor_nn” and the parameters to the function are contained within the brackets:

  • XOR This is the representation of the training set.
  • THETA1 & THETA2 Current values for the parameters in the network.
  • init_w=0 Set this to 1 to tell the function to initialise the weights in the network
  • learn=0 Set this to 1 to tell the network to learn from the examples (see below)
  • alpha=0.01 This is the learning rate (default value is 0.01)

Note that any parameter that has an “=” sign is optional and will get the default value shown if it is not provided explicitly.

The first step in our function is to check if we need to initialize the weights.

if (init_w == 1)
    THETA1 = 2*rand(2,3) - 1;
    THETA2 = 2*rand(1,3) - 1;
endif

This is simply the same code as before, inside an “if” block.

Now we initialize the cost variable. This is done for every epoch:

J = 0.0;

In the previous post, we looked at updating the weights in the network after calculating the cost after the network has processed a single training example. This is a specific type of learning called “online learning”. There is another type of learning called “batch learning”, and this works by updating the weights once after all the training examples have been processed. Let’s implement that instead in our function.
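The difference between the two schemes can be sketched in Python on a deliberately tiny model. This is an illustration only: it uses a single sigmoid unit learning the OR function rather than the XOR network, and the helper names online_epoch and batch_epoch, along with the learning rate of 0.5, are assumptions chosen to keep the example short.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, X, y):
    """Average cross-entropy cost over all the examples."""
    h = sigmoid(X @ w)
    return float(-np.mean(y * np.log(h) + (1 - y) * np.log(1 - h)))

def online_epoch(w, X, y, alpha):
    """Online learning: update the weights after every single example."""
    for x_i, y_i in zip(X, y):
        grad = (sigmoid(x_i @ w) - y_i) * x_i
        w = w - alpha * grad
    return w

def batch_epoch(w, X, y, alpha):
    """Batch learning: accumulate over all examples, then update once."""
    total = np.zeros_like(w)
    for x_i, y_i in zip(X, y):
        total += (sigmoid(x_i @ w) - y_i) * x_i
    return w - alpha * total / len(X)

# OR data set; the first column is the bias input
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
y = np.array([0, 1, 1, 1], dtype=float)

w_online = np.zeros(3)
w_batch = np.zeros(3)
for _ in range(100):
    w_online = online_epoch(w_online, X, y, alpha=0.5)
    w_batch = batch_epoch(w_batch, X, y, alpha=0.5)

print("online cost:", cost(w_online, X, y))
print("batch cost: ", cost(w_batch, X, y))
```

Both versions visit the same examples in the same order. The only difference is when the weights change: four times per epoch for online learning, once per epoch for batch learning.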

We need to record the number of training examples and the total delta across all the training examples (rather than just across one as before). So here are the extra variables required:

T1_DELTA = zeros(size(THETA1));
T2_DELTA = zeros(size(THETA2));
m = 0;

Remember that THETA1 and THETA2 are matrices, so we need to initialize every element of the matrix, and make our delta matrices the same size. We’ll also use “m” to record the number of training examples that we present to the network in the epoch.

Now let’s set up a loop to present those training examples to the network one by one:

for i = 1:rows(XOR)

This simply says repeat the following block for the same number of rows that exist in our XOR data set.

Now put in the code from Part 1 that processes an input:

A1 = [1; XOR(i,1:2)'];
Z2 = THETA1 * A1;
A2 = [1; sigmoid(Z2)];
Z3 = THETA2 * A2;
h = sigmoid(Z3);
J = J + ( XOR(i,3) * log(h) ) + ( (1 - XOR(i,3)) * log(1 - h) );
m = m + 1;

Note the slight change in moving the bias node from the layer input calculation (Z3) to the output from the previous layer (A2). This just makes the code slightly simpler.

Then we add the code from Part 2, inside a test to see if we are in learning mode. The code has been slightly modified as we are implementing batch learning rather than online learning. In batch mode, we want to calculate the errors across all the examples:

if (learn == 1)
    delta3 = h - XOR(i,3);
    delta2 = ((THETA2' * delta3) .* (A2 .* (1 - A2)))(2:end);
    T2_DELTA = T2_DELTA + (delta3 * A2');
    T1_DELTA = T1_DELTA + (delta2 * A1');
else
    disp('Hypothesis for '), disp(XOR(i,1:2)), disp('is '), disp(h);
endif

If we’re not learning from this example, then we simply display the hypothesis for this particular example.

That’s the end of the loop, so we close the block in Octave with:

endfor


Now we calculate the average cost across all the examples:

J = J / -m;

This gives us a useful guide to see if the cost is reducing each time we run the function (which it should!).

Now we’ll update the weights in the network, remembering to divide by the number of training examples as we are in batch mode:

if (learn==1)
    THETA1 = THETA1 - (alpha * (T1_DELTA / m));
    THETA2 = THETA2 - (alpha * (T2_DELTA / m));
else
    disp('J: '), disp(J);
endif

Lastly, we’ll set the return values as our new updated weights:

THETA1_new = THETA1;
THETA2_new = THETA2;

And end the function:

endfunction


So that’s all that’s required to run all the training examples through the network. If you run the function a few times, you should see the cost of the network reducing as we expect it to:


If you run the function as shown, you’ll see that the cost (J) reduces each time, but not by a lot. It would be a bit boring to have to run the function each time, so let’s set up a short script to run it many times:

XOR = [0,0,0; 0,1,1; 1,0,1; 1,1,0];
THETA1 = 0;
THETA2 = 0;

[THETA1, THETA2] = xor_nn(XOR, THETA1, THETA2, 1, 1, 0.01);
for i = 1:100000
    [THETA1, THETA2] = xor_nn(XOR, THETA1, THETA2, 0, 1, 0.01);
    if (mod(i,1000) == 0)
        disp('Iteration : '), disp(i)
        [THETA1, THETA2] = xor_nn(XOR, THETA1, THETA2);
    endif
endfor

There’s not a lot here that’s new. The first call to the xor_nn function initialises the weights. The loop calls the function 100,000 times, printing out the cost every 1000 iterations. That may seem like a lot of function calls, but remember that the network weights are being adjusted by a small amount each time.

If you run that script, you should see something like this as the output:


When the loop gets to the end, the output will be close to this (the exact numbers may be different):


As you can see, the network guesses small numbers (close to 0) for the first and last XOR examples and high (close to 1) for the two middle examples. This is close to what we want the network to do, so we’ve successfully trained this particular network to recognise the XOR function.

If you wish to download the code directly, it’s available here on GitHub:

There are a lot of things that we should do to make sure that the algorithm is optimised, including checking that the default learning rate of 0.01 is actually the right rate. But that is a job for another day.

Now that you’ve seen a simple neural network recognizing patterns and learning from examples, you can implement your own in Octave for other more interesting problems.

A Simple Neural Network in Octave – Part 2

In the last post in this short series, we looked at how to build a small neural network to solve the XOR problem. The network we built can generate an output for any of the inputs that we gave it from the table. It’s most likely that it sometimes gets the wrong answer, because the weights on the links were randomly generated.

Now we’ll look at getting the network to learn from its mistakes so that it gets better each time. In the context of neural networks, learning means adjusting those weights so it gets better and better at giving the right answers.

The first step is to find out how wrong the network is. This is called the cost or loss of the network. There are several ways to do this, from the simple to the complex. A common approach is to use the cost function from “logistic regression”, which avoids problems of the neural network getting stuck in its learning.

Recall from our XOR table that there are two possible outputs, either 1 or 0. There will be two different versions of the cost depending on the output:

cost(hθ(x), y) = -log(hθ(x))        if y = 1
cost(hθ(x), y) = -log(1 - hθ(x))    if y = 0

Remember that hθ(x) is the hypothesis that the network produces, given the input x=(x1,x2) and the weights θ. “log” is the standard mathematical function that calculates the natural logarithm of the value given. (The name “logistic regression” comes from the logistic, or sigmoid, function rather than from the logarithm.)

We can combine these two instances together in a single formula that makes it easier and faster to translate into code:

J(θ) = -[ y * log(hθ(x)) + (1 - y) * log(1 - hθ(x)) ]

So our code to represent this:

y = 0;
J = ((y * log(h)) + ((1 - y) * log(1 - h))) * -1 ;

This gives us a good idea how the network is performing on this particular input, with y set to the expected output from the first row of the XOR table.
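For comparison, the same calculation can be written as a small Python function. This is just a sketch: the function name cost and the sample hypothesis values are assumptions chosen for illustration.

```python
import math

def cost(h, y):
    """Cost of a single example: h is the hypothesis, y is the expected output (0 or 1)."""
    return -((y * math.log(h)) + ((1 - y) * math.log(1 - h)))

# A confident, correct guess is cheap; a confident, wrong guess is expensive
print("h = 0.1, y = 0:", cost(0.1, 0))
print("h = 0.9, y = 0:", cost(0.9, 0))

# An undecided guess costs the same whichever output was expected
print("h = 0.5, y = 0:", cost(0.5, 0))
print("h = 0.5, y = 1:", cost(0.5, 1))
```

A hypothesis of 0.5 always costs the natural log of 2 (about 0.693), and the cost grows without limit as the hypothesis approaches the wrong extreme.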

A better network will have a lower cost, so our goal in training the network is to minimize the cost. The algorithm used to do that is called the back propagation algorithm (or backprop for short).

The first step is to evaluate the error at the output node (layer 3):

δ(3) = hθ(x) - y


Note that the superscript 3 refers to the layer and doesn’t mean cubed! The code in Octave is:

delta3 = h - y;

So far, so straightforward. Backprop takes the error at a particular layer within a network and propagates it backward through the network. In general, the formula for that is:

nn node backprop calc

This looks a little complicated so let’s dissect it to make it easier to understand and translate into Octave code.

Firstly, the l superscript on some of the terms in the formula refers to the layer in the network. The next layer we need to work out is layer 2, so l = 2. We already know what δ3 is from the previous code above.

Next, Θ(l-1) refers to the matrix of weights at the layer l-1. The T superscript means to transpose the matrix (basically swap the rows and columns). This makes it easier for us to multiply the two matrices. Octave being what it is makes this very easy for us, so the first part of the formula is easy to translate:

THETA2' * delta3

Note the apostrophe after THETA2. This is Octave’s way of transposing a matrix.

The second part of the formula represents the derivative of the activation function. Fortunately, there is an easy way to calculate that without getting into the complexity of calculus:


The “.*” means element-wise multiplication of a matrix. So each term in the first matrix is multiplied by the corresponding term in the second matrix. In Octave, it’s almost exactly the same (for layer 2). Note that the derivative is written in terms of the layer’s output A2 (the sigmoid of Z2, plus the bias node), not the raw input Z2:

A2 .* (1 - A2)

Putting those two parts together then:

delta2 = ((THETA2' * delta3) .* (A2 .* (1 - A2)))(2:end);

Remember that these are matrix operations, so delta2 will be a matrix too. The last part of the line of code (2:end) means to take the second element of the matrix to the end. The reason this is done is because we are not propagating an error back from the bias node.

At this point, we have the error in the output at layer 3 and layer 2. Since layer 1 is the input layer, there isn’t any error there, so we don’t need to calculate anything.
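One way to gain some confidence in these error terms is to compare the gradients they imply against a numerical estimate. The Python sketch below is an illustration rather than part of the Octave code: forward, cost and backprop are hypothetical helper names, the weights are random with an arbitrary seed, and the numerical estimate is a central finite difference.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(theta1, theta2, x):
    a1 = np.concatenate(([1.0], x)).reshape(3, 1)       # bias + two inputs
    a2 = np.vstack(([[1.0]], sigmoid(theta1 @ a1)))     # bias + hidden outputs
    h = sigmoid(theta2 @ a2)                            # 1x1 hypothesis
    return a1, a2, h

def cost(theta1, theta2, x, y):
    h = forward(theta1, theta2, x)[2][0, 0]
    return -(y * np.log(h) + (1 - y) * np.log(1 - h))

def backprop(theta1, theta2, x, y):
    a1, a2, h = forward(theta1, theta2, x)
    delta3 = h - y                                          # error at the output
    delta2 = ((theta2.T @ delta3) * (a2 * (1 - a2)))[1:]    # drop the bias row
    return delta2 @ a1.T, delta3 @ a2.T                     # gradients for theta1, theta2

rng = np.random.default_rng(1)   # arbitrary seed
theta1 = 2 * rng.random((2, 3)) - 1
theta2 = 2 * rng.random((1, 3)) - 1
x, y = np.array([0.0, 1.0]), 1.0

grad1, grad2 = backprop(theta1, theta2, x, y)

# Nudge each weight in theta1 up and down and watch how the cost changes
eps = 1e-6
num1 = np.zeros_like(theta1)
for i in range(theta1.shape[0]):
    for j in range(theta1.shape[1]):
        plus, minus = theta1.copy(), theta1.copy()
        plus[i, j] += eps
        minus[i, j] -= eps
        num1[i, j] = (cost(plus, theta2, x, y) - cost(minus, theta2, x, y)) / (2 * eps)

print("largest difference for theta1:", np.max(np.abs(grad1 - num1)))
```

If the backprop formulas are right, the difference printed at the end should be tiny. The same check can be repeated for theta2.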

The last step in the backprop is to adjust the weights now that we know those errors. This is also quite simple:

weight adj

The formula says that the change to the weights Θ(l) is the activation values multiplied by the errors, adjusted by the value α, which represents the learning rate. The learning rate governs how large the adjustments to the weights are. It’s a bit of a misnomer in that it doesn’t relate to how “fast” the network learns, although that can be the effect of having a relatively larger value. For solving the XOR problem, a learning rate of 0.01 is sufficient.

So in Octave:

THETA2 = THETA2 - (0.01 * (delta3 * A2'));
THETA1 = THETA1 - (0.01 * (delta2 * A1'));

Once we’ve adjusted the weights in the matrices, the network will give us a slightly different hypothesis for each of our inputs. We can recalculate the cost function above to see how much improvement there is.

If we repeat the calculations, the cost should start reducing and the network will get better and better at producing the desired output.
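To see that effect without typing the Octave lines in repeatedly, here is a minimal Python sketch that applies the same update several hundred times to a single example. The seed, the number of steps and the choice of the (0, 0) row are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)   # arbitrary seed
THETA1 = 2 * rng.random((2, 3)) - 1
THETA2 = 2 * rng.random((1, 3)) - 1

A1 = np.array([[1.0], [0.0], [0.0]])   # bias plus the (0, 0) row of the table
y = 0.0                                # expected output for that row
alpha = 0.01                           # learning rate

costs = []
for step in range(500):
    # Forward pass
    A2 = np.vstack(([[1.0]], sigmoid(THETA1 @ A1)))
    h = sigmoid(THETA2 @ A2)
    costs.append(float(-(y * np.log(h[0, 0]) + (1 - y) * np.log(1 - h[0, 0]))))

    # Backprop and weight adjustment
    delta3 = h - y
    delta2 = ((THETA2.T @ delta3) * (A2 * (1 - A2)))[1:]
    THETA2 = THETA2 - alpha * (delta3 @ A2.T)
    THETA1 = THETA1 - alpha * (delta2 @ A1.T)

print("first cost:", costs[0])
print("last cost: ", costs[-1])
```

Training on one example only teaches the network that one answer, of course. It takes all four rows of the table to learn XOR.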


In the next post, we’ll pull all the code together and see what happens when we start training the network across all the examples.

The last part of this series is here.

A Simple Neural Network In Octave – Part 1

Getting started with neural networks can seem to be a daunting prospect, even if you have some programming experience. The many examples on the Internet dive straight into the mathematics of what the neural network is doing or are full of jargon that can make it a little difficult to understand what’s going on, not to mention how to implement it in actual code.

So I decided to write a post to help myself understand the mechanics and it turns out that it will require a few parts to get through it!

Anyway, to start with, there is a great free numerical computation package called Octave that you can use to play around with Machine Learning concepts. Octave itself does not know about neural networks, but it does know how to do fast matrix multiplication. This important feature of Octave will be made clear later.

You can find out more and download the Octave software here:

A nice toy problem to start with is the XOR problem. XOR means “exclusive OR” and it is best explained in a table:

Given this input    Produce this output
x1    x2            y
0     0             0
0     1             1
1     0             1
1     1             0

What the table shows is that there are two inputs (labelled x1 and x2) and one output (labelled y). When x1 and x2 are both set to 0, the output we expect is also 0. Similarly, when x1 and x2 are both set to 1, the output is also 0. However, when x1 and x2 are set to different inputs, then the output will be 1.

The challenge is to build a neural network that can successfully learn to produce the correct output given the four different inputs in the table.

Let’s have a quick look at a graphical representation of the problem:

xor graph

The graph shows the two inputs x1 and x2 on their respective axes. Where x1 and x2 have the same value, the graph shows yellow circles and similarly where x1 and x2 are different, the graph shows blue circles.

There’s an important constraint that the graph shows clearly. It isn’t possible to draw a single straight line across the graph so that the yellow circles are on one side and the blue circles are on the other side. When such a line can be drawn, the problem is said to be “linearly separable”. The XOR problem is not linearly separable, which means that we are going to need a multi-layer neural network to solve it.
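If you’d like to convince yourself of this numerically, the Python sketch below tries a grid of candidate straight lines and counts how many of the four points each one classifies correctly. It is a brute-force illustration, not a proof: the grid range and step and the helper name best_line_score are assumptions. The OR function is included for contrast, since it is linearly separable.

```python
import itertools

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
targets = {
    "OR": np.array([0, 1, 1, 1]),
    "XOR": np.array([0, 1, 1, 0]),
}

def best_line_score(y):
    """Most points that any line w0 + w1*x1 + w2*x2 > 0 on the grid gets right."""
    grid = np.arange(-2.0, 2.25, 0.25)
    best = 0
    for w0, w1, w2 in itertools.product(grid, repeat=3):
        predictions = (w0 + w1 * X[:, 0] + w2 * X[:, 1] > 0).astype(int)
        best = max(best, int(np.sum(predictions == y)))
    return best

for name, y in targets.items():
    print(name, "best single line gets", best_line_score(y), "of 4 correct")
```

For OR, some line gets all four points right. For XOR, the best any single line manages is three out of four.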

The diagram below shows a typical configuration for a neural network that can be trained to solve the XOR problem.


There are a number of things to note about this particular network. Firstly, the inputs in the table above (x1 and x2) are mapped directly onto the nodes represented by a1 and a2. Secondly, this first layer of nodes also contains a bias node that has its output always set to +1. Thirdly, the nodes in the middle of the diagram are collectively called the hidden nodes or hidden layer. It also contains a bias node set to +1. The outputs of the other nodes are labelled with a small Greek letter sigma, which will become clearer below. Lastly, the output of the network is labelled h.

It’s useful to represent the inputs as a vector (a one-dimensional matrix) that looks like this:


This can be translated directly into Octave as:

A1 = [1; 0; 0];

In the above example, 1 represents the bias node, and the two zeros represent the first row of the variables from our table above replacing the a1 and a2 in the vector. You can replace the two zeros with values from other rows on the table to see what happens to the output after we’ve built up the network.

The links from the nodes in the first layer to the nodes in the second layer have weights associated with them, denoted by the letter theta (along with a superscript 1) in the diagram of our network. Similarly, the weights in layer 2 are also shown with a superscript 2.

The subscript numbers identify the nodes at either end of the link, in the form (i,j), where i is the node receiving the signal and j is the node sending the signal. This is slightly counter-intuitive, where the normal expectation is that the signal moves from i to j. The reason for this is that it is possible to represent all the weights at a given layer in a single matrix that looks like this:


Typically, the initial values of the weights in a network are set to random values between -1 and +1, since we have no idea what they actually should be. We can do this in Octave as follows:

THETA1 = 2*rand(2,3) - 1;

And for Θ2, which has 3 nodes linked to the one final node:

THETA2 = 2*rand(1,3) - 1;

So now we can simply multiply those matrices together to work out what the input to the second layer is:


where Z2 represents the input to the second layer of nodes. This multiplication will result in another vector (1 dimensional matrix). Our Octave code for this is simply:

Z2 = THETA1 * A1;

This is a lot better (and faster) than having to calculate each of these inputs separately. In fact, most machine learning libraries will provide fast matrix multiplication (and other matrix operations), precisely because it is an efficient way to model machine learning strategies.
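To make the comparison concrete, here is a short Python sketch that works out the input to each hidden node separately and then again with a single matrix multiplication. The variable names mirror the Octave code, while the seed and the choice of input row are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)          # arbitrary seed
THETA1 = 2 * rng.random((2, 3)) - 1     # random weights in the range [-1, 1)
A1 = np.array([[1.0], [0.0], [1.0]])    # bias, x1 = 0, x2 = 1

# The input to each hidden node, calculated one weight at a time
Z2_loop = np.zeros((2, 1))
for i in range(2):        # node receiving the signal
    for j in range(3):    # node sending the signal (including the bias)
        Z2_loop[i, 0] += THETA1[i, j] * A1[j, 0]

# The same calculation as one matrix multiplication
Z2 = THETA1 @ A1

print(np.allclose(Z2, Z2_loop))
```

The loop indices also show why the weights are indexed as (i,j), with i as the receiving node: row i of the matrix holds every weight feeding into node i.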

To calculate the output of layer 2, we must apply a function to the input. The typical function used is called a sigmoid function (represented by the sigma in the network diagram) and it looks like this in Octave:

function [result] = sigmoid(x)
    result = 1.0 ./ (1.0 + exp(-x));

So the output of layer 2 is the sigmoid of the input, or


which in Octave is:

A2 = [1; sigmoid(Z2)];

Note that we add the extra 1 as the first element in the vector to represent the bias that will be needed as an input into layer 3.

We repeat the process for layer 3, multiplying the output of layer 2 by the matrix of weights for layer 2 to get the input for layer 3 and then getting the sigmoid of the result:



The output from the network is then a single value, called our hypothesis (h). This is the network’s guess at the output given its input. The Octave code for this is:

Z3 = THETA2 * A2;
h = sigmoid(Z3);

That’s the network fully constructed. We can put any values from the table in the front (putting them in the A1 vector) and see what the output from the network is (Hypothesis).
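For readers who prefer Python, the whole forward pass can be sketched as a single function using NumPy. The function name predict and the seed are assumptions made for illustration. As the weights are random, the outputs will be no better than guesses.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict(THETA1, THETA2, x1, x2):
    """Run one input pair through the untrained network and return h."""
    A1 = np.array([[1.0], [x1], [x2]])        # bias plus the two inputs
    Z2 = THETA1 @ A1                          # input to the hidden layer
    A2 = np.vstack(([[1.0]], sigmoid(Z2)))    # bias plus the hidden outputs
    Z3 = THETA2 @ A2                          # input to the output node
    return sigmoid(Z3)[0, 0]

rng = np.random.default_rng(4)   # arbitrary seed
THETA1 = 2 * rng.random((2, 3)) - 1
THETA2 = 2 * rng.random((1, 3)) - 1

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), "->", predict(THETA1, THETA2, x1, x2))
```

Every output lies strictly between 0 and 1 because the final step is a sigmoid.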

Here is an example of the above code running:


It is almost certain that the network will get the wrong answer (outputting a 1 when it should be outputting a 0 and vice versa). The example above shows h to be 0.31328 (you may get a completely different value), which is clearly wrong, as for a (0,0) input, we should get an output of 0.

In the next post in this series, we’ll look at how to get the network to learn from its mistakes and how it can get much better at outputting correct values.

Here is the next part of this series

Poor use for IBM’s AI Watson: Predicting Popular Sale Items

IBM’s Watson computer, Yorktown Heights, NY – Clockready


Last week consumers indulged themselves in a frenzy of buying in the so-called Black Friday and Cyber Monday sales. To go with this, IBM’s Watson published a list of the “hottest 100 consumer electronics”.


The highlights were pretty obvious: Star Wars toys, Hoverboards and Nike Shoes. Nothing particularly insightful there beyond that provided by regular data mining.

This is a poor example of Artificial Intelligence. Compared with advances such as autonomous cars and virtual personal assistants, this example pales and doesn’t really add to our knowledge of how to build intelligent machines. It’s a shame, as Watson is being used in far more interesting areas such as healthcare, which could deliver many more benefits than simply pointing out the obvious.

via IBM News room – 2015-11-30 IBM’s Watson Predicts Cyber Monday’s Top Products and Trends – United States.

Playing Go probably won’t lead to better AI

The traditional Chinese board game Go. (Wikipedia)

Go is a traditional Chinese game that is played by two players and involves placing stones on a board in order to capture the opponent’s stones.

Although the rules are quite simple, and there is perfect information, the strategies involved are quite complex. The very best software available today that plays this game is not able to beat professional players for many reasons, including the number of possible moves and the large size of the board (compared to chess).

Now Demis Hassabis, who was behind DeepMind (purchased by Google in 2014), has hinted that his team has managed to get a machine to play the board game Go. Previously, the team has demonstrated an AI learning to play old video games like Breakout. The algorithms have learned to play without help better than most humans, so this team knows a thing or two about designing algorithms to learn to play games.

I’m a little skeptical though about how much conquering Go will advance the field of AI. Back in the 1990s, much excitement and publicity was generated when IBM’s Deep Blue managed to beat the reigning world chess champion, Garry Kasparov. However, this feat was more down to processing power and Deep Blue’s ability to evaluate millions of positions per second than any real advance in machine intelligence.

Certainly Kasparov didn’t play that way and so Deep Blue didn’t really give any great insights into human intelligence. While it may be interesting, beating a human at Go may well turn out to be another example of Artificial Narrow Intelligence.

via Google DeepMind Founder Says AI Machines Have Beat Board Game Go | Re/code.


92% claim an understanding of Artificial Intelligence

It’s no surprise that the vast majority of people claim to understand what Artificial Intelligence is. Popular culture has been a key driver of this coupled with the recent debates on the potential benefits and threats of AI, and the visible successes of Google, Facebook and Apple.

However, there is a disconnect between what the latter have achieved and the portrayal in cinema and literature. Almost all modern AI is specifically focused on particular tasks. This is known as Artificial Narrow Intelligence. It is very effective in doing one thing, but completely useless at anything else. Imagine asking Siri to drive your car for you and you’ll get an understanding of how narrow the intelligence is.

By contrast, AI in movies or even that being discussed as an existential threat is Artificial General Intelligence. Such AGIs would be capable of independent action, motivation and autonomy. This type of AI does not exist to any great extent today and in fact seems to be as far away as when Alan Turing wrote about it over 60 years ago.

There is a good chance that the survey respondents are mistaken in their interpretation of Artificial Intelligence. Unfortunately, this mistake may turn into disillusionment when they find out just how far we have to go before there is a real HAL 9000.

Source: Don’t Be Fooled By Artificial Conventional Wisdom About Artificial Intelligence | Dominic Trigg

Machine Learning for the masses: Google’s TensorFlow

The Artificial Intelligence community was abuzz recently with the news that Google has open-sourced its machine learning framework, called TensorFlow. This system was created by the Google Brain Team, working in its Machine Intelligence Research group.

This is not the first open source machine learning framework. Within the Python environment in particular, there are frameworks such as scikit-learn, PyBrain and others that have been around for a good while. What’s different about this new framework is that it has the backing of one of the most advanced commercial machine learning organisations, Google. In committing the project to open-source, it is inviting researchers, commercial practitioners and hobbyists to contribute to the framework. With Google’s backing, it seems destined for a long life.

But back to today. The framework has both Python and C++ APIs, with the expectation that C++ will be slightly faster on certain tasks. The instructions for installing TensorFlow are straightforward, but immediately I ran into a problem. My (slightly ageing) MacBook was running Python 2.7.5 and running TensorFlow caused a segmentation fault. Updating to Python 2.7.10 fixed the problem and I was able to successfully run through some of the tutorials.

There seems to be a wide range of neural network capabilities already available within the framework which provides much opportunity for exploration and experimentation. The tutorials cover areas such as handwriting recognition, image classification (using convolutional neural networks) and language modelling (using recurrent neural networks).

What’s also interesting is that since it’s an open source framework, the underlying code behind all these machine learning techniques is available for anyone to download, examine, modify and improve.

What the long-term impact of this will be is hard to tell. However, it is clear that Google has already put quite a bit of effort into this framework, and now that it’s out in the open, there will be lots more improvement to come.

If you want to know more and perhaps even try it out yourself, you can download TensorFlow here.

Robotics and AI could threaten up to 35% of jobs in UK

Clongriffin Railway Station In North Dublin

Who’ll be going to work in 20 years?
(Photo: William Murphy on Flickr. Licensed under Creative Commons.)

A new report from Bank of America Merrill Lynch has added to the recent spate of analysis predicting a massive impact on work and jobs by robotics and artificial intelligence (as reported in the Guardian).

They estimate that up to 35% of all jobs in the UK (47% in the US) are at risk of displacement by technology within 20 years. This is going to cause a huge shift in the type of work that people can expect to do in the future. It has important implications for education policy, jobs and economic growth. In addition, it is incumbent on politicians and policy makers to ensure that the benefits from increased automation are widely distributed.

A common counterpoint is that by eliminating some jobs, technology creates other jobs. However, the authors note:

“The trend is worrisome in markets like the US because many of the jobs created in recent years are low-paying, manual or services jobs which are generally considered ‘high risk’ for replacement,” the bank says.

While 20 years may seem far into the future, children born this year will just be entering the workforce then. They may be faced with not having any jobs to look forward to.

via Robot revolution: rise of ‘thinking’ machines could exacerbate inequality | Technology | The Guardian.