Solving XOR with a Neural Network in TensorFlow

The tradition of writing a trilogy in five parts has a long and noble history, pioneered by the great Douglas Adams in the Hitchhiker’s Guide to the Galaxy. This post is no exception and follows from the previous four looking at a Neural Network that solves the XOR problem.

This time, I had a interest in checking out Google’s machine learning system, TensorFlow.

Last November, Google open sourced the machine learning library that it uses within their own products. There are two API’s; one for Python and one for C++.  Naturally, it makes sense to see what TensorFlow would make of the same network that we previously looked at and compare both Python-based neural networks.

You can download the TensorFlow library and see some of the tutorials here.

It’s worth noting that TensorFlow requires a little mental agility to understand. Essentially. the first few lines of code set up the inputs, the network architecture, the cost function, and the method to use to train the network. Although these look like the same steps as the steps in Python or Octave, they don’t in fact do anything. This is because TensorFlow considers those to be the model to use, running them only within a session. This model is in the form of a directed graph. Don’t worry if that’s not too clear yet.

TensorFlow Code

To start with, we need to load in the TensorFlow library:

import tensorflow as tf

The next step is to set up placeholders to hold the input data. TensorFlow will automatically fill them with the data when we run the network. In our XOR problem, we have four different training examples and each example has two features. There are also four expected outputs, each with just one value (either a 0 or 1). In TensorFlow, this looks like this:

x_ = tf.placeholder(tf.float32, shape=[4,2], name="x-input")
y_ = tf.placeholder(tf.float32, shape=[4,1], name="y-input")

I’ve set up the inputs to be floating point numbers rather than the more natural integers to avoid having to cast them to floating points when multiplying the weights later on. The shape parameter tells the placeholder what the dimensions are of data we’ll be passing in.

The next step is to set up the parameters for the network. These are called Variables in TensorFlow.  Variables will be modified by TensorFlow during the training steps.

Theta1 = tf.Variable(tf.random_uniform([2,2], -1, 1), name="Theta1")
Theta2 = tf.Variable(tf.random_uniform([2,1], -1, 1), name="Theta2")

For our Theta matrices, we want them initialized to random values between -1 and +1, so we use the built-in random_uniform function to do that.

In TensorFlow, we set up the bias nodes separately, but still as Variables. This let’s the algorithms modify the values of the bias node. This is mathematically equivalent to having a signal value of 1 and initial weights of 0 on the links from the bias nodes.

Bias1 = tf.Variable(tf.zeros([2]), name="Bias1")
Bias2 = tf.Variable(tf.zeros([1]), name="Bias2")

Now we set up the model. This is pretty much the same as that outlined in the previous posts on Python and Octave:

A2 = tf.sigmoid(tf.matmul(x_, Theta1) + Bias1)
Hypothesis = tf.sigmoid(tf.matmul(A2, Theta2) + Bias2)

Here, matmul is TensorFlow’s matrix multiplication function, and sigmoid naturally is the sigmoid calculation function.

As before, our cost function is the average over all the training examples:

cost = tf.reduce_mean(( (y_ * tf.log(Hypothesis)) + 
        ((1 - y_) * tf.log(1.0 - Hypothesis)) ) * -1)

So far, that has been relatively straightforward. Let’s look at training the network.

TensorFlow ships with several different training algorithms, but for comparison purposes with our previous implementations, we’re going to use the gradient descent algorithm:

train_step = tf.train.GradientDescentOptimizer(0.01).minimize(cost)

What this statement says is that we’re going to use GradientDescentOptimizer as our training algorithm, the learning rate (alpha from before) is going to be 0.01 and we want to minimize the cost function above. This means that we don’t have to implement our own algorithm as we did in the previous examples.

That’s all there is to setting up the network. Now we just have to go through a few initialization steps before running the examples through the network:

XOR_X = [[0,0],[0,1],[1,0],[1,1]]
XOR_Y = [[0],[1],[1],[0]]

init = tf.initialize_all_variables()
sess = tf.Session()

As I mentioned above, TensorFlow runs a model inside a session, which it uses to maintain the state of the variables as they are passed through the network we’ve set up. So the first step in that session is to initialise all the Variables from above. This step allocates values to the various Variables in accordance with how we set them up (i.e. random numbers for Theta and zeros for Bias).

The next step is to run some epochs:

for i in range(100000):, feed_dict={x_: XOR_X, y_: XOR_Y})

Each time the training step is executed, the values in the dictionary feed_dict are loaded into the placeholders that we set up at the beginning. As the XOR problem is relatively simple, each epoch will contain the entire training set. To see what’s going on inside the loop, just print out the values of the Variables:

if i % 1000 == 0:
        print('Epoch ', i)
        print('Hypothesis ',, feed_dict={x_: XOR_X, y_: XOR_Y}))
        print('Theta1 ',
        print('Bias1 ',
        print('Theta2 ',
        print('Bias2 ',
        print('cost ',, feed_dict={x_: XOR_X, y_: XOR_Y}))

That’s it. If you run this in Python, you’ll get something that looks like this after 99000 epochs:


As you can see in the display for the Hypothesis variable, the network has learned to output nearly correct values for the inputs.


The TensorFlow Graph

I mentioned above that the model was in the form of a directed graph. TensorFlow let’s us see what that graph looks like:


We can see that our inputs x-input and y-input are the starts of the graph, and that they flow through the processes at layer2 and layer3, ultimately being used in the cost function.

To see the graph yourself, TensorFlow includes a utility called TensorBoard. Inside your code, before the statement add the following line:

writer = tf.train.SummaryWriter("./logs/xor_logs", sess.graph_def)

The folder in the quotes can point to a folder on your machine where you want to store the output. Running TensorBoard is then as simple as entering this at a command prompt:

$ tensorboard --logdir=/path/to/your/log/file/folder

In your browser, enter http://localhost:6006 as the URL and then click on Graph tab. You will then see the full graph.

So that’s an example neural network in TensorFlow.

There is one thing I did notice in putting this together: it’s quite slow. In Octave, I was able to run 10,000 epochs in about 9.5 seconds. The Python/NumPy example was able to run in 5.8 seconds. The above TensorFlow example runs in about 28 seconds on my laptop.

TensorFlow is however built to run on GPUs and the model can be split across multiple machines. This would go some way to improving performance, but there is a way to go to make it comparable with existing libraries.

UPDATE 26/01/2016:

I’ve placed the full code up on GitHub here:


Update 07/03/2016:

I’ve changed the GitHub code to run a more realistic 100,000 epochs. Should converge now. [Thanks Konstantin!]


Solving XOR with a Neural Network in Python

In the previous few posts, I detailed a simple neural network to solve the XOR problem in a nice handy package called Octave.

I find Octave quite useful as it is built to do linear algebra and matrix operations, both of which are crucial to standard feed-forward multi-layer neural networks. However, it isn’t that fast and you would not be building any deep-learning models on large datasets using it.

Coding in Python

There is also a numerical operation library available in Python called NumPy. This library has found widespread use in building neural networks, so I wanted to compare a similar network using it to a network in Octave.

The last post showed an Octave function to solve the XOR problem. Recall the problem was that we wanted to have a neural network correctly generate an output of zero when x1 and x2 are the same (the yellow circles) and output of one when x1 and x2 are different (the blue circles):

xor graph

Here is the topology of the network we want to train:


Lastly, here is a function in Python that is equivalent to the Octave xor_nn function. The code also includes a sigmoid function:

import numpy as np
import math

def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

def xor_nn(XOR, THETA1, THETA2, init_w=0, learn=0, alpha=0.01):
        if init_w == 1:
                THETA1 = 2*np.random.random([2,3]) - 1
                THETA2 = 2*np.random.random([1,3]) - 1

        T1_DELTA = np.zeros(THETA1.shape)
        T2_DELTA = np.zeros(THETA2.shape)
        m = 0
        J = 0.0

        for x in XOR:
                A1 = np.vstack(([1], np.transpose(x[0:2][np.newaxis])))
                Z2 =, A1)
                A2 = np.vstack(([1], sigmoid(-Z2)))
                Z3 =, A2)
                h = sigmoid(-Z3)

                J = J + (x[2] * math.log(h[0])) + ((1 - x[2]) * math.log(1 - h[0]));
                m = m + 1;
                if learn == 1:
                        delta3 = h - x[2]
                        delta2 = (, delta3) * (A2 * (1 - A2)))[1:]
                        T2_DELTA = T2_DELTA +, np.transpose(A2))
                        T1_DELTA = T1_DELTA +, np.transpose(A1))

        J = J / -m

        if learn == 1:
                THETA1 = THETA1 - (alpha * (T1_DELTA / m))
                THETA2 = THETA2 - (alpha * (T2_DELTA / m))

        return (THETA1, THETA2)

(This code is available on Github if you want to download it: Python NN on GitHub)

If you want more detail on how this function works, have a look back at Part 1, Part 2 and Part 3 of the series on the Octave version.

Comparing Python and Octave

To be sure that they both operate identically, I first generated some random numbers. These numbers were used to initialize the parameters (THETA1 and THETA2) in both functions. If you run several epochs (I ran 1000), then you will see that the values of THETA1 and THETA2 remain identical in Octave and Python. This makes sense as the only non-deterministic part of the algorithm is the initialization of the network’s parameters (i.e. the weights).

So we can be sure that both functions are executing the same algorithm.

The next step is to test how fast the networks run a large number of epochs. On my (ageing) MacBook, the Octave code runs 1000 epochs in about 9.5 seconds on average, while the Python code runs the same number in just 5.4 seconds. This is a pretty good performance improvement for what is practically the same code.

So if you are familiar with Python and want to start developing your own neural networks, then NumPy will give you the tools you need.



A Simple Neural Network in Octave – Part 3

This is the final post in a short series looking at implementing a small neural network to solve the XOR problem in Octave.

In the first post, we looked at how to set up the parameters of the network (the weights between the nodes), feed in an example and get the network to predict some output. The second post looked at implementing the back propagation algorithm, which adjusts those parameters in the network to improve the accuracy of it’s outputs.

Now it’s time to pull those pieces together so that we can let the network adjust the parameters to get as close to the right output as we desire.

First, we’ll set up a function to run all the training examples through the network. Presenting all the training examples once (one after the other) is called an epoch. The goal of this function will be to return the updated parameters after an epoch:

function [THETA1_new, THETA2_new] = xor_nn(XOR, THETA1, THETA2, init_w=0, learn=0, alpha=0.01)

Just a quick note on the notation and parameters in this function header. The first part after the word “function” specifies what this function returns. In our case, we want new versions of the parameters in the network (represented by THETA1 and THETA2).

The name of the function is “xor_nn” and the parameters to the function are contained within the brackets:

  • XOR This is the representation of the training set.
  • THETA1 & THETA2 Current values for the parameters in the network.
  • init_w=0 This tells the function to initialise the weights in the network
  • learn=0 This tells the network to learn from the examples (see below)
  • alpha=0.01 This is the learning rate (default value is 0.01)

Note that any parameter that has an “=” sign is optional and will get the default value shown if it is not provided explicitly.

The first step in our function is to check if we need to initialize the weights.

if (init_w == 1)
    THETA1 = 2*rand(2,3) - 1;
    THETA2 = 2*rand(1,3) - 1;

This is simply the same code as before, inside an “if” block.

Now we initialize the cost variable. This is done for every epoch:

J = 0.0;

In the previous post, we looked at updating the weights in the network after calculating the cost after the network has processed a single training example. This is a specific type of learning called “online learning”. There is another type of learning called “batch learning”, and this works by updating the weights once after all the training examples have been processed. Let’s implement that instead in our function.

We need to record the number of training examples and the total delta across all the training examples (rather than just across one as before). So here are the extra variables required:

T1_DELTA = zeros(size(THETA1));
T2_DELTA = zeros(size(THETA2));
m = 0;

Remember that THETA1 and THETA2 are matrices, so we need to initialize every element of the matrix, and make our delta matrices the same size. We’ll also use “m” to record the number of training examples that we present to the network in the epoch.

Now lets set up a loop to present those training examples to the network one by one:

for i = 1:rows(XOR)

This simply says repeat the following block for the same number of rows that exist in our XOR data set.

Now put in the code from Part 1 that processes an input:

A1 = [1; XOR(i,1:2)'];
Z2 = THETA1 * A1;
A2 = [1; sigmoid(Z2)];
Z3 = THETA2 * A2;
h = sigmoid(Z3);
J = J + ( XOR(i,3) * log(h) ) + ( (1 - XOR(i,3)) * log(1 - h) );
m = m + 1;

Note the slight change in moving the bias node from the layer input calculation (Z3) to the output from the previous layer (A2). This just makes the code slightly simpler.

Then we add the code from Part 2, inside a test to see if we are in learning mode. The code has been slightly modified as we are implementing batch learning rather than online learning. In batch mode, we want to calculate the errors across all the examples:

if (learn == 1)
    delta3 = h - XOR(i,3);
    delta2 = ((THETA2' * delta3) .* (A2 .* (1 - A2)))(2:end);
    T2_DELTA = T2_DELTA + (delta3 * A2');
    T1_DELTA = T1_DELTA + (delta2 * A1');
    disp('Hypothesis for '), disp(XOR(i,1:2)), disp('is '), disp(h);

If we’re not learning from this example, then we simply display the cost of this particular example.

That’s the end of the loop, so we close the block in Octave with:


Now we calculate the average cost across all the examples:

J = J / -m;

This gives us a useful guide to see if the cost is reducing each time we run the function (which it should!).

Now we’ll update the weights in the network, remembering to divide by the number of training examples as we are in batch mode:

if (learn==1)
    THETA1 = THETA1 - (alpha * (T1_DELTA / m));
    THETA2 = THETA2 - (alpha * (T2_DELTA / m));
    disp('J: '), disp(J);

Lastly, we’ll set the return values as our new updated weights:

THETA1_new = THETA1;
THETA2_new = THETA2;

And ending the function:


So that’s all that’s required to run all the training examples through the network. If you run the function a few times, you should see the cost of the network reducing as we expect it to:


If you run the function as shown, you’ll see that the cost (J) reduces each time, but not by a lot. It would be a bit boring to have to run the function each time, so let’s set up a short script to run it many times:

XOR = [0,0,0; 0,1,1; 1,0,1; 1,1,0];
THETA1 = 0;
THETA2 = 0;

[THETA1, THETA2] = xor_nn(XOR, THETA1, THETA2, 1, 1, 0.01);
for i = 1:100000
    [THETA1, THETA2] = xor_nn(XOR, THETA1, THETA2, 0, 1, 0.01);
    if (mod(i,1000) == 0)
        disp('Iteration : '), disp(i)
        [THETA1, THETA2] = xor_nn(XOR, THETA1, THETA2);

There’s not a lot here that’s new. The first call to the xor_nn function initialises the weights. The loop calls the function 100,000 times, printing out the cost every 1000 iterations. That may seem like a lot of function calls, but remember that the network weights are being adjusted by a small amount each time.

If you run that script, you should see something like this as the output:


When the loop gets to the end, the output will be close to this (the exact numbers may be different):


As you can see, the network guesses small numbers (close to 0) for the first and last XOR examples and high (close to 1) for the two middle examples. This is close to what we want the network to do, so we’ve successfully trained this particular network to recognise the XOR function.

If you wish to download the code directly, it’s available here on Github:

There are a lot of things that we should do to make sure that the algorithm is optimised, including checking that the default learning rate of 0.01 is actually the right rate. But that is a job for another day.

Now that you’ve seen a simple neural network recognizing patterns and learning from examples, you can implement your own in Octave for other more interesting problems.

A Simple Neural Network in Octave – Part 2

In the last post in this short series, we looked at how to build a small neural network to solve the XOR problem. The network we built can generate an output for any of the inputs that we gave it from the table. It’s most likely that it sometimes gets the wrong answer, because the weights on the links were randomly generated.

Now we’ll look at getting the network to learn from it’s mistakes so that it gets better each time. In the context of neural networks, learning means adjusting those weights so it gets better and better at giving the right answers.

The first step is to find out how wrong the network is. This is called the cost or loss of the network. There are several ways to do this from the simple to the complex. A common approach is called “logistic regression”, which avoids problems of the neural network getting stuck in it’s learning.

Recall from our XOR table that there are two possible outputs, either 1 or 0. There will be two different versions of the cost depending on the output:

lr cost function

Remember that hθ(x) is the hypothesis that the network produces, given the input x=(x1,x2) and the weights θ. “log” is the standard mathematical function that calculates the natural logarithm of the value given (hence the name “logistic regression”).

We can combine these two instances together in a single formula that makes it easier and faster to translate into code:

lr cost function simple

So our code to represent this:

y = 0;
J = ((y * log(h)) + ((1 - y) * log(1 - h))) * -1 ;

This gives us a good idea how the network is performing on this particular input, with y set to the expected output from the first row of the XOR table.

A better network will have a lower cost, so our goal in training the network is to minimize the cost. The algorithm used to do that is called the back propagation algorithm (or backprop for short).

The first step is to evaluate the error at the output node (layer 3):


Note that the superscript 3 refers to the layer and doesn’t mean cubed! The code in Octave is:

delta3 = h - y;

So far, so straightforward. Backprop takes the error at a particular layer within a network and propagates it backward through the network. In general, the formula for that is:

nn node backprop calc

This looks a little complicated so let’s dissect it to make it easier to understand and translate into Octave code.

Firstly, the l superscript on some of the terms in the formula refers to the layer in the network. The next layer we need to work out is layer 2, so l = 2. We already know what δ3 is from the previous code above.

Next, Θ(l-1) refers to the matrix of weights at the layer l-1. The T superscript means to transpose the matrix (basically swap the rows and columns). This makes it easier for us to multiple the two matrices. Octave being what it is makes this very easy for us, so the first part of the formula is easy to translate:

THETA2’ * delta3

Note the apostrophe after THETA2. This is Octave’s way of transposing a matrix.

The second part of the formula represents the derivative of the activation function. Fortunately, there is an easy way to calculate that without getting into the complexity of calculus:


The “.*” means element-wise multiplication of a matrix. So each term in the first matrix is multiplied by the corresponding term in the second matrix. In Octave, it’s almost exactly the same (for layer 2):

Z2 .* (1 - Z2)

Putting those two parts together then:

delta2 = ((THETA2' * delta3) .* (Z2 .* (1 – Z2)))(2:end);

Remember that these are matrix operations, so delta2 will be a matrix too. The last part of the line of code (2:end) means to take the second element of the matrix to the end. The reason this is done is because we are not propagating an error back from the bias node.

At this point, we have the error in the output at layer 3 and layer 2. Since layer 1 is the input layer, there isn’t any error there, so we don’t need to calculate anything.

The last step in the backprop is to adjust the weights now that we know those errors. This is also quite simple:

weight adj

The formula says that the change to the weights Θ(l) is the activation values multiplied by the errors, adjusted by the value α, which represents the learning rate. The learning rate governs how large the adjustments to the weights are. It’s a bit of a misnomer in that it doesn’t relate to how “fast” the network learns, although that can be the effect of having a relatively larger value. For solving the XOR problem, a learning rate of 0.01 is sufficient.

So in Octave:

THETA2 = THETA2 - (0.01 * (delta3 * A2'));
THETA1 = THETA1 - (0.01 * (delta2 * A1'));

Once we’ve adjusted the weights in the matrices, the network will give us a slightly different hypothesis for each of our inputs. We can recalculate the cost function above to see how much improvement there is.

If we repeat the calculations, the cost should start reducing and the network will get better and better at producing the desired output.


In the next post, we’ll pull all the code together and see what happens when we start training the network across all the examples.

The last part of this series is here.