This is the final post in a short series looking at implementing a small neural network to solve the XOR problem in Octave.

In the first post, we looked at how to set up the parameters of the network (the weights between the nodes), feed in an example and get the network to predict some output. The second post looked at implementing the back propagation algorithm, which adjusts those parameters in the network to improve the accuracy of its outputs.

Now it’s time to pull those pieces together so that we can let the network adjust the parameters to get as close to the right output as we desire.

First, we’ll set up a function to run all the training examples through the network. Presenting all the training examples once (one after the other) is called an epoch. The goal of this function will be to return the updated parameters after an epoch:

function [THETA1_new, THETA2_new] = xor_nn(XOR, THETA1, THETA2, init_w=0, learn=0, alpha=0.01)

Just a quick note on the notation and parameters in this function header. The first part after the word “function” specifies what this function returns. In our case, we want new versions of the parameters in the network (represented by THETA1 and THETA2).

The name of the function is “xor_nn” and the parameters to the function are contained within the brackets:

- XOR: the matrix representing the training set.
- THETA1 & THETA2: the current values of the parameters in the network.
- init_w=0: when set to 1, this tells the function to (re)initialize the weights in the network (default 0).
- learn=0: when set to 1, this tells the network to learn from the examples (see below; default 0).
- alpha=0.01: the learning rate (default value is 0.01).

Note that any parameter that has an “=” sign is optional and will get the default value shown if it is not provided explicitly.

The first step in our function is to check if we need to initialize the weights.

if (init_w == 1)
    THETA1 = 2*rand(2,3) - 1;
    THETA2 = 2*rand(1,3) - 1;
endif

This is simply the same code as before, inside an “if” block.

Now we initialize the cost variable. This is done for every epoch:

J = 0.0;

In the previous post, we looked at updating the weights in the network after calculating the cost after the network has processed a single training example. This is a specific type of learning called “online learning”. There is another type of learning called “batch learning”, and this works by updating the weights once after *all* the training examples have been processed. Let’s implement that instead in our function.
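To make the distinction concrete, here is a small sketch of the two update styles. It is written in Python/NumPy rather than Octave, purely as a runnable aside, and uses a made-up one-parameter model (fitting y = w·x) rather than our network:

```python
import numpy as np

# Toy model: fit y = w * x by gradient descent on squared error.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])   # the true w is 2
lr = 0.1                        # learning rate

# Online learning: update w after every single example.
w_online = 0.0
for xi, yi in zip(x, y):
    grad = (w_online * xi - yi) * xi   # gradient for this one example
    w_online -= lr * grad

# Batch learning: accumulate gradients over the whole epoch,
# then update once using the average.
w_batch = 0.0
grad_total = 0.0
for xi, yi in zip(x, y):
    grad_total += (w_batch * xi - yi) * xi
w_batch -= lr * (grad_total / len(x))

print(w_online, w_batch)
```

Both versions move the weight towards the true value; they just differ in how often the update is applied. Our function below follows the batch pattern: accumulate deltas inside the loop, update the weights once after it.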

We need to record the number of training examples and the total delta across all the training examples (rather than just across one as before). So here are the extra variables required:

T1_DELTA = zeros(size(THETA1));
T2_DELTA = zeros(size(THETA2));
m = 0;

Remember that THETA1 and THETA2 are matrices, so we need to initialize every element of the matrix, and make our delta matrices the same size. We’ll also use “m” to record the number of training examples that we present to the network in the epoch.

Now let’s set up a loop to present those training examples to the network one by one:

for i = 1:rows(XOR)

This simply says: repeat the following block once for each row in our XOR data set.

Now put in the code from Part 1 that processes an input:

A1 = [1; XOR(i,1:2)'];
Z2 = THETA1 * A1;
A2 = [1; sigmoid(Z2)];
Z3 = THETA2 * A2;
h = sigmoid(Z3);
J = J + ( XOR(i,3) * log(h) ) + ( (1 - XOR(i,3)) * log(1 - h) );
m = m + 1;

Note the slight change in moving the bias node from the layer input calculation (Z3) to the output from the previous layer (A2). This just makes the code slightly simpler.
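For reference, the forward pass above could be sketched like this in Python/NumPy (an illustrative aside, not part of the Octave code; the weight values here are made up, and the sigmoid function, which the Octave code assumes is already defined, is written out explicitly):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, just for illustration: 2x3 and 1x3 matrices,
# matching the shapes of THETA1 and THETA2 in the post.
THETA1 = np.array([[ 0.1, 0.2, -0.1],
                   [-0.2, 0.1,  0.3]])
THETA2 = np.array([[ 0.2, -0.3, 0.1]])

x = np.array([1.0, 0.0])           # one XOR input, e.g. (1, 0)
A1 = np.concatenate(([1.0], x))    # prepend the bias node to the input
Z2 = THETA1 @ A1                   # hidden layer pre-activation
A2 = np.concatenate(([1.0], sigmoid(Z2)))  # bias added to the activations
Z3 = THETA2 @ A2
h = sigmoid(Z3)                    # the network's hypothesis
print(h)
```

Whatever the weights, the output is a single value squashed between 0 and 1 by the sigmoid, which is what lets us read it as the network’s guess at the XOR output.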

Then we add the code from Part 2, inside a test to see if we are in learning mode. The code has been slightly modified as we are implementing batch learning rather than online learning. In batch mode, we want to calculate the errors across all the examples:

if (learn == 1)
    delta3 = h - XOR(i,3);
    delta2 = ((THETA2' * delta3) .* (A2 .* (1 - A2)))(2:end);
    T2_DELTA = T2_DELTA + (delta3 * A2');
    T1_DELTA = T1_DELTA + (delta2 * A1');
else
    disp('Hypothesis for '), disp(XOR(i,1:2)), disp('is '), disp(h);
endif

If we’re not learning from this example, then we simply display the network’s hypothesis (its output) for this particular example. (The cost is displayed later, after the loop.)
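One detail worth noting in the delta2 line: A2 .* (1 - A2) is the derivative of the sigmoid evaluated at Z2, since for a = sigmoid(z) it can be shown that sigmoid'(z) = a(1 - a). A quick numerical check of that identity in Python/NumPy (an illustrative aside, not part of the Octave code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Verify that sigmoid'(z) == a * (1 - a), where a = sigmoid(z),
# by comparing against a central-difference numerical derivative.
z = np.linspace(-5, 5, 11)
a = sigmoid(z)
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
print(np.max(np.abs(numeric - a * (1 - a))))   # effectively zero
```

This is why the code can work entirely with the activations (A2) and never needs to re-evaluate anything at Z2.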

That’s the end of the loop, so we close the block in Octave with:

endfor

Now we calculate the average cost across all the examples:

J = J / -m;

Dividing by -m turns the accumulated log-likelihood into a positive average cost. This gives us a useful guide to see if the cost is reducing each time we run the function (which it should!).
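Here is the same average cross-entropy cost computed in Python/NumPy on some made-up network outputs (illustrative only; the h values below are hypothetical, not from a trained network):

```python
import numpy as np

# Cross-entropy cost averaged over m examples, as in J = J / -m.
h = np.array([0.1, 0.9, 0.8, 0.2])   # hypothetical network outputs
y = np.array([0.0, 1.0, 1.0, 0.0])   # XOR targets
m = len(y)
J = -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h)) / m
print(J)
```

Each term is negative (a log of a number below 1), so the sum is negative; dividing by -m flips it into a positive cost that shrinks as the outputs h move towards the targets y.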

Now we’ll update the weights in the network, remembering to divide by the number of training examples as we are in batch mode:

if (learn == 1)
    THETA1 = THETA1 - (alpha * (T1_DELTA / m));
    THETA2 = THETA2 - (alpha * (T2_DELTA / m));
else
    disp('J: '), disp(J);
endif
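The update itself is plain batch gradient descent: average the accumulated gradient, then take a small step against it. Sketched in Python/NumPy with made-up numbers (illustrative only):

```python
import numpy as np

alpha = 0.01   # learning rate, as in the post
m = 4          # number of training examples in the epoch

# Hypothetical accumulated gradient, same shape as THETA1 (2x3).
THETA1 = np.zeros((2, 3))
T1_DELTA = np.ones((2, 3))

# Batch update: average the accumulated gradient, then step downhill.
THETA1 = THETA1 - alpha * (T1_DELTA / m)
print(THETA1)   # every element is now -0.0025
```

Every element of THETA1 moves a small step (alpha times the averaged gradient) in the direction that reduces the cost.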

Lastly, we’ll set the return values as our new updated weights:

THETA1_new = THETA1;
THETA2_new = THETA2;

And ending the function:

endfunction

So that’s all that’s required to run all the training examples through the network. If you run the function as shown, you’ll see that the cost (J) reduces each time, but not by much. It would be tedious to run the function by hand over and over, so let’s set up a short script to run it many times:

XOR = [0,0,0; 0,1,1; 1,0,1; 1,1,0];
THETA1 = 0;
THETA2 = 0;

[THETA1, THETA2] = xor_nn(XOR, THETA1, THETA2, 1, 1, 0.01);

for i = 1:100000
    [THETA1, THETA2] = xor_nn(XOR, THETA1, THETA2, 0, 1, 0.01);
    if (mod(i,1000) == 0)
        disp('Iteration : '), disp(i)
        [THETA1, THETA2] = xor_nn(XOR, THETA1, THETA2);
    endif
endfor

There’s not a lot here that’s new. The first call to the xor_nn function initializes the weights. The loop then calls the function 100,000 times, printing out the cost every 1,000 iterations. That may seem like a lot of function calls, but remember that the network weights are being adjusted by only a small amount each time.
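As a cross-check, here is one way the whole function and driver script might look ported to Python/NumPy. This is a sketch under assumptions, not the author’s code: it also returns the cost J so the caller can track it, uses a fixed random seed, and runs only 5,000 epochs (rather than the post’s 100,000) to keep the demo quick:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def xor_nn(XOR, THETA1, THETA2, init_w=0, learn=0, alpha=0.01):
    if init_w == 1:
        rng = np.random.default_rng(0)       # fixed seed for repeatability
        THETA1 = 2 * rng.random((2, 3)) - 1  # weights in (-1, 1)
        THETA2 = 2 * rng.random((1, 3)) - 1
    J = 0.0
    T1_DELTA = np.zeros_like(THETA1)
    T2_DELTA = np.zeros_like(THETA2)
    m = 0
    for i in range(XOR.shape[0]):
        # Forward pass (bias node prepended at each layer).
        A1 = np.concatenate(([1.0], XOR[i, :2]))
        Z2 = THETA1 @ A1
        A2 = np.concatenate(([1.0], sigmoid(Z2)))
        Z3 = THETA2 @ A2
        h = sigmoid(Z3)[0]
        y = XOR[i, 2]
        J += y * np.log(h) + (1 - y) * np.log(1 - h)
        m += 1
        if learn == 1:
            # Backward pass: accumulate deltas across the batch.
            delta3 = h - y
            delta2 = ((THETA2.T * delta3).ravel() * (A2 * (1 - A2)))[1:]
            T2_DELTA += delta3 * A2.reshape(1, -1)
            T1_DELTA += np.outer(delta2, A1)
    J = J / -m
    if learn == 1:
        THETA1 = THETA1 - alpha * (T1_DELTA / m)
        THETA2 = THETA2 - alpha * (T2_DELTA / m)
    return THETA1, THETA2, J

XOR = np.array([[0., 0., 0.], [0., 1., 1.], [1., 0., 1.], [1., 1., 0.]])
THETA1, THETA2, J0 = xor_nn(XOR, None, None, init_w=1, learn=1)
J = J0
for i in range(5000):
    THETA1, THETA2, J = xor_nn(XOR, THETA1, THETA2, learn=1)
print(J0, J)   # the cost should have decreased
```

With the small default learning rate, the decrease per epoch is tiny, which is exactly why the original script needs so many iterations.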

If you run that script, you should see the cost printed every 1,000 iterations steadily decrease. When the loop finishes, the network’s outputs will be close to the targets (the exact numbers will vary with the random initial weights): small values (close to 0) for the first and last XOR examples, and high values (close to 1) for the two middle examples. This is what we want the network to do, so we’ve successfully trained this particular network to recognise the XOR function.

If you wish to download the code directly, it’s available here on Github:

https://github.com/StephenOman/Octave/tree/master/xor%20neural%20network

There are a lot of things that we should do to make sure that the algorithm is optimised, including checking that the default learning rate of 0.01 is actually the right rate. But that is a job for another day.

Now that you’ve seen a simple neural network recognizing patterns and learning from examples, you can implement your own in Octave for other more interesting problems.

Just found your blog, hence the late reply on an old post, but great walkthrough! I’m eager to try this. I have an implementation like this in C++ but Octave is a lot more “dev friendly” and great for testing 🙂

If you still read these, I was wondering: sometimes complex neural nets have local minima. Would it help if you used the partial derivatives of the more sophisticated cost function “J” you mentioned before to do gradient descent?


Hello Richard and thanks for reading! Sorry about the delay in replying.

In fact, for multilayered neural networks, the algorithm does calculate partial derivatives of the cost function. Using the calculus chain rule, it is possible to show that the weight updates in the earlier layers are indeed based on partial derivatives with respect to ‘J’.

It’s somewhat surprising, but complex neural networks don’t usually get stuck in local minima. This is due to the high dimensionality of the problem space. For example, consider a network being trained on images of 20 pixels x 20 pixels x 3 colour channels. That’s an input vector of dimension 1200. For the network to get stuck in a local minimum, the point would have to be a minimum in all 1200 dimensions simultaneously, which is highly unlikely.

A more usual problem is that the solution space is quite flat, so that the training wanders around for a while with no appreciable improvement.

Hope that helps!


Hi Stephen,

If you are still reading this, thank you for the amazing tutorial.

I was wondering how the value of the cost function ‘J’ is being used in updating the weights ‘theta’. Also, for ‘delta_2’, the equation described in Part 2 and the equation coded here seem to be different, in that Z.*(1-Z) is used in Part 2 whereas A.*(1-A) is coded here.

Thanks


Great post. Very instructive and gives a good feeling for what NN training is. With a few small changes it’s easily applicable in Matlab.


Hi Stephen,

I have the same question like vikas.

Regards

Ulrich


Hello Stephen,

Not sure if you are still following this thread. However, I spent many hours trying to implement regularization in your example without getting the expected results. I wonder if you can explain how to do it in the same clear fashion as you have explained neural networks.

Thanks,

Yaron


Hello Yaron,

Thanks for reading the post and following up. That’s a neat idea and I’ll add it to the list of topics I’ve got on the go at the moment.

Best wishes,

Stephen.
