Getting started with neural networks can seem to be a daunting prospect, even if you have some programming experience. The many examples on the Internet dive straight into the mathematics of what the neural network is doing or are full of jargon that can make it a little difficult to understand what’s going on, not to mention how to implement it in actual code.
So I decided to write a post to help myself understand the mechanics, and it turns out that it will take a few parts to get through it all!
Anyway, to start with, there is a great free numerical computation package called Octave that you can use to play around with Machine Learning concepts. Octave itself does not know about neural networks, but it does know how to do fast matrix multiplication. This important feature of Octave will be made clear later.
You can find out more and download the Octave software here:
https://www.gnu.org/software/octave/
A nice toy problem to start with is the XOR problem. XOR means "exclusive OR" and it is best explained in a table:
| x1 (input) | x2 (input) | y (output) |
|:----------:|:----------:|:----------:|
| 0          | 0          | 0          |
| 0          | 1          | 1          |
| 1          | 0          | 1          |
| 1          | 1          | 0          |
What the table shows is that there are two inputs (labelled x1 and x2) and one output (labelled y). When x1 and x2 are both set to 0, the output we expect is also 0. Similarly, when x1 and x2 are both set to 1, the output is also 0. However, when x1 and x2 are set to different values, the output will be 1.
The challenge is to build a neural network that can successfully learn to produce the correct output given the four different inputs in the table.
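To make this concrete, here is one way to hold the truth table in Octave (the variable names X and Y are my own choice here, not anything the rest of the post depends on):

X = [0 0; 0 1; 1 0; 1 1];   % each row is one (x1, x2) input pair
Y = [0; 1; 1; 0];           % the expected output for each row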
Let’s have a quick look at a graphical representation of the problem:
The graph shows the two inputs x1 and x2 on their respective axes. Where x1 and x2 have the same value, the graph shows yellow circles, and where x1 and x2 are different, it shows blue circles.
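If you want to recreate that graph yourself, here is a quick sketch in Octave (the plotting style is my own choice):

% Yellow circles where x1 and x2 match, blue where they differ
hold on;
plot([0 1], [0 1], 'yo', 'markersize', 12, 'markerfacecolor', 'y');
plot([0 1], [1 0], 'bo', 'markersize', 12, 'markerfacecolor', 'b');
axis([-0.5 1.5 -0.5 1.5]);
xlabel('x1'); ylabel('x2');
hold off;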
There's an important constraint that the graph shows clearly: it isn't possible to draw a single straight line across the graph so that the yellow circles are on one side and the blue circles are on the other. A problem where such a line does exist is called "linearly separable". The XOR problem is not linearly separable, which means that we are going to need a multi-layer neural network to solve it.
The diagram below shows a typical configuration for a neural network that can be trained to solve the XOR problem.
There are a number of things to note about this particular network. Firstly, the inputs in the table above (x1 and x2) are mapped directly onto the nodes represented by a1 and a2. Secondly, this first layer of nodes also contains a bias node, whose output is always set to +1. Thirdly, the nodes in the middle of the diagram are collectively called the hidden nodes or hidden layer; it also contains a bias node set to +1. The outputs of the other nodes are labelled with the Greek letter sigma, which will become clearer below. Lastly, the output of the network is labelled h.
It's useful to represent the inputs as a vector (a one-dimensional matrix) that looks like this:

$$a^{(1)} = \begin{bmatrix} 1 \\ a_1 \\ a_2 \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}$$
This can be translated directly into Octave as:
A1 = [1; 0; 0];
In the above example, 1 represents the bias node, and the two zeros represent the first row of the variables from our table above replacing the a1 and a2 in the vector. You can replace the two zeros with values from other rows on the table to see what happens to the output after we’ve built up the network.
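For example, to try the (1, 0) row from the table:

A1 = [1; 1; 0];   % bias, then x1 = 1, x2 = 0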
The links from the nodes in the first layer to the nodes in the second layer have weights associated with them, denoted by the letter theta (along with a superscript 1) in the diagram of our network. Similarly, the weights in layer 2 are also shown with a superscript 2.
The subscript numbers identify the nodes at either end of a link, in the form (i,j), where i is the node receiving the signal and j is the node sending it. This is slightly counter-intuitive, since the normal expectation is that the signal moves from i to j. The reason for this convention is that it lets us represent all the weights at a given layer in a single matrix. Numbering the bias node in each layer as node 0, the layer 1 weights look like this:

$$\Theta^{(1)} = \begin{bmatrix} \theta^{(1)}_{1,0} & \theta^{(1)}_{1,1} & \theta^{(1)}_{1,2} \\ \theta^{(1)}_{2,0} & \theta^{(1)}_{2,1} & \theta^{(1)}_{2,2} \end{bmatrix}$$
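With that layout, picking out a single weight in Octave is just matrix indexing. For example (using the THETA1 matrix created below), the weight from node 2 in layer 1 into node 1 of layer 2 would be:

w = THETA1(1, 3);   % row 1 = receiving node 1; column 3 = sending node 2 (the bias node 0 is column 1)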
Typically, the initial values of the weights in a network are set to random values between -1 and +1, since we have no idea what they actually should be. We can do this in Octave as follows:
THETA1 = 2*rand(2,3) - 1;
And for Θ2, which connects the 3 nodes of the hidden layer (including its bias node) to the single output node:
THETA2 = 2*rand(1,3) - 1;
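Because the weights start out random, every run will behave differently. If you want repeatable results while experimenting, you can seed Octave's random number generator first (an optional extra, not part of the original recipe):

rand("state", 42);   % any fixed value makes subsequent rand() calls repeatable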
So now we can simply multiply the weight matrix by the input vector to work out what the input to the second layer is:

$$z^{(2)} = \Theta^{(1)} a^{(1)}$$

where z⁽²⁾ represents the input to the second layer of nodes. This multiplication will result in another vector (a one-dimensional matrix). Our Octave code for this is simply:
Z2 = THETA1 * A1;
This is a lot better (and faster) than having to calculate each of these inputs separately. In fact, most machine learning libraries provide fast matrix multiplication (and other matrix operations) precisely because it is such an efficient way to implement machine learning algorithms.
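To see what the matrix multiplication is doing for us, here is the same calculation written out with explicit loops (for illustration only; the one-line version above is what you'd actually use):

% Equivalent to Z2 = THETA1 * A1, element by element
Z2 = zeros(2, 1);
for i = 1:2
  for j = 1:3
    Z2(i) = Z2(i) + THETA1(i, j) * A1(j);
  end
end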
To calculate the output of layer 2, we must apply a function to the input. The typical function used is called a sigmoid function (represented by the sigma in the network diagram):

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

It looks like this in Octave:
function [result] = sigmoid(x)
  result = 1.0 ./ (1.0 + exp(-x));
end
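Note the element-wise division operator (./) in the function body: it means sigmoid works on whole vectors at once, which is exactly what we need for Z2. You can sanity-check it at the Octave prompt; the output should be 0.5 at zero and approach 1 and 0 for large positive and negative inputs:

sigmoid(0)            % ans = 0.50000
sigmoid([-10 0 10])   % ans = 4.5398e-05   0.50000   0.99995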
So the output of layer 2 is the sigmoid of the input, or

$$a^{(2)} = \begin{bmatrix} 1 \\ \sigma(z^{(2)}) \end{bmatrix}$$

which in Octave is:
A2 = [1; sigmoid(Z2)];
Note that we add the extra 1 as the first element in the vector to represent the bias that will be needed as an input into layer 3.
We repeat the process for layer 3, multiplying the output of layer 2 by the matrix of weights for layer 2 to get the input for layer 3, and then taking the sigmoid of the result:

$$z^{(3)} = \Theta^{(2)} a^{(2)}, \qquad h = \sigma(z^{(3)})$$
The output from the network is then a single value, called our hypothesis (h). This is the network's guess at the output given its input. The Octave code for this is:
Z3 = THETA2 * A2;
h = sigmoid(Z3);
That's the network fully constructed. We can feed in any of the inputs from the table (by setting them in the A1 vector) and see what output the network produces (the hypothesis h).
Here is an example of the above code running, pulled together into a single script (my own assembly of the snippets above; it assumes the sigmoid function has been saved as sigmoid.m on Octave's path):
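% Forward pass through the XOR network with random weights
A1 = [1; 0; 0];             % bias, then the (0, 0) row of the table
THETA1 = 2*rand(2,3) - 1;   % random layer 1 weights (2x3)
THETA2 = 2*rand(1,3) - 1;   % random layer 2 weights (1x3)
Z2 = THETA1 * A1;           % input to layer 2
A2 = [1; sigmoid(Z2)];      % output of layer 2, with bias added
Z3 = THETA2 * A2;           % input to layer 3
h = sigmoid(Z3)             % the hypothesis; one run printed h = 0.31328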
It is almost certain that the network will get the wrong answer (outputting a 1 when it should be outputting a 0 and vice versa). The example above shows h to be 0.31328 (you may get a completely different value), which is clearly wrong, as for a (0,0) input, we should get an output of 0.
In the next post in this series, we’ll look at how to get the network to learn from its mistakes and how it can get much better at outputting correct values.