Disruption to jobs the likely effect of Artificial Intelligence

Much of the coverage in mass media about artificial intelligence and machine learning tends towards an alarmist position about robots taking over the world. In most examples, there’ll be a picture of the Terminator robot and a reference to the Elon Musk / Stephen Hawking claim that we should be afraid of AI.

I’ve written before that this isn’t really the problem that we should be thinking about. Instead, it’s the simpler narrow AI that will constantly erode jobs by carrying out specific tasks that today requires humans to do.

In an article by Jeff Goodell in Rolling Stone magazine, it appears that this point of view is finally reaching the mainstream press:

In fact, the problem with the hyperbole about killer robots is that it masks the real risks that we face from the rise of smart machines – job losses due to workers being replaced by robots, the escalation of autonomous weapons in warfare, and the simple fact that the more we depend on machines, the more we are at risk when something goes wrong, whether it’s from a technical glitch or a Chinese hacker.

via Inside the Artificial Intelligence Revolution: Pt. 1 | Rolling Stone

The last point is intriguing too. Machine Learning models are particularly opaque. Researchers understand how the algorithms learn. What’s not clear is why they are so effective in doing what they do.

But the main point is that progress won’t stop AI being deployed to do the work of people. The questions now are how will it be managed and how will society adapt?



2015 Underhanded C Contest Results Released

I’ve just found out that the results of the 2015 Underhanded C Contest have been published and (YIPPEE!!), I managed to bag a runner-up spot.

This year, the competition was sponsored by the Nuclear Threat Initiative. The goal of the contest was to fake the results of a test of a nuclear warhead to show that a warhead contained fissile material, when in fact it did not.

Congratulations to the winner, Linus Åkesson. You can read about his entry on his blog. It’s also worth reading about the other entrants’ ideas for how to hide malicious code inside normal looking code.

Here is the NTI’s article about the contest: http://www.nti.org/newsroom/news/underhanded-c-contest-highlights-challenges-nuclear-arms-control-verification-technologies/

Lastly, if you’re curious about my entry, I’ve posted the code and an explanation on GitHub: https://github.com/StephenOman/UnderhandedC2015


Solving XOR with a Neural Network in TensorFlow

The tradition of writing a trilogy in five parts has a long and noble history, pioneered by the great Douglas Adams in the Hitchhiker’s Guide to the Galaxy. This post is no exception and follows from the previous four looking at a Neural Network that solves the XOR problem.

This time, I had a interest in checking out Google’s machine learning system, TensorFlow.

Last November, Google open sourced the machine learning library that it uses within their own products. There are two API’s; one for Python and one for C++.  Naturally, it makes sense to see what TensorFlow would make of the same network that we previously looked at and compare both Python-based neural networks.

You can download the TensorFlow library and see some of the tutorials here.

It’s worth noting that TensorFlow requires a little mental agility to understand. Essentially. the first few lines of code set up the inputs, the network architecture, the cost function, and the method to use to train the network. Although these look like the same steps as the steps in Python or Octave, they don’t in fact do anything. This is because TensorFlow considers those to be the model to use, running them only within a session. This model is in the form of a directed graph. Don’t worry if that’s not too clear yet.

TensorFlow Code

To start with, we need to load in the TensorFlow library:

import tensorflow as tf

The next step is to set up placeholders to hold the input data. TensorFlow will automatically fill them with the data when we run the network. In our XOR problem, we have four different training examples and each example has two features. There are also four expected outputs, each with just one value (either a 0 or 1). In TensorFlow, this looks like this:

x_ = tf.placeholder(tf.float32, shape=[4,2], name="x-input")
y_ = tf.placeholder(tf.float32, shape=[4,1], name="y-input")

I’ve set up the inputs to be floating point numbers rather than the more natural integers to avoid having to cast them to floating points when multiplying the weights later on. The shape parameter tells the placeholder what the dimensions are of data we’ll be passing in.

The next step is to set up the parameters for the network. These are called Variables in TensorFlow.  Variables will be modified by TensorFlow during the training steps.

Theta1 = tf.Variable(tf.random_uniform([2,2], -1, 1), name="Theta1")
Theta2 = tf.Variable(tf.random_uniform([2,1], -1, 1), name="Theta2")

For our Theta matrices, we want them initialized to random values between -1 and +1, so we use the built-in random_uniform function to do that.

In TensorFlow, we set up the bias nodes separately, but still as Variables. This let’s the algorithms modify the values of the bias node. This is mathematically equivalent to having a signal value of 1 and initial weights of 0 on the links from the bias nodes.

Bias1 = tf.Variable(tf.zeros([2]), name="Bias1")
Bias2 = tf.Variable(tf.zeros([1]), name="Bias2")

Now we set up the model. This is pretty much the same as that outlined in the previous posts on Python and Octave:

A2 = tf.sigmoid(tf.matmul(x_, Theta1) + Bias1)
Hypothesis = tf.sigmoid(tf.matmul(A2, Theta2) + Bias2)

Here, matmul is TensorFlow’s matrix multiplication function, and sigmoid naturally is the sigmoid calculation function.

As before, our cost function is the average over all the training examples:

cost = tf.reduce_mean(( (y_ * tf.log(Hypothesis)) + 
        ((1 - y_) * tf.log(1.0 - Hypothesis)) ) * -1)

So far, that has been relatively straightforward. Let’s look at training the network.

TensorFlow ships with several different training algorithms, but for comparison purposes with our previous implementations, we’re going to use the gradient descent algorithm:

train_step = tf.train.GradientDescentOptimizer(0.01).minimize(cost)

What this statement says is that we’re going to use GradientDescentOptimizer as our training algorithm, the learning rate (alpha from before) is going to be 0.01 and we want to minimize the cost function above. This means that we don’t have to implement our own algorithm as we did in the previous examples.

That’s all there is to setting up the network. Now we just have to go through a few initialization steps before running the examples through the network:

XOR_X = [[0,0],[0,1],[1,0],[1,1]]
XOR_Y = [[0],[1],[1],[0]]

init = tf.global_variables_initializer()
sess = tf.Session()

As I mentioned above, TensorFlow runs a model inside a session, which it uses to maintain the state of the variables as they are passed through the network we’ve set up. So the first step in that session is to initialise all the Variables from above. This step allocates values to the various Variables in accordance with how we set them up (i.e. random numbers for Theta and zeros for Bias).

The next step is to run some epochs:

for i in range(100000):
        sess.run(train_step, feed_dict={x_: XOR_X, y_: XOR_Y})

Each time the training step is executed, the values in the dictionary feed_dict are loaded into the placeholders that we set up at the beginning. As the XOR problem is relatively simple, each epoch will contain the entire training set. To see what’s going on inside the loop, just print out the values of the Variables:

if i % 1000 == 0:
        print('Epoch ', i)
        print('Hypothesis ', sess.run(Hypothesis, feed_dict={x_: XOR_X, y_: XOR_Y}))
        print('Theta1 ', sess.run(Theta1))
        print('Bias1 ', sess.run(Bias1))
        print('Theta2 ', sess.run(Theta2))
        print('Bias2 ', sess.run(Bias2))
        print('cost ', sess.run(cost, feed_dict={x_: XOR_X, y_: XOR_Y}))

That’s it. If you run this in Python, you’ll get something that looks like this after 99000 epochs:


As you can see in the display for the Hypothesis variable, the network has learned to output nearly correct values for the inputs.


The TensorFlow Graph

I mentioned above that the model was in the form of a directed graph. TensorFlow let’s us see what that graph looks like:


We can see that our inputs x-input and y-input are the starts of the graph, and that they flow through the processes at layer2 and layer3, ultimately being used in the cost function.

To see the graph yourself, TensorFlow includes a utility called TensorBoard. Inside your code, before the sess.run(init) statement add the following line:

writer = tf.summary.FileWriter("./logs/xor_logs", sess.graph_def)

The folder in the quotes can point to a folder on your machine where you want to store the output. Running TensorBoard is then as simple as entering this at a command prompt:

$ tensorboard --logdir=/path/to/your/log/file/folder

In your browser, enter http://localhost:6006 as the URL and then click on Graph tab. You will then see the full graph.

So that’s an example neural network in TensorFlow.

There is one thing I did notice in putting this together: it’s quite slow. In Octave, I was able to run 10,000 epochs in about 9.5 seconds. The Python/NumPy example was able to run in 5.8 seconds. The above TensorFlow example runs in about 28 seconds on my laptop.

TensorFlow is however built to run on GPUs and the model can be split across multiple machines. This would go some way to improving performance, but there is a way to go to make it comparable with existing libraries.

UPDATE 26/01/2016:

I’ve placed the full code up on GitHub here: https://github.com/StephenOman/TensorFlowExamples/tree/master/xor%20nn


Update 07/03/2016:

I’ve changed the GitHub code to run a more realistic 100,000 epochs. Should converge now. [Thanks Konstantin!]


Solving XOR with a Neural Network in Python

In the previous few posts, I detailed a simple neural network to solve the XOR problem in a nice handy package called Octave.

I find Octave quite useful as it is built to do linear algebra and matrix operations, both of which are crucial to standard feed-forward multi-layer neural networks. However, it isn’t that fast and you would not be building any deep-learning models on large datasets using it.

Coding in Python

There is also a numerical operation library available in Python called NumPy. This library has found widespread use in building neural networks, so I wanted to compare a similar network using it to a network in Octave.

The last post showed an Octave function to solve the XOR problem. Recall the problem was that we wanted to have a neural network correctly generate an output of zero when x1 and x2 are the same (the yellow circles) and output of one when x1 and x2 are different (the blue circles):

xor graph

Here is the topology of the network we want to train:


Lastly, here is a function in Python that is equivalent to the Octave xor_nn function. The code also includes a sigmoid function:

import numpy as np
import math

def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

def xor_nn(XOR, THETA1, THETA2, init_w=0, learn=0, alpha=0.01):
        if init_w == 1:
                THETA1 = 2*np.random.random([2,3]) - 1
                THETA2 = 2*np.random.random([1,3]) - 1

        T1_DELTA = np.zeros(THETA1.shape)
        T2_DELTA = np.zeros(THETA2.shape)
        m = 0
        J = 0.0

        for x in XOR:
                A1 = np.vstack(([1], np.transpose(x[0:2][np.newaxis])))
                Z2 = np.dot(THETA1, A1)
                A2 = np.vstack(([1], sigmoid(Z2)))
                Z3 = np.dot(THETA2, A2)
                h = sigmoid(Z3)

                J = J + (x[2] * math.log(h[0])) + ((1 - x[2]) * math.log(1 - h[0]));
                m = m + 1;
                if learn == 1:
                        delta3 = h - x[2]
                        delta2 = (np.dot(np.transpose(THETA2), delta3) * (A2 * (1 - A2)))[1:]
                        T2_DELTA = T2_DELTA + np.dot(delta3, np.transpose(A2))
                        T1_DELTA = T1_DELTA + np.dot(delta2, np.transpose(A1))

        J = J / -m

        if learn == 1:
                THETA1 = THETA1 - (alpha * (T1_DELTA / m))
                THETA2 = THETA2 - (alpha * (T2_DELTA / m))

        return (THETA1, THETA2)

(This code is available on Github if you want to download it: Python NN on GitHub)

If you want more detail on how this function works, have a look back at Part 1, Part 2 and Part 3 of the series on the Octave version.

Comparing Python and Octave

To be sure that they both operate identically, I first generated some random numbers. These numbers were used to initialize the parameters (THETA1 and THETA2) in both functions. If you run several epochs (I ran 1000), then you will see that the values of THETA1 and THETA2 remain identical in Octave and Python. This makes sense as the only non-deterministic part of the algorithm is the initialization of the network’s parameters (i.e. the weights).

So we can be sure that both functions are executing the same algorithm.

The next step is to test how fast the networks run a large number of epochs. On my (ageing) MacBook, the Octave code runs 1000 epochs in about 9.5 seconds on average, while the Python code runs the same number in just 5.4 seconds. This is a pretty good performance improvement for what is practically the same code.

So if you are familiar with Python and want to start developing your own neural networks, then NumPy will give you the tools you need.



A Simple Neural Network in Octave – Part 3

This is the final post in a short series looking at implementing a small neural network to solve the XOR problem in Octave.

In the first post, we looked at how to set up the parameters of the network (the weights between the nodes), feed in an example and get the network to predict some output. The second post looked at implementing the back propagation algorithm, which adjusts those parameters in the network to improve the accuracy of it’s outputs.

Now it’s time to pull those pieces together so that we can let the network adjust the parameters to get as close to the right output as we desire.

First, we’ll set up a function to run all the training examples through the network. Presenting all the training examples once (one after the other) is called an epoch. The goal of this function will be to return the updated parameters after an epoch:

function [THETA1_new, THETA2_new] = xor_nn(XOR, THETA1, THETA2, init_w=0, learn=0, alpha=0.01)

Just a quick note on the notation and parameters in this function header. The first part after the word “function” specifies what this function returns. In our case, we want new versions of the parameters in the network (represented by THETA1 and THETA2).

The name of the function is “xor_nn” and the parameters to the function are contained within the brackets:

  • XOR This is the representation of the training set.
  • THETA1 & THETA2 Current values for the parameters in the network.
  • init_w=0 This tells the function to initialise the weights in the network
  • learn=0 This tells the network to learn from the examples (see below)
  • alpha=0.01 This is the learning rate (default value is 0.01)

Note that any parameter that has an “=” sign is optional and will get the default value shown if it is not provided explicitly.

The first step in our function is to check if we need to initialize the weights.

if (init_w == 1)
    THETA1 = 2*rand(2,3) - 1;
    THETA2 = 2*rand(1,3) - 1;

This is simply the same code as before, inside an “if” block.

Now we initialize the cost variable. This is done for every epoch:

J = 0.0;

In the previous post, we looked at updating the weights in the network after calculating the cost after the network has processed a single training example. This is a specific type of learning called “online learning”. There is another type of learning called “batch learning”, and this works by updating the weights once after all the training examples have been processed. Let’s implement that instead in our function.

We need to record the number of training examples and the total delta across all the training examples (rather than just across one as before). So here are the extra variables required:

T1_DELTA = zeros(size(THETA1));
T2_DELTA = zeros(size(THETA2));
m = 0;

Remember that THETA1 and THETA2 are matrices, so we need to initialize every element of the matrix, and make our delta matrices the same size. We’ll also use “m” to record the number of training examples that we present to the network in the epoch.

Now lets set up a loop to present those training examples to the network one by one:

for i = 1:rows(XOR)

This simply says repeat the following block for the same number of rows that exist in our XOR data set.

Now put in the code from Part 1 that processes an input:

A1 = [1; XOR(i,1:2)'];
Z2 = THETA1 * A1;
A2 = [1; sigmoid(Z2)];
Z3 = THETA2 * A2;
h = sigmoid(Z3);
J = J + ( XOR(i,3) * log(h) ) + ( (1 - XOR(i,3)) * log(1 - h) );
m = m + 1;

Note the slight change in moving the bias node from the layer input calculation (Z3) to the output from the previous layer (A2). This just makes the code slightly simpler.

Then we add the code from Part 2, inside a test to see if we are in learning mode. The code has been slightly modified as we are implementing batch learning rather than online learning. In batch mode, we want to calculate the errors across all the examples:

if (learn == 1)
    delta3 = h - XOR(i,3);
    delta2 = ((THETA2' * delta3) .* (A2 .* (1 - A2)))(2:end);
    T2_DELTA = T2_DELTA + (delta3 * A2');
    T1_DELTA = T1_DELTA + (delta2 * A1');
    disp('Hypothesis for '), disp(XOR(i,1:2)), disp('is '), disp(h);

If we’re not learning from this example, then we simply display the cost of this particular example.

That’s the end of the loop, so we close the block in Octave with:


Now we calculate the average cost across all the examples:

J = J / -m;

This gives us a useful guide to see if the cost is reducing each time we run the function (which it should!).

Now we’ll update the weights in the network, remembering to divide by the number of training examples as we are in batch mode:

if (learn==1)
    THETA1 = THETA1 - (alpha * (T1_DELTA / m));
    THETA2 = THETA2 - (alpha * (T2_DELTA / m));
    disp('J: '), disp(J);

Lastly, we’ll set the return values as our new updated weights:

THETA1_new = THETA1;
THETA2_new = THETA2;

And ending the function:


So that’s all that’s required to run all the training examples through the network. If you run the function a few times, you should see the cost of the network reducing as we expect it to:


If you run the function as shown, you’ll see that the cost (J) reduces each time, but not by a lot. It would be a bit boring to have to run the function each time, so let’s set up a short script to run it many times:

XOR = [0,0,0; 0,1,1; 1,0,1; 1,1,0];
THETA1 = 0;
THETA2 = 0;

[THETA1, THETA2] = xor_nn(XOR, THETA1, THETA2, 1, 1, 0.01);
for i = 1:100000
    [THETA1, THETA2] = xor_nn(XOR, THETA1, THETA2, 0, 1, 0.01);
    if (mod(i,1000) == 0)
        disp('Iteration : '), disp(i)
        [THETA1, THETA2] = xor_nn(XOR, THETA1, THETA2);

There’s not a lot here that’s new. The first call to the xor_nn function initialises the weights. The loop calls the function 100,000 times, printing out the cost every 1000 iterations. That may seem like a lot of function calls, but remember that the network weights are being adjusted by a small amount each time.

If you run that script, you should see something like this as the output:


When the loop gets to the end, the output will be close to this (the exact numbers may be different):


As you can see, the network guesses small numbers (close to 0) for the first and last XOR examples and high (close to 1) for the two middle examples. This is close to what we want the network to do, so we’ve successfully trained this particular network to recognise the XOR function.

If you wish to download the code directly, it’s available here on Github:


There are a lot of things that we should do to make sure that the algorithm is optimised, including checking that the default learning rate of 0.01 is actually the right rate. But that is a job for another day.

Now that you’ve seen a simple neural network recognizing patterns and learning from examples, you can implement your own in Octave for other more interesting problems.

A Simple Neural Network in Octave – Part 2

In the last post in this short series, we looked at how to build a small neural network to solve the XOR problem. The network we built can generate an output for any of the inputs that we gave it from the table. It’s most likely that it sometimes gets the wrong answer, because the weights on the links were randomly generated.

Now we’ll look at getting the network to learn from it’s mistakes so that it gets better each time. In the context of neural networks, learning means adjusting those weights so it gets better and better at giving the right answers.

The first step is to find out how wrong the network is. This is called the cost or loss of the network. There are several ways to do this from the simple to the complex. A common approach is called “logistic regression”, which avoids problems of the neural network getting stuck in it’s learning.

Recall from our XOR table that there are two possible outputs, either 1 or 0. There will be two different versions of the cost depending on the output:

lr cost function

Remember that hθ(x) is the hypothesis that the network produces, given the input x=(x1,x2) and the weights θ. “log” is the standard mathematical function that calculates the natural logarithm of the value given (hence the name “logistic regression”).

We can combine these two instances together in a single formula that makes it easier and faster to translate into code:

lr cost function simple

So our code to represent this:

y = 0;
J = ((y * log(h)) + ((1 - y) * log(1 - h))) * -1 ;

This gives us a good idea how the network is performing on this particular input, with y set to the expected output from the first row of the XOR table.

A better network will have a lower cost, so our goal in training the network is to minimize the cost. The algorithm used to do that is called the back propagation algorithm (or backprop for short).

The first step is to evaluate the error at the output node (layer 3):


Note that the superscript 3 refers to the layer and doesn’t mean cubed! The code in Octave is:

delta3 = h - y;

So far, so straightforward. Backprop takes the error at a particular layer within a network and propagates it backward through the network. In general, the formula for that is:

nn node backprop calc

This looks a little complicated so let’s dissect it to make it easier to understand and translate into Octave code.

Firstly, the l superscript on some of the terms in the formula refers to the layer in the network. The next layer we need to work out is layer 2, so l = 2. We already know what δ3 is from the previous code above.

Next, Θ(l-1) refers to the matrix of weights at the layer l-1. The T superscript means to transpose the matrix (basically swap the rows and columns). This makes it easier for us to multiple the two matrices. Octave being what it is makes this very easy for us, so the first part of the formula is easy to translate:

THETA2’ * delta3

Note the apostrophe after THETA2. This is Octave’s way of transposing a matrix.

The second part of the formula represents the derivative of the activation function. Fortunately, there is an easy way to calculate that without getting into the complexity of calculus:


The “.*” means element-wise multiplication of a matrix. So each term in the first matrix is multiplied by the corresponding term in the second matrix. In Octave, it’s almost exactly the same (for layer 2):

Z2 .* (1 - Z2)

Putting those two parts together then:

delta2 = ((THETA2' * delta3) .* (Z2 .* (1 – Z2)))(2:end);

Remember that these are matrix operations, so delta2 will be a matrix too. The last part of the line of code (2:end) means to take the second element of the matrix to the end. The reason this is done is because we are not propagating an error back from the bias node.

At this point, we have the error in the output at layer 3 and layer 2. Since layer 1 is the input layer, there isn’t any error there, so we don’t need to calculate anything.

The last step in the backprop is to adjust the weights now that we know those errors. This is also quite simple:

weight adj

The formula says that the change to the weights Θ(l) is the activation values multiplied by the errors, adjusted by the value α, which represents the learning rate. The learning rate governs how large the adjustments to the weights are. It’s a bit of a misnomer in that it doesn’t relate to how “fast” the network learns, although that can be the effect of having a relatively larger value. For solving the XOR problem, a learning rate of 0.01 is sufficient.

So in Octave:

THETA2 = THETA2 - (0.01 * (delta3 * A2'));
THETA1 = THETA1 - (0.01 * (delta2 * A1'));

Once we’ve adjusted the weights in the matrices, the network will give us a slightly different hypothesis for each of our inputs. We can recalculate the cost function above to see how much improvement there is.

If we repeat the calculations, the cost should start reducing and the network will get better and better at producing the desired output.


In the next post, we’ll pull all the code together and see what happens when we start training the network across all the examples.

The last part of this series is here.

A Simple Neural Network In Octave – Part 1

Getting started with neural networks can seem to be a daunting prospect, even if you have some programming experience. The many examples on the Internet dive straight into the mathematics of what the neural network is doing or are full of jargon that can make it a little difficult to understand what’s going on, not to mention how to implement it in actual code.

So I decided to write a post to help myself understand the mechanics and it turns out that it will require a few parts to get through it!

Anyway, to start with, there is a great free numerical computation package called Octave that you can use to play around with Machine Learning concepts. Octave itself does not know about neural networks, but it does know how to do fast matrix multiplication. This important feature of Octave will be made clear later.

You can find out more and download the Octave software here:


A nice toy problem to start with is the XOR problem. XOR means “exclusive OR’ and it is best explained in a table:

Given this input

Produce this output


x2 y










1 1


What the table shows is that there are two inputs (labelled x1 and x2) and one output (labelled y). When x1 and x2 are both set to 0, the output we expect is also 0. Similarly, when x1 and x2 are both set to 1, the output is also 0. However, when x1 and x2 are set to different inputs, then the output will be 1.

The challenge is to build a neural network that can successfully learn to produce the correct output given the four different inputs in the table.

Let’s have a quick look at a graphical representation of the problem:

xor graph

The graph shows the two inputs x1 and x2 on their respective axes. Where x1 and x2 have the same value, the graph shows yellow circles and similarly where x1 and x2 are different, the graph shows blue circles.

There’s an important constraint that the graph shows clearly. It isn’t possible to draw a single straight line across the graph so that the yellow circles are on one side and the blue circles are on the other side. This is called “linear-separability”. So the XOR problem is not linearly separable, which means that we are going to need a multi-layer neural network to solve it.

The diagram below shows a typical configuration for a neural network that can be trained to solve the XOR problem.


There are a number of things to note about about this particular network. Firstly, the inputs in the table above (x1 and x2), are mapped directly onto the nodes represented by a1 and a2. Secondly, this first layer of nodes also contains a bias node, that has its output always set to +1. Thirdly, the nodes in the middle of the diagram are collectively called the hidden nodes or hidden layer. It also contains a bias node set to +1. The outputs of the other nodes are labelled with a small greek letter sigma, which will become clearer below. Lastly, the output of the network is labelled h.

It’s useful to represent the inputs as a vector (a one-dimensional matrix) that looks like this:


This can be translated directly into Octave as:

A1 = [1; 0; 0];

In the above example, 1 represents the bias node, and the two zeros represent the first row of the variables from our table above replacing the a1 and a2 in the vector. You can replace the two zeros with values from other rows on the table to see what happens to the output after we’ve built up the network.

The links from the nodes in the first layer to the nodes in the second layer have weights associated with them, denoted by the letter theta (along with a superscript 1) in the diagram of our network. Similarly, the weights in layer 2 are also shown with a superscript 2.

The subscript numbers identify the nodes at either end of the link, in the form (i,j), where i is the node receiving the signal and j is the node sending the signal. This is slightly counter-intuitive, where the normal expectation is that the signal moves from i to j. The reason for this is that it is possible to represent all the weights at a given layer in a single matrix that looks like this:


Typically, the initial values of the weights in a network are set to random values between -1 and +1, since we have no idea what they actually should be. We can do this in Octave as follows:

THETA1 = 2*rand(2,3) - 1;

And for Θ2, which has 3 nodes linked to the one final node:

THETA2 = 2*rand(1,3) - 1;

So now we can simply multiply those matrices together to work out what the input to the second layer is:


where Z2 represents the input to the second layer of nodes. This multiplication will result in another vector (1 dimensional matrix). Our Octave code for this is simply:

Z2 = THETA1 * A1;

This is a lot better (and faster) than having to calculate each of these inputs separately. In fact, most machine learning libraries will provide fast matrix multiplication (and other matrix operations), precisely because it is an efficient way to model machine learning strategies.

To calculate the output of layer 2, we must apply a function to the input. The typical function used is called a sigmoid function (represented by the sigma in the network diagram) and it looks like this in Octave:

function [result] = sigmoid(x)
    result = 1.0 ./ (1.0 + exp(-x));

So the output of layer 2 is the sigmoid of the input, or


which in Octave is:

A2 = [1; sigmoid(Z2)];

Note that we add the extra 1 as the first element in the vector to represent the bias that will be needed as an input into layer 3.

We repeat the process for layer 3, multiplying the output of layer 2 by the matrix of weights for layer 2 to get the input for layer 3 and then getting the sigmoid of the result:



The output from the network is then a single value, called our hypothesis (h). This is the network’s guess at the output given it’s input. The Octave code for this is:

Z3 = THETA2 * A2;
h = sigmoid(Z3);

That’s the network fully constructed. We can put any values from the table in the front (putting them in the A1 vector) and see what the output from the network is (Hypothesis).

Here is an example of the above code running:


It is almost certain that the network will get the wrong answer (outputting a 1 when it should be outputting a 0 and vice versa). The example above shows h to be 0.31328 (you may get a completely different value), which is clearly wrong, as for a (0,0) input, we should get an output of 0.

In the next post in this series, we’ll look at how to get the network to learn from its mistakes and how it can get much better at outputting correct values.

Here is the next part of this series

Poor use for IBM’s AI Watson: Predicting Popular Sale Items

IBM’s Watson computer, Yorktown Heights, NY – Clockready

IBM’s Watson computer, Yorktown Heights, NY

Last week consumers indulged themselves in a frenzy of buying in the so-called Black Friday and Cyber Monday sales. To go with this, IBM’s Watson published a list of the “hottest 100 consumer electronics”.


The highlights were pretty obvious: Star Wars toys, Hoverboards and Nike Shoes. Nothing particularly insightful there beyond that provided by regular data mining.

This is a poor example of Artificial Intelligence. Compared with advances such as autonomous cars and virtual personal assistants, this example pales and doesn’t really add to our knowledge of how to build intelligent machines. It’s a shame as Watson is being used in far more interesting areas such as healthcare that could have much more benefits than simply pointing out the obvious.

via IBM News room – 2015-11-30 IBM’s Watson Predicts Cyber Monday’s Top Products and Trends – United States.

Playing Go probably won’t lead to better AI

The traditional Chinese board game Go. (Wikipedia)

Go is a traditional Chinese game that is played by two players and involves placing stones on a board in order to capture the opponent’s stones.

Although the rules are quite simple, and there is perfect information, the strategies involved are quite complex. The very best software available today that plays this game is not able to beat professional players for many reasons, including the number of possible moves and the large size of the board (compared to chess).

Now Demis Hassabis, who was behind DeepMind (purchased by Google in 2014), has hinted that his team has managed to get a machine to play the board game Go. Previously, the team has demonstrated an AI learning to play old video games like Breakout. The algorithms have learned to play without help better than most humans, so this team knows a thing or two about designing algorithms to learn to play games.

I’m a little skeptical though about how much conquering Go will advance the field of AI. Back in the 1990s, much excitement and publicity was generated when IBM’s Deep Blue managed to beat the reigning world chess champion, Gary Kasparov. However, this feat was more down to processing power and Deep Blue’s ability to evaluate millions of positions per second than any real advance in machine intelligence.

Certainly Kasparov didn’t play that way and so Deep Blue didn’t really give any great insights into human intelligence. While it may be interesting, beating a human at Go may well turn out to be another example of Artificial Narrow Intelligence.

via Google DeepMind Founder Says AI Machines Have Beat Board Game Go | Re/code.

Go Board Game (Wikipedia)

92% claim an understanding of Artificial Intelligence

It’s no surprise that the vast majority of people claim to understand what Artificial Intelligence is. Popular culture has been a key driver of this coupled with the recent debates on the potential benefits and threats of AI, and the visible successes of Google, Facebook and Apple.

However, there is a disconnect between what the latter have achieved and the portrayal in cinema and literature. Almost all modern AI is specifically focused on particular tasks. This is known as Artificial Narrow Intelligence. It is very effective in doing one thing, but completely useless at anything else. Imagine asking Siri to drive your car for you and you’ll get an understanding of how narrow the intelligence is.

By contrast, AI in movies or even that being discussed as an existential threat is Artificial General Intelligence. Such AGIs would be capable of independent action, motivation and autonomy. This type of AI does not exist to any great extent today and in fact seems to be as far away as when Alan Turing wrote about it over 60 years ago.

There is a good chance that the survey respondents are mistaken in their interpretation of Artificial Intelligence. Unfortunately, this mistake may turn into disillusionment when they find out just how far we have to go before there is a real HAL 9000.

Source: Don’t Be Fooled By Artificial Conventional Wisdom About Artificial Intelligence | Dominic Trigg