The Softmax Function Derivative (Part 3)

Previously I’ve shown how to work out the derivative of the Softmax Function combined with the summation function, typical in artificial neural networks.

In this final part, we’ll look at how the weights in a Softmax layer change with respect to a Loss Function. The Loss Function is a measure of how “bad” the estimate from the network is. We’ll then modify the weights in the network in order to improve the “Loss”, i.e. make it less bad.

The Python code is based on the excellent article by Eli Bendersky which can be found here.

Cross Entropy Loss Function

There are different kinds of Cross Entropy function, depending on the kind of classification you want your network to perform. In this example, we’re going to use the Categorical Cross Entropy, which is typically used when the network must estimate which one of many classes an input belongs to. The output of the Softmax Function is a vector of probabilities, where each element represents the network’s estimate that the input is in that class. For example:

[0.19091352 0.20353145 0.21698333 0.23132428 0.15724743]

The first element, 0.19091352, represents the network’s estimate that the input is in the first class, and so on.

Usually, the input is in one class, and we can represent the correct class for an input as a one-hot vector. In other words, the class vector is all zeros, except for a 1 in the index corresponding to the class.

[0 0 1 0 0]

In this example, the input is in class 3, represented by a 1 in the third element.

The multi-class Cross Entropy Function is defined as follows:

-\sum_{c=1}^{M} y_{o,c} \textup{ log}(S_{o,c})

where M is the number of classes, y_{o,c} is 1 if c is the correct classification for the observation o (i.e. the input) and 0 otherwise, and S_{o,c} is the Softmax output for the class c for the observation o. Here is some code to calculate that (which continues from my previous posts on this topic):

def x_entropy(y, S):
    # y is the one-hot ground-truth vector; S is the Softmax output vector
    return np.sum(-1 * y * np.log(S))

y = np.zeros(5)
y[2] = 1   # picking the third class for example purposes
xe = x_entropy(y, S)
print(xe)

1.5279347484961026
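
Because y is one-hot, the sum collapses to a single term:

-\textup{log}(S_{c}) = -\textup{log}(0.21698333) \approx 1.528

which matches the printed value.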

Cross Entropy Derivative

Just like the other derivatives we’ve looked at before, the Cross Entropy derivative is a vector of partial derivatives with respect to its input:

\frac{\Delta XE}{\Delta S} = \left[  \frac{\delta XE}{\delta S_{1}} \; \frac{\delta XE}{\delta S_{2}} \; \ldots \; \frac{\delta XE}{\delta S_{M}}  \right]

We can make this a little simpler by observing that since Y (the ground-truth classification vector) is all zeros except for the target class c, the Cross Entropy derivative vector will also be all zeros except at class c.

To see why this is the case, let’s examine the Cross Entropy function itself. It is a sum of products, where each product is a value from Y multiplied by the log of the corresponding value from S. Every element of Y other than the target class c is 0, so each of those terms is 0 regardless of the corresponding value in S, and its partial derivative is therefore also 0.

Therefore:

\frac{\Delta XE}{\Delta S} = \left[ \ldots \frac{\delta XE}{\delta S_{c}} \ldots \right]

We can rewrite this a little, expanding out the XE function:

\frac{\Delta XE}{\Delta S} = \left[ \ldots \frac{\delta -(Y_{c}\textup{log}(S_{c}))}{\delta S_{c}} \ldots \right]

We already know that Y_{c} is 1, so we are left with:

\frac{\Delta XE}{\Delta S} = \left[ \ldots \frac{\delta -\textup{log}(S_{c})}{\delta S_{c}} \ldots \right]

So we are just looking for the derivative of the log of S_{c}:

\frac{\Delta XE}{\Delta S} = \left[ \ldots -\frac{1}{S_{c}} \ldots \right]

The rest of the elements in the vector will be 0. Here is the code that works that out:

def xe_dir(y, S):
    # -1/S at the target class (where y is 1), and 0 everywhere else
    return (-1 / S) * y

DXE = xe_dir(y, S)
print(DXE)

[-0.      -0.      -4.60864 -0.      -0.     ]
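
As expected, the only non-zero element is at the target class: -1/0.21698333 \approx -4.60864.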

Bringing it all together

When we have a neural network layer, we want to change the weights in order to make the loss as small as possible. So we are trying to calculate:

\frac{\Delta XE}{\Delta W}

for each of the input instances X. Since XE is a function that depends on the Softmax function, which itself depends on the summation function in the neurons, we can use the calculus chain rule as follows:

\frac{\Delta XE}{\Delta W} = \frac{\Delta XE}{\Delta S} \cdot \frac{\Delta S}{\Delta Z} \cdot \frac{\Delta Z}{\Delta W}

In this post, we’ve calculated \frac{\Delta XE}{\Delta S} and in the previous posts, we calculated \frac{\Delta S}{\Delta Z} and \frac{\Delta Z}{\Delta W}. To calculate the overall changes to the weights, we simply carry out a dot product of all those matrices:

print(np.dot(DXE, DL_shortcut).reshape(W.shape))

[[ 0.01909135  0.09545676  0.07636541  0.02035314  0.10176572]
 [ 0.08141258 -0.07830167 -0.39150833 -0.31320667  0.02313243]
 [ 0.11566214  0.09252971  0.01572474  0.07862371  0.06289897]]
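
As a quick sanity check on the shapes (using the variables from this and the previous posts): DXE has 5 elements, one per class, while DL_shortcut is a 5 × 15 matrix holding each output’s derivative with respect to each of the 15 weights. The dot product therefore has 15 elements, one per weight, which we reshape to the 3 × 5 shape of W.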

Shortcut

Now that we’ve seen how to calculate the individual parts of the derivative, let’s see if there is a shortcut that avoids all that matrix multiplication, especially since many of the elements are zeros.

Previously, we had established that the elements in the matrix \frac{\Delta S}{\Delta W} can be calculated using:

\frac{\delta{S_{t}}}{\delta{W_{ij}}} = S_{t}(1-S_{i})x_{j}

where the input and output indices are the same, and

\frac{\delta{S_{t}}}{\delta{W_{ij}}} = S_{t}(0-S_{i})x_{j}

where they are different.

Using this result, we can see that an element of the derivative of the Cross Entropy function XE with respect to the weights W is (substituting the target class c for t):

\frac{\delta{XE_{c}}}{\delta{W_{ij}}} = \frac{\delta{XE_{c}}}{\delta{S_{c}}} \cdot S_{c}(1-S_{i})x_{j}

We’ve shown above that the derivative of XE with respect to S is just -\frac{1}{S_{c}}. So each element in the derivative where i = c becomes:

\frac{\delta{XE_{c}}}{\delta{W_{ij}}} = -\frac{1}{S_{c}} \cdot S_{c}(1-S_{i})x_{j}

This simplifies to:

\frac{\delta{XE_{c}}}{\delta{W_{ij}}} = (S_{i}-1)x_{j}

Similarly, where i ≠ c:

\frac{\delta{XE_{c}}}{\delta{W_{ij}}} = S_{i}x_{j}

Note that both cases collapse into the single expression (S_{i} - y_{i})x_{j}, since y_{i} is 1 when i = c and 0 otherwise. That is exactly what the code below uses.

Here is the corresponding Python code for that:

def xe_dir_shortcut(W, S, x, y):
    # Each element of the gradient is (S[i] - y[i]) * x[j],
    # where i indexes the output classes and j indexes the inputs
    dir_matrix = np.zeros(W.shape[0] * W.shape[1])

    for i in range(0, W.shape[1]):
        for j in range(0, W.shape[0]):
            dir_matrix[(i*W.shape[0]) + j] = (S[i] - y[i]) * x[j]

    return dir_matrix

delta_w = xe_dir_shortcut(W, h, x, y)

Let’s verify that this gives us the same results as the longer matrix multiplication above:

print(delta_w.reshape(W.shape))

[[ 0.01909135  0.09545676  0.07636541  0.02035314  0.10176572]
 [ 0.08141258 -0.07830167 -0.39150833 -0.31320667  0.02313243]
 [ 0.11566214  0.09252971  0.01572474  0.07862371  0.06289897]]
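
As an aside, the double loop in xe_dir_shortcut can be collapsed into a single NumPy outer product. Here is a minimal sketch, using the same h, x, y and W variables as above:

# The outer product of the error vector (S - y) with the input x gives
# the same (class, input) layout as dir_matrix in xe_dir_shortcut
delta_w_vec = np.outer(h - y, x)
print(delta_w_vec.reshape(W.shape))

This should print the same matrix as the two calculations above.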

Now we have a simple function that calculates the changes to the weights for a seemingly complicated single layer of a neural network.


A Simple Neural Network in Octave – Part 3

This is the final post in a short series looking at implementing a small neural network to solve the XOR problem in Octave.

In the first post, we looked at how to set up the parameters of the network (the weights between the nodes), feed in an example and get the network to predict some output. The second post looked at implementing the back propagation algorithm, which adjusts those parameters in the network to improve the accuracy of its outputs.

Now it’s time to pull those pieces together so that we can let the network adjust the parameters to get as close to the right output as we desire.

First, we’ll set up a function to run all the training examples through the network. Presenting all the training examples once (one after the other) is called an epoch. The goal of this function will be to return the updated parameters after an epoch:

function [THETA1_new, THETA2_new] = xor_nn(XOR, THETA1, THETA2, init_w=0, learn=0, alpha=0.01)

Just a quick note on the notation and parameters in this function header. The first part after the word “function” specifies what this function returns. In our case, we want new versions of the parameters in the network (represented by THETA1 and THETA2).

The name of the function is “xor_nn” and the parameters to the function are contained within the brackets:

  • XOR: the representation of the training set.
  • THETA1 & THETA2: the current values for the parameters in the network.
  • init_w=0: when set to 1, this tells the function to initialise the weights in the network.
  • learn=0: when set to 1, this tells the network to learn from the examples (see below).
  • alpha=0.01: the learning rate (default value is 0.01).

Note that any parameter that has an “=” sign is optional and will get the default value shown if it is not provided explicitly.

The first step in our function is to check if we need to initialize the weights.

if (init_w == 1)
    THETA1 = 2*rand(2,3) - 1;
    THETA2 = 2*rand(1,3) - 1;
endif

This is simply the same code as before, inside an “if” block.

Now we initialize the cost variable. This is done for every epoch:

J = 0.0;

In the previous post, we looked at updating the weights in the network after calculating the cost after the network has processed a single training example. This is a specific type of learning called “online learning”. There is another type of learning called “batch learning”, and this works by updating the weights once after all the training examples have been processed. Let’s implement that instead in our function.

We need to record the number of training examples and the total delta across all the training examples (rather than just across one as before). So here are the extra variables required:

T1_DELTA = zeros(size(THETA1));
T2_DELTA = zeros(size(THETA2));
m = 0;

Remember that THETA1 and THETA2 are matrices, so we need to initialize every element of the matrix, and make our delta matrices the same size. We’ll also use “m” to record the number of training examples that we present to the network in the epoch.

Now let’s set up a loop to present those training examples to the network one by one:

for i = 1:rows(XOR)

This simply says: repeat the following block once for each row in our XOR data set.

Now put in the code from Part 1 that processes an input:

A1 = [1; XOR(i,1:2)'];
Z2 = THETA1 * A1;
A2 = [1; sigmoid(Z2)];
Z3 = THETA2 * A2;
h = sigmoid(Z3);
J = J + ( XOR(i,3) * log(h) ) + ( (1 - XOR(i,3)) * log(1 - h) );
m = m + 1;

Note the slight change in moving the bias node from the layer input calculation (Z3) to the output from the previous layer (A2). This just makes the code slightly simpler.
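
For reference, the quantity accumulated in J (and later averaged and negated by the line J = J / -m) is the binary cross-entropy cost:

J = -\frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)} \textup{ log}(h^{(i)}) + (1 - y^{(i)}) \textup{ log}(1 - h^{(i)}) \right]

where y^{(i)} is the expected output XOR(i,3) and h^{(i)} is the network’s hypothesis for example i.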

Then we add the code from Part 2, inside a test to see if we are in learning mode. The code has been slightly modified as we are implementing batch learning rather than online learning. In batch mode, we want to calculate the errors across all the examples:

if (learn == 1)
    delta3 = h - XOR(i,3);
    delta2 = ((THETA2' * delta3) .* (A2 .* (1 - A2)))(2:end);
    T2_DELTA = T2_DELTA + (delta3 * A2');
    T1_DELTA = T1_DELTA + (delta2 * A1');
else
    disp('Hypothesis for '), disp(XOR(i,1:2)), disp('is '), disp(h);
endif

If we’re not learning from this example, then we simply display the network’s hypothesis for this particular example.
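
In mathematical terms, the two deltas computed above are the standard back-propagation error terms:

\delta^{(3)} = h - y \qquad \delta^{(2)} = \left(\Theta^{(2)\top}\delta^{(3)}\right) \odot A^{(2)} \odot (1 - A^{(2)})

where \odot denotes element-wise multiplication and the first (bias) element of \delta^{(2)} is dropped, which is what the (2:end) indexing in the code does.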

That’s the end of the loop, so we close the block in Octave with:

endfor

Now we calculate the average cost across all the examples:

J = J / -m;

This gives us a useful guide to see if the cost is reducing each time we run the function (which it should!).

Now we’ll update the weights in the network, remembering to divide by the number of training examples as we are in batch mode:

if (learn==1)
    THETA1 = THETA1 - (alpha * (T1_DELTA / m));
    THETA2 = THETA2 - (alpha * (T2_DELTA / m));
else
    disp('J: '), disp(J);
endif
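
This is the standard batch gradient descent update:

\Theta \leftarrow \Theta - \frac{\alpha}{m}\Delta_{\Theta}

where \Delta_{\Theta} is the delta accumulated for that layer’s weights over the epoch.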

Lastly, we’ll set the return values as our new updated weights:

THETA1_new = THETA1;
THETA2_new = THETA2;

And ending the function:

endfunction

So that’s all that’s required to run all the training examples through the network. If you run the function a few times, you should see the cost of the network reducing as we expect it to:

[Screenshot: an Octave session showing the cost J decreasing over a few calls to xor_nn]

If you run the function as shown, you’ll see that the cost (J) reduces each time, but not by a lot. It would be a bit boring to have to run the function each time, so let’s set up a short script to run it many times:

XOR = [0,0,0; 0,1,1; 1,0,1; 1,1,0];
THETA1 = 0;   % placeholder; replaced by the first call below (init_w = 1)
THETA2 = 0;

[THETA1, THETA2] = xor_nn(XOR, THETA1, THETA2, 1, 1, 0.01);
for i = 1:100000
    [THETA1, THETA2] = xor_nn(XOR, THETA1, THETA2, 0, 1, 0.01);
    if (mod(i,1000) == 0)
        disp('Iteration : '), disp(i)
        [THETA1, THETA2] = xor_nn(XOR, THETA1, THETA2);   % learn = 0: just display J and the hypotheses
    endif
endfor

There’s not a lot here that’s new. The first call to the xor_nn function initialises the weights. The loop calls the function 100,000 times, printing out the cost every 1000 iterations. That may seem like a lot of function calls, but remember that the network weights are being adjusted by a small amount each time.

If you run that script, you should see something like this as the output:

[Screenshot: the training script’s output, with the cost J printed every 1000 iterations]

When the loop gets to the end, the output will be close to this (the exact numbers may be different):

[Screenshot: the final hypotheses for the four XOR inputs]

As you can see, the network guesses small numbers (close to 0) for the first and last XOR examples and high (close to 1) for the two middle examples. This is close to what we want the network to do, so we’ve successfully trained this particular network to recognise the XOR function.

If you wish to download the code directly, it’s available here on Github:

https://github.com/StephenOman/Octave/tree/master/xor%20neural%20network

There are a lot of things that we should do to make sure that the algorithm is optimised, including checking that the default learning rate of 0.01 is actually the right rate. But that is a job for another day.

Now that you’ve seen a simple neural network recognizing patterns and learning from examples, you can implement your own in Octave for other more interesting problems.

A Simple Neural Network In Octave – Part 1

Getting started with neural networks can seem to be a daunting prospect, even if you have some programming experience. The many examples on the Internet dive straight into the mathematics of what the neural network is doing or are full of jargon that can make it a little difficult to understand what’s going on, not to mention how to implement it in actual code.

So I decided to write a post to help myself understand the mechanics and it turns out that it will require a few parts to get through it!

Anyway, to start with, there is a great free numerical computation package called Octave that you can use to play around with Machine Learning concepts. Octave itself does not know about neural networks, but it does know how to do fast matrix multiplication. This important feature of Octave will be made clear later.

You can find out more and download the Octave software here:

https://www.gnu.org/software/octave/

A nice toy problem to start with is the XOR problem. XOR means “exclusive OR” and it is best explained in a table:

Given this input (x1, x2), produce this output (y):

x1   x2  |  y
 0    0  |  0
 0    1  |  1
 1    0  |  1
 1    1  |  0

What the table shows is that there are two inputs (labelled x1 and x2) and one output (labelled y). When x1 and x2 are both set to 0, the output we expect is also 0. Similarly, when x1 and x2 are both set to 1, the output is also 0. However, when x1 and x2 are set to different inputs, then the output will be 1.

The challenge is to build a neural network that can successfully learn to produce the correct output given the four different inputs in the table.

Let’s have a quick look at a graphical representation of the problem:

[Figure: the four XOR inputs plotted on the x1 and x2 axes; inputs with equal values shown as yellow circles and inputs with different values as blue circles]

The graph shows the two inputs x1 and x2 on their respective axes. Where x1 and x2 have the same value, the graph shows yellow circles and similarly where x1 and x2 are different, the graph shows blue circles.

There’s an important constraint that the graph shows clearly. It isn’t possible to draw a single straight line across the graph so that the yellow circles are on one side and the blue circles are on the other. This property is called linear separability. The XOR problem is not linearly separable, which means that we are going to need a multi-layer neural network to solve it.

The diagram below shows a typical configuration for a neural network that can be trained to solve the XOR problem.

[Diagram: a three-layer network with input nodes a1 and a2 plus a bias node, a hidden layer of sigmoid nodes with its own bias node, and a single output h]

There are a number of things to note about this particular network. Firstly, the inputs in the table above (x1 and x2) are mapped directly onto the nodes represented by a1 and a2. Secondly, this first layer of nodes also contains a bias node that has its output always set to +1. Thirdly, the nodes in the middle of the diagram are collectively called the hidden nodes or hidden layer; it also contains a bias node set to +1. The outputs of the other nodes are labelled with a small Greek letter sigma, which will become clearer below. Lastly, the output of the network is labelled h.

It’s useful to represent the inputs as a vector (a one-dimensional matrix) that looks like this:

A^{(1)} = \begin{bmatrix} +1 \\ a_{1} \\ a_{2} \end{bmatrix}

This can be translated directly into Octave as:

A1 = [1; 0; 0];

In the above example, the 1 represents the bias node, and the two zeros are the values of x1 and x2 from the first row of the table, taking the place of a1 and a2 in the vector. You can replace the two zeros with values from other rows of the table to see what happens to the output after we’ve built up the network.

The links from the nodes in the first layer to the nodes in the second layer have weights associated with them, denoted by the letter theta (along with a superscript 1) in the diagram of our network. Similarly, the weights in layer 2 are also shown with a superscript 2.

The subscript numbers identify the nodes at either end of the link, in the form (i,j), where i is the node receiving the signal and j is the node sending it. This is slightly counter-intuitive, since the normal expectation would be that the signal moves from i to j. The reason is that it makes it possible to represent all the weights at a given layer in a single matrix that looks like this:

\Theta^{(1)} = \begin{bmatrix} \theta^{(1)}_{1,0} & \theta^{(1)}_{1,1} & \theta^{(1)}_{1,2} \\ \theta^{(1)}_{2,0} & \theta^{(1)}_{2,1} & \theta^{(1)}_{2,2} \end{bmatrix}

Typically, the initial values of the weights in a network are set to random values between -1 and +1, since we have no idea what they actually should be. We can do this in Octave as follows:

THETA1 = 2*rand(2,3) - 1;

And for Θ2, which has 3 nodes linked to the one final node:

THETA2 = 2*rand(1,3) - 1;
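
(The expression 2*rand(n,m) - 1 simply rescales rand’s output from the interval [0, 1) to [-1, 1).)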

So now we can simply multiply those matrices together to work out what the input to the second layer is:

Z^{(2)} = \Theta^{(1)} A^{(1)}

where Z2 represents the input to the second layer of nodes. This multiplication will result in another vector (1 dimensional matrix). Our Octave code for this is simply:

Z2 = THETA1 * A1;

This is a lot better (and faster) than having to calculate each of these inputs separately. In fact, most machine learning libraries will provide fast matrix multiplication (and other matrix operations), precisely because it is an efficient way to model machine learning strategies.

To calculate the output of layer 2, we must apply a function to the input. The typical function used is called a sigmoid function (represented by the sigma in the network diagram) and it looks like this in Octave:

function [result] = sigmoid(x)
    result = 1.0 ./ (1.0 + exp(-x));
end
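
In mathematical notation, this is the logistic function:

\sigma(x) = \frac{1}{1 + e^{-x}}

which squashes any real-valued input into the range (0, 1).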

So the output of layer 2 is the sigmoid of the input, or

A^{(2)} = \sigma(Z^{(2)})

which in Octave is:

A2 = [1; sigmoid(Z2)];

Note that we add the extra 1 as the first element in the vector to represent the bias that will be needed as an input into layer 3.

We repeat the process for layer 3, multiplying the output of layer 2 by the matrix of weights for layer 2 to get the input for layer 3 and then getting the sigmoid of the result:

Z^{(3)} = \Theta^{(2)} A^{(2)}

h = \sigma(Z^{(3)})

The output from the network is then a single value, called our hypothesis (h). This is the network’s guess at the output given its input. The Octave code for this is:

Z3 = THETA2 * A2;
h = sigmoid(Z3);

That’s the network fully constructed. We can put any row of values from the table into the front of the network (placing them in the A1 vector) and see what output the network produces (the hypothesis h).

Here is an example of the above code running:

[Screenshot: an Octave session running the forward pass for the input (0,0), producing h = 0.31328]

It is almost certain that the network will get the wrong answer (outputting a 1 when it should be outputting a 0 and vice versa). The example above shows h to be 0.31328 (you may get a completely different value), which is clearly wrong, as for a (0,0) input, we should get an output of 0.

In the next post in this series, we’ll look at how to get the network to learn from its mistakes and how it can get much better at outputting correct values.

Here is the next part of this series

Machine Learning for the masses: Google’s TensorFlow

The Artificial Intelligence community was abuzz recently with the news that Google has open-sourced its machine learning framework, called TensorFlow. This system was created by the Google Brain Team, working in its Machine Intelligence Research group.

This is not the first open source machine learning framework. Within the Python environment in particular, there are frameworks such as scikit-learn, PyBrain and others that have been around for a good while. What’s different about this new framework is that it has the backing of one of the most advanced commercial machine learning organisations, Google. In committing the project to open-source, it is inviting researchers, commercial practitioners and hobbyists to contribute to the framework. With Google’s backing, it seems destined for a long life.

But back to today. The framework has both Python and C++ APIs, with the expectation that C++ will be slightly faster on certain tasks. The instructions for installing TensorFlow are straightforward, but immediately I ran into a problem. My (slightly ageing) MacBook was running Python 2.7.5 and running TensorFlow caused a segmentation fault. Updating to Python 2.7.10 fixed the problem and I was able to successfully run through some of the tutorials.

There seems to be a wide range of neural network capabilities already available within the framework which provides much opportunity for exploration and experimentation. The tutorials cover areas such as handwriting recognition, image classification (using convolutional neural networks) and language modelling (using recurrent neural networks).

What’s also interesting is that since it’s an open source framework, the underlying code behind all these machine learning techniques is available for anyone to download, examine, modify and improve.

What the long-term impact of this will be is hard to tell. However, it is clear that Google has already put quite a bit of effort into this framework, and now that it’s out in the open, there will be lots more improvement to come.

If you want to know more and perhaps even try it out yourself, you can download TensorFlow here.

Why Businesses Embrace Machine Learning

Scientific American has published an excerpt from The Master Algorithm, by Pedro Domingos that’s worth a read.

Given that we are living in the Age of the Algorithm, it’s worth knowing just how pervasive they are. Every post on Facebook you see, every search result from Google, every advert on every website: all are controlled by invisible – and unaccountable – algorithms.

Once the inevitable happens and learning algorithms become the middlemen, power becomes concentrated in them. Google’s algorithms largely determine what information you find, Amazon’s what products you buy, and Match.com’s who you date.

Power will accrue to those who have the best algorithms. If you want to know why and how, Domingos’s book is definitely one to add to your Christmas booklist.

via Why Businesses Embrace Machine Learning [Excerpt] – Scientific American.

Data Mining Algorithms Explained In Plain Language

Here’s a really great resource. Raymond Li from Microsoft has written an explanation of the top 10 data mining algorithms, but in plain language. These algorithms are used a lot in machine learning too.

So if you are confused about Naive Bayes or Support Vector Machines, then take a look at Ray’s easy to understand explanations.

Top 10 data mining algorithms in plain English | rayli.net.

Ambient Intelligence

This is an interesting idea: Ambient Intelligence.

“an ever-present digital fog in tune with our behavior and physiological state”

As AI slowly improves in particular domains, we will see the techniques and algorithms incorporated into previously static control systems.

For example, your household heating system could learn the best way to adjust the temperature in the most efficient way, having learned what you like. Other appliances will have some form of learning built in too.

So what we will see will be the widespread adoption of Artificial Narrow Intelligence. There won’t be one Artificial General Intelligence in your house, just lots of narrow ones.

The problem I see with this is the lack of serendipity. What a bland world it would be if we were surrounded by devices which made the world just perfect for us with no surprises. How would we ever be enticed to step outside our comfort zones? We would replicate our “echo chamber” online experience in the real world.

Here’s the full article by Neil Howe (@lifecourse):

Artificial Intelligence Paves The Way For Ambient Intelligence – Forbes.

The Machine Vision Algorithm Beating Art Historians at Their Own Game

This is an interesting application of image processing. The machine learning algorithms are trained on a subset of paintings taken from a data set of more than 80,000. The resulting feature set has over 400 dimensions.

When presented with a painting it has not seen before, it correctly guessed the artist more than 60% of the time. It has also detected additional links between different styles and periods:

It links expressionism and fauvism, which might be expected given that the latter movement is often thought of as a type of expressionism. It links the mannerist and Renaissance styles, which clearly reflects the fact that mannerism is a form of early Renaissance painting.

However, it also apparently confuses certain styles:

… it often confuses examples of abstract expressionism and action paintings, in which artists drip or fling paint and step on the canvas. Saleh and Elgammal [the creators of the ML algorithms] … say that this kind of mix-up would be entirely understandable for a human viewer. “‘Action painting’ is a type or subgenre of ‘abstract expressionism’,” they point out.

Of course, this could also mean that the machine is correct and the different “genres” of abstract painting are completely arbitrary. But what it does highlight is that machine learning has a way to go before it can start offering subjective opinions.

via The Machine Vision Algorithm Beating Art Historians at Their Own Game | MIT Technology Review.