# Solving MNIST with a Neural Network from the ground up

I’ve written before about the benefits of reinventing the wheel and this is one of those occasions where it was definitely worth the effort. Sometimes, there is just no substitute for trying to implement an algorithm to really understand what’s going on under the hood. This is especially true when learning about artificial neural networks. Sure, there are plenty of frameworks available that you can use which implement any flavour of neural network, complete with a dazzling arrays of optimisations, activations and loss functions. That may solve your problem, but it abstracts away a lot of the details about why it solves it.

MNIST is a great dataset to start with. It’s a collection of images containing 60,000 handwritten digits. It also contains a further 10,000 images that can be used as the test set. It’s been well studied and most frameworks have sample implementations. Here’s an example image:

You can find the full dataset of images on Yann Le Cun’s website.

While it’s useful to reinvent the wheel, we should at least learn from those that have already built wheels before. The first thing I borrowed was the network architecture from TensorFlow. Their example has:

• 28×28 input
• a hidden layer with 512 neurons with ReLU activation
• an output layer with 10 neutrons (representing the 10 possible digits) with Softmax activation
• Cross-Entropy loss function

The next thing to work on was the feedforward part of the network. This is relatively straightforward as these functions are well documented online and the network itself isn’t complicated.

The tough part was working through the back-propagation algorithm. In a previous post, I detailed how to work out the derivatives of the Softmax function and the Cross Entropy loss. The most obvious way is to use the Chain Rule in Differential Calculus to work out the gradients and propagate them back through the network. The steps are pleasing to my eye and appeal to my sense of order in code. (Tip: Use a spreadsheet on a small example network to see the actual matrices in action.)

But (and it’s a big but), the basic approach uses Jacobian matrices. Each cell in these kind of matrices is a partial derivative; each matrix represents a change in every variable with respect to every output. As a result, they can grow rather large very quickly. We run into several issues multiplying very large matrices together. In the notebook, I’ve left the functions representing this approach in for comparison and if you do run it, you’ll notice immediately the problems with speed and memory.

Luckily there are shortcuts, which mean that we can directly calculate the gradients without resorting to Jacobian matrix multiplication. You can see these in the Short Form section of the notebook. In a sense though, these are abstractions too and it’s difficult to see the back-propagation from the shortcut methods.

Lastly, I’ve implemented some code to gather the standard metrics for evaluating how good a machine learning model is. I’ve run it several times and it usually gets an overall accuracy score of between 92% and 95% on the MNIST test dataset.

One of the main things I learned from this exercise is that the actual coding of a network is relatively simple. The really hard part that took a while was figuring out the calculus and especially the shortcuts. I really appreciate now why those frameworks are popular and make coding neural networks so much easier.

If you fancy a challenge, I can recommend working on a neural network from first principles. You never know what you might learn!

# The Softmax Function Derivative (Part 3)

Previously I’ve shown how to work out the derivative of the Softmax Function combined with the summation function, typical in artificial neural networks.

In this final part, we’ll look at how the weights in a Softmax layer change in respect to a Loss Function. The Loss Function is a measure of how “bad” the estimate from the network is. We’ll then be modifying the weights in the network in order to improve the “Loss”, i.e. make it less bad.

The Python code is based on the excellent article by Eli Bendersky which can be found here.

## Cross Entropy Loss Function

There are different kinds Cross Entropy functions depending on what kind of classification that you want your network to estimate. In this example, we’re going to use the Categorical Cross Entropy. This function is typically used when the network is required to estimate which class something belongs to, when there are many classes. The output of the Softmax Function is a vector of probabilities, each element represents the network’s estimate that the input is in that class. For example:

[0.19091352 0.20353145 0.21698333 0.23132428 0.15724743]


The first element, 0.19091352, represents the network’s estimate that the input is in the first class, and so on.

Usually, the input is in one class, and we can represent the correct class for an input as a one-hot vector. In other words, the class vector is all zeros, except for a 1 in the index corresponding to the class.

[0 0 1 0 0]


In this example, the input is in class 3, represented by a 1 in the third element.

The multi-class Cross Entropy Function is defined as follows: $-\sum_{c=1}^M=y_{o,c} \textup{ log}(S_{o,c})$

where M is the number of classes, y is the one-hot vector representing the correct classification c for the observation o (i.e. the input). S is the Softmax output for the class c for the observation o. Here is some code to calculate that (which continues from my previous posts on this topic):

def x_entropy(y, S):
return np.sum(-1 * y * np.log(S))

y = np.zeros(5)
y = 1   # picking the third class for example purposes
xe = x_entropy(y, S)
print(xe)

1.5279347484961026


## Cross Entropy Derivative

Just like the other derivatives we’ve looked at before, the Cross-Entropy derivative is a vector of partial derivatives with respect to it’s input: $\frac{\Delta XE}{\Delta S} = \left[ \frac{\delta XE}{\delta S_{1}} \frac{\delta XE}{\delta S_{2}} \ldots \frac{\delta XE}{\delta S_{t}} \right]$

We can make this a little simpler by observing that since Y (i.e. the ground truth classification vector) is zeros, except for the target class, c, then the Cross Entropy derivative vector is also going to be zeros, except for the class c.

To see why this is the case, let’s examine the Cross Entropy function itself. We calculate it by summing up a product. Each product is the value from Y multiplied by the log of the corresponding value from S. Since all the elements in Y are actually 0 (except for the target class, c), then the corresponding derivative will also be 0. No matter how much we change the values in S, the result will still be 0.

Therefore: $\frac{\Delta XE}{\Delta S} = \left[ \ldots \frac{\delta XE}{\delta S_{t}} \ldots \right]$

We can rewrite this a little, expanding out the XE function: $\frac{\Delta XE}{\Delta S} = \left[ \ldots \frac{\delta -(Y_{c}\textup{log}(S_{c}))}{\delta S_{c}} \ldots \right]$

We already know that $Y_{c}$ is 1, so we are left with: $\frac{\Delta XE}{\Delta S} = \left[ \ldots \frac{\delta -\textup{log}(S_{c})}{\delta S_{c}} \ldots \right]$

So we are just looking for the derivative of the log of $S_{c}$: $\frac{\Delta XE}{\Delta S} = \left[ \ldots -\frac{1}{S_{c}} \ldots \right]$

The rest of the elements in the vector will be 0. Here is the code that works that out:

def xe_dir(y, S):
return (-1 / S) * y

DXE = xe_dir(y, S)
print(DXE)

[-0.      -0.      -4.60864 -0.      -0.     ]


## Bringing it all together

When we have a neural network layer, we want to change the weights in order to make the loss as small as possible. So we are trying to calculate: $\frac{\Delta XE}{\Delta W}$

for each of the input instances X. Since XE is a function that depends on the Softmax function, which itself depends on the summation function in the neurons, we can use the calculus chain rule as follows: $\frac{\Delta XE}{\Delta W} = \frac{\Delta XE}{\Delta S} \cdot \frac{\Delta S}{\Delta Z} \cdot \frac{\Delta Z}{\Delta W}$

In this post, we’ve calculated $\frac{\Delta XE}{\Delta S}$ and in the previous posts, we calculated $\frac{\Delta S}{\Delta Z}$ and $\frac{\Delta Z}{\Delta W}$. To calculate the overall changes to the weights, we simply carry out a dot product of all those matrices:

print(np.dot(DXE, DL_shortcut).reshape(W.shape))

[[ 0.01909135  0.09545676  0.07636541  0.02035314  0.10176572]
[ 0.08141258 -0.07830167 -0.39150833 -0.31320667  0.02313243]
[ 0.11566214  0.09252971  0.01572474  0.07862371  0.06289897]]


## Shortcut

Now that we’ve seen how to calculate the individual parts of the derivative, we can now look to see if there is a shortcut that avoids all that matrix multiplication, especially since there are lots of zeros in the elements.

Previously, we had established that the elements in the matrix $\frac{\Delta S}{\Delta W}$ can be calculated using: $\frac{\delta{S_{t}}}{\delta{W_{ij}}} = S_{t}(1-S_{i})x_{j}$

where the input and output indices are the same, and $\frac{\delta{S_{t}}}{\delta{W_{ij}}} = S_{t}(0-S_{i})x_{j}$

where they are different.

Using this result, we can see that an element in the derivative of the Cross Entropy function XE, with respect to the weights W is (swapping c for t): $\frac{\delta{XE_{c}}}{\delta{W_{ij}}} = \frac{\delta{XE_{c}}}{\delta{S_{c}}} \cdot S_{c}(1-S_{i})x_{j}$

We’ve shown above that the derivative of XE with respect to S is just $-\frac{1}{S_{c}}$. So each element in the derivative where i = c becomes: $\frac{\delta{XE_{c}}}{\delta{W_{ij}}} = -\frac{1}{S_{c}} \cdot S_{c}(1-S_{i})x_{j}$

This simplifies to: $\frac{\delta{XE_{c}}}{\delta{W_{ij}}} = (S_{i}-1)x_{j}$

Similarly, where i <> c: $\frac{\delta{XE_{c}}}{\delta{W_{ij}}} = (S_{i})x_{j}$

Here is the corresponding Python code for that:

def xe_dir_shortcut(W, S, x, y):
dir_matrix = np.zeros((W.shape * W.shape))

for i in range(0, W.shape):
for j in range(0, W.shape):
dir_matrix[(i*W.shape) + j] = (S[i] - y[i]) * x[j]

return dir_matrix

delta_w = xe_dir_shortcut(W, h, x, y)


Let’s verify that this gives us the same results as the longer matrix multiplication above:

print(delta_w.reshape(W.shape))

[[ 0.01909135  0.09545676  0.07636541  0.02035314  0.10176572]
[ 0.08141258 -0.07830167 -0.39150833 -0.31320667  0.02313243]
[ 0.11566214  0.09252971  0.01572474  0.07862371  0.06289897]]


Now we have a simple function that will calculate the changes to the weights for a seemingly complicated single-layer of a neural network.

# The Softmax Function Derivative (Part 2)

In a previous post, I showed how to calculate the derivative of the Softmax function. This function is widely used in Artificial Neural Networks, typically in final layer in order to estimate the probability that the network’s input is in one of a number of classes.

In this post, I’ll show how to calculate the derivative of the whole Softmax Layer rather than just the function itself.

The Python code is based on the excellent article by Eli Bendersky which can be found here.

## The Softmax Layer

A Softmax Layer in an Artificial Neural Network is typically composed of two functions. The first is the usual sum of all the weighted inputs to the layer. The output of this is then fed into the Softmax function which will output the probability distribution across the classes we are trying to predict. Here’s an example with three inputs and five classes:

For a given output zi, the calculation is very straightforward: $z_{i}=w_{i1}x_{1}+w_{i2}x_{2}+...+w_{in}x_{n}$

We simply multiply each input to the node by it’s corresponding weight. Expressing this in vector notation gives us the familiar: $\textbf{z}=\textbf{w}^{\textbf{T}}\textbf{x}$

The vector w is two dimensional so it’s actually a matrix and we can visualise the formula for our example as follows:

I’ve already covered the Softmax Function itself in the previous post, so I’ll just repeat it here for completeness: $\sigma(z_{i})=\frac{e^{z_{i}}} {\sum_{j=1}^K e^{z_{j}}}$

Here’s the python code for that:

import numpy as np

# input vector
x = np.array([0.1,0.5,0.4])

# using some hard coded values for the weights
# rather than random numbers to illustrate how
# it works
W = np.array([[0.1, 0.2, 0.3, 0.4, 0.5],
[0.6, 0.7, 0.8, 0.9, 0.1],
[0.11, 0.12, 0.13, 0.14, 0.15]])

# Softmax function
def softmax(Z):
eZ = np.exp(Z)
sm = eZ / np.sum(eZ)
return sm

Z = np.dot(np.transpose(W), x)
h = softmax(Z)
print(h)


Which should give us the output h (the hypothesis):

[0.19091352 0.20353145 0.21698333 0.23132428 0.15724743]


## Calculating the Derivative

The Softmax layer is a combination of two functions, the summation followed by the Softmax function itself. Mathematically, this is usually written as: $h = S(Z(x))$

The next thing to note is that we will be trying to calculate the change in the hypothesis h with respect to changes in the weights, not the inputs. The overall derivative of the layer that we are looking for is: $h' = \frac{d\textbf{S}}{d\textbf{w}}$

We can use the differential chain rule to calculate the derivative of the layer as follows: $\frac{d\textbf{S}}{d\textbf{w}} = \frac{d\textbf{S}}{d\textbf{Z}} \cdot \frac{d\textbf{Z}}{d\textbf{w}}$

In the previous post, I showed how to work out dS/dZ and just for completeness, here is a short Python function to carry out the calculation:

def sm_dir(S):
S_vector = S.reshape(S.shape,1)
S_matrix = np.tile(S_vector,S.shape)
S_dir = np.diag(S) - (S_matrix * np.transpose(S_matrix))
return S_dir

DS = sm_dir(h)
print(DS)


The output of that function is a matrix as follows:

[[ 0.154465 -0.038856 -0.041425 -0.044162 -0.030020]
[-0.038856  0.162106 -0.044162 -0.047081 -0.032004]
[-0.041425 -0.044162 0.1699015 -0.050193 -0.034120]
[-0.044162 -0.047081 -0.050193  0.177813 -0.036375]
[-0.030020 -0.032004 -0.034120 -0.036375  0.132520]]


### Derivative of Z

Let’s next look at the derivative of the function Z() with respect to W, dZ/dW. We are trying to find the change in each of the elements of Z(), zk when each of the weights wij are changed.

So right away, we are going to need a matrix to hold all of those values. Let’s assume that the output vector of Z() has K elements. There are (i $\cdot$ j) individual weights in W. Therefore, our matrix of derivatives is going to be of dimensions (K, (i $\cdot$ j)). Each of the elements of the matrix will be a partial derivative of the output zk with respect to the particular weight wij: $\frac{\delta{S}}{\delta{x}} = \left [ \begin{array}{ccc} \frac{\delta{z_{1}}}{\delta{w_{11}}} & \ldots & \frac{\delta{z_{1}}} {\delta{w_{53}}} \\ \ldots & \frac{\delta{z_{k}}}{\delta{w_{ij}}} & \ldots \\ \frac{\delta{z_{K}}}{\delta{w_{11}}} & \ldots & \frac{\delta{z_{K}}} {\delta{w_{53}}} \end{array} \right ]$

Taking one of those elements, using our example above, we can see how to work out the derivative: $z_{1} = w_{11}x_{1} + w_{12}x_{2} + w_{13}x_{3}$

None of the other weights are used in z1. The partial derivative of z1 with respect to w11 is x1. Likewise, the partial derivative of z1 with respect to w12 is x2, and with respect to w13 is x3. The derivative of z1 with respect to the rest of the weights is 0.

This makes the whole matrix rather simple to derive, since it is mostly zeros. Where the elements are not zero (i.e. where i = k), then the value is xj. Here is the corresponding Python code to calculate that matrix.

# derivative of the Summation Function Z w.r.t weight matrix W given inputs x

def z_dir(Z, W, x):
dir_matrix = np.zeros((W.shape * W.shape, Z.shape))

for k in range(0, Z.shape):
for i in range(0, W.shape):
for j in range(0, W.shape):
if i == k:
dir_matrix[(i*W.shape) + j][k] = x[j]

return dir_matrix


If we use the example above, then the derivative matrix will look like this:

DZ = z_dir(Z, W, x)
print(DZ)

[[0.1 0.  0.  0.  0. ]
[0.5 0.  0.  0.  0. ]
[0.4 0.  0.  0.  0. ]
[0.  0.1 0.  0.  0. ]
[0.  0.5 0.  0.  0. ]
[0.  0.4 0.  0.  0. ]
[0.  0.  0.1 0.  0. ]
[0.  0.  0.5 0.  0. ]
[0.  0.  0.4 0.  0. ]
[0.  0.  0.  0.1 0. ]
[0.  0.  0.  0.5 0. ]
[0.  0.  0.  0.4 0. ]
[0.  0.  0.  0.  0.1]
[0.  0.  0.  0.  0.5]
[0.  0.  0.  0.  0.4]]


Going back to the formula for the derivative of the Softmax Layer: $\frac{d\textbf{S}}{d\textbf{W}} = \frac{d\textbf{S}}{d\textbf{Z}} \cdot \frac{d\textbf{Z}}{d\textbf{W}}$

We now just take the dot product of both of the derivative matrices to get the derivative for the whole layer:

DL = np.dot(DS, np.transpose(DZ))
print(DL)

[[ 0.01544  0.07723  0.06178 -0.00388 -0.01942 -0.01554
-0.00414 -0.02071 -0.01657 -0.00441 -0.02208 -0.01766
-0.00300 -0.01501 -0.01200]
[-0.00388 -0.01942 -0.01554  0.01621  0.0810   0.06484
-0.00441 -0.02208 -0.01766 -0.00470 -0.02354 -0.01883
-0.00320 -0.01600 -0.01280]
[-0.00414 -0.02071 -0.01657 -0.00441 -0.02208 -0.01766
0.01699  0.08495  0.06796 -0.00501 -0.02509 -0.02007
-0.00341 -0.01706 -0.01364]
[-0.00441 -0.02208 -0.01766 -0.00470 -0.02354 -0.01883
-0.00501 -0.02509 -0.02007  0.01778  0.08890  0.07112
-0.00363 -0.01818 -0.01455]
[-0.00300 -0.01501 -0.01200 -0.00320 -0.01600 -0.01280
-0.00341 -0.01706 -0.01364 -0.00363 -0.01818 -0.01455
0.01325  0.06626  0.05300]]


## Shortcut!

While it is instructive to see the matrices being derived explicitly, it is possible to manipulate the formulas to make it easier. Starting with one of the entries in the matrix DL, it looks like this: $\frac{\delta{s_{t}}}{\delta{w_{ij}}} = \sum_{k=1}^K D_{k}S_{t} \cdot D_{ij}Z_{k}$

Since the matrix dZ/dW is mostly zeros, then we can try to simplify it. dZ/dW is non-zero when i = k, and then it is equal to xj as we worked out above. So we can simplify the non-zero entries to: $\frac{\delta{s_{t}}}{\delta{w_{ij}}} = D_{i}S_{t}x_{j}$

In the previous post, we established that when the indices are the same, then: $D_{i}S_{j} = S_{j}(1-S_{i})$

So: $\frac{\delta{s_{t}}}{\delta{w_{ij}}} = S_{t}(1-S_{i})x_{j}$

When the indices are not the same, we use: $\frac{\delta{s_{t}}}{\delta{w_{ij}}} = S_{t}(0-S_{i})x_{j}$

What these two formulas show is that it is possible to calculate each of the entries in the derivative matrix by using only the input values X and the Softmax output S, skipping the matrix dot product altogether.

Here is the Python code corresponding to that:

def l_dir_shortcut(W, S, x):
dir_matrix = np.zeros((W.shape * W.shape, W.shape))

for t in range(0, W.shape):
for i in range(0, W.shape):
for j in range(0, W.shape):
dir_matrix[(i*W.shape) + j][t] = S[t] * ((i==t) - S[i]) * x[j]

return dir_matrix

DL_shortcut = np.transpose(l_dir_shortcut(W, h, x))


To verify that, we can cross check it with the matrix we derived from first principle:

print(DL_shortcut)

[[ 0.01544  0.07723  0.06178 -0.00388 -0.01942 -0.01554
-0.00414 -0.02071 -0.01657 -0.00441 -0.02208 -0.01766
-0.00300 -0.01501 -0.01200]
[-0.00388 -0.01942 -0.01554  0.01621  0.08105  0.06484
-0.00441 -0.02208 -0.01766 -0.00470 -0.02354 -0.01883
-0.00320 -0.01600 -0.01280]
[-0.00414 -0.02071 -0.01657 -0.00441 -0.02208 -0.01766
0.01699  0.08495  0.06796 -0.00501 -0.02509 -0.02007
-0.00341 -0.01706 -0.01364]
[-0.00441 -0.02208 -0.01766 -0.00470 -0.02354 -0.01883
-0.00501 -0.02509 -0.02007  0.01778  0.08890  0.07112
-0.00363 -0.01818 -0.01455]
[-0.00300 -0.01501 -0.01200 -0.00320 -0.01600 -0.01280
-0.00341 -0.01706 -0.01364 -0.00363 -0.01818 -0.01455
0.01325  0.06626  0.05300]]


Lastly, it’s worth noting that in order to actually modify each of the weights, we need to sum up the individual adjustments in each of the corresponding columns.

# The Softmax Function Derivative (Part 1)

### Introduction

This post demonstrates the calculations behind the evaluation of the Softmax Derivative using Python. It is based on the excellent article by Eli Bendersky which can be found here.

### The Softmax Function

The softmax function simply takes a vector of N dimensions and returns a probability distribution also of N dimensions. Each element of the output is in the range (0,1) and the sum of the elements of N is 1.0.

Each element of the output is given by the formula: $\sigma(z_{i})=\frac{e^{z_{i}}} {\sum_{j=1}^K e^{z_{j}}}$

See https://en.wikipedia.org/wiki/Softmax_function for more details.

import numpy as np
x = np.random.random()

def softmax_basic(z):
exps = np.exp(z)
sums = np.sum(exps)
return np.divide(exps, sums)

softmax_basic(x)

This should generate an output that looks something like this:

array([0.97337094, 0.85251098, 0.62495691, 0.63957056, 0.6969253 ])

We expect that the sum of those will be (close to) 1.0:

np.sum(softmax_basic(x))

And so it is:

1.0

### Calculating the derivative

We need to calculate the partial derivative of the probability outputs $S_{i}$ with respect to each of the inputs $x_{j}$. For example: $\frac{\delta{S_{i}}}{\delta{x_{j}}}=\frac{\delta{\frac{e^{x_{i}}} {\sum_{k=1}^N e^{x_{k}}}}}{\delta{x_{j}}}$

Following Bendersky’s derivation, we need to use the quotient rule for derivatives: $\text{If }f(x)=\frac{g(x)}{h(x)} \text{ then:}$ $f'(x)=\frac{g'(x)h(x)-h'(x)g(x)}{[h(x)]^{2}}$

From the Softmax function: $\begin{array}{rcl} g_{i} & = & e^{x_{i}}\\ h_{i} & = & \sum_{k=1}^N e^{x_{k}} \end{array}$

The derivatives of these functions with respect to $x_{j}$ are: \frac{\delta{g_{i}}}{\delta{x_{j}}} = \left \{ \begin{aligned} &e^{x_{j}}, && \text{if}\ i=j \\ &0, && \text{otherwise} \end{aligned} \right.

and $\frac{\delta{h_{i}}}{\delta{x_{j}}}=\frac{\delta(e^{x_{1}}+e^{x_{2}}+...+e^{x_{N}})}{\delta{x_{j}}}=e^{x_{j}}$

Now we have to evalutate the quotient rule for the two seperate cases where $i=j$ and where $i\neq{j}$:

Starting with $i=j$: $\frac{\delta{\frac{e^{x_{i}}} {\sum_{k=1}^N e^{x_{k}}}}}{\delta{x_{j}}}=\frac{e^{x_{j}}\sum_{k=1}^N e^{x_{k}}-e^{x_{j}}e^{x_{i}}}{\left[\sum_{k=1}^N e^{x_{k}}\right]^{2}}$

Now we’ll just simplify this a bit: $\begin{array}{rcl} \frac{\delta{\frac{e^{x_{i}}} {\sum_{k=1}^N e^{x_{k}}}}}{\delta{x_{j}}} & = & \frac{e^{x_{j}}\left(\sum_{k=1}^N e^{x_{k}}-e^{x_{i}}\right)}{\left[\sum_{k=1}^N e^{x_{k}}\right]^{2}}\\ &=&\frac{e^{x_{j}}}{\sum_{k=1}^N e^{x_{k}}} \dot{}\frac{\sum_{k=1}^N e^{x_{k}}-e^{x_{i}}}{\sum_{k=1}^N e^{x_{k}}}\\ &=&\frac{e^{x_{j}}}{\sum_{k=1}^N e^{x_{k}}} \left(\frac{\sum_{k=1}^N e^{x_{k}}}{\sum_{k=1}^N e^{x_{k}}}-\frac{e^{x_{i}}}{\sum_{k=1}^N e^{x_{k}}}\right)\\ &=&\frac{e^{x_{j}}}{\sum_{k=1}^N e^{x_{k}}} \left(1-\frac{e^{x_{i}}}{\sum_{k=1}^N e^{x_{k}}}\right)\\ &=&\sigma(x_{j})(1-\sigma(x_{i})) \end{array}$

For the case where $i\neq j$: $\begin{array}{rcl} \frac{\delta{\frac{e^{x_{i}}} {\sum_{k=1}^N e^{x_{k}}}}}{\delta{x_{j}}}&=&\frac{0-e^{x_{j}}e^{x_{i}}}{\left[\sum_{k=1}^N e^{x_{k}}\right]^{2}}\\ &=&0-\frac{e^{x_{j}}}{\sum_{k=1}^N e^{x_{k}}}\dot{}\frac{e^{x_{i}}}{\sum_{k=1}^N e^{x_{k}}}\\ &=&0-\sigma(x_{j})\sigma(x_{i}) \end{array}$

Here are some examples of those function:

S = softmax_basic(x)

# where i = j
div = S * (1 - S)
print("S :", S, "div :", div)

div = S * (1 - S)
print("S :", S, "div :", div)

div = S * (1 - S)
print("S :", S, "div :", div)

# where i != j
div = 0 - (S * S)
print("S :", S, "S :", S, "div :", div)

div = 0 - (S * S)
print("S :", S, "S :", S, "div :", div)

div = 0 - (S * S)
print("S :", S, "S :", S, "div :", div)

Here’s the output (yours may have different numbers):

S : 0.16205870737014744 div : 0.13579568273566436
S : 0.1530827392322831 div : 0.12964841418142392
S : 0.22661095862407418 div : 0.1752584320555523
S : 0.16205870737014744 S : 0.1530827392322831 div : -0.024808390840665155
S : 0.1530827392322831 S : 0.22661095862407418 div : -0.034690226286226845
S : 0.22661095862407418 S : 0.226342511074585 div : -0.051291693411991836

Let’s tidy up the formula a little so we can come up with a decent implementation in the code.

Looking at the case where $i=j$: $\frac{\delta{S_{i}}}{\delta{x_{j}}}=\sigma(x_{j})(1-\sigma(x_{i}))\\ =\sigma(x_{j}) - \sigma(x_{j})\sigma(x_{i})$

And where $i\neq j$ $\frac{\delta{S_{i}}}{\delta{x_{j}}}=0-\sigma(x_{j})\sigma(x_{i})$

We can generalise that formula by calculating the Jacobian matrix. This matrix will look like this: $\frac{\delta{S}}{\delta{x}} = \left [ \begin{array}{ccc} \frac{\delta{S_{1}}}{\delta{x_{1}}} & \ldots & \frac{\delta{S_{1}}}{\delta{x_{N}}} \\ \ldots & \frac{\delta{S_{i}}}{\delta{x_{j}}} & \ldots \\ \frac{\delta{S_{N}}}{\delta{x_{1}}} & \ldots & \frac{\delta{S_{N}}}{\delta{x_{N}}} \end{array} \right ]$ $\frac{\delta{S}}{\delta{x}} = \left [ \begin{array}{ccc} \sigma(x_{1}) - \sigma(x_{1})\sigma(x_{1}) & \ldots & 0-\sigma(x_{1})\sigma(x_{N}) \\ \ldots & \sigma(x_{j}) - \sigma(x_{j})\sigma(x_{i}) & \ldots \\ 0-\sigma(x_{N})\sigma(x_{1}) & \ldots & \sigma(x_{N}) - \sigma(x_{N})\sigma(x_{N}) \end{array} \right ]$ $\frac{\delta{S}}{\delta{x}} = \left [ \begin{array}{ccc} \sigma(x_{1}) & \ldots & 0 \\ \ldots & \sigma(x_{j}) & \ldots \\ 0 & \ldots & \sigma(x_{N})\end{array} \right ] - \left[ \begin{array}{ccc} \sigma(x_{1})\sigma(x_{1}) & \ldots & \sigma(x_{1})\sigma(x_{N}) \\ \ldots & \sigma(x_{j})\sigma(x_{i}) & \ldots \\ \sigma(x_{N})\sigma(x_{1}) & \ldots & \sigma(x_{N})\sigma(x_{N}) \end{array} \right ]$

The matrix on the left is simply the vector S laid out along a diagonal. Numpy provides a diag function to do this for us:

np.diag(S)

And that generates a matrix for us:

array([[0.16205871, 0.        , 0.        , 0.        , 0.        ],
[0.        , 0.15308274, 0.        , 0.        , 0.        ],
[0.        , 0.        , 0.22661096, 0.        , 0.        ],
[0.        , 0.        , 0.        , 0.22634251, 0.        ],
[0.        , 0.        , 0.        , 0.        , 0.23190508]])

The matrix on the right is comprised of every element in S multiplied by all the elements in S. So we can create the first matrix by repeating S in rows, create the second matrix by repeating S in columns (i.e. transposing the first matrix) and finally, multiplying them together element-wise. Numpy comes to our help again by providing a tile function:

S_vector = S.reshape(S.shape,1)
S_matrix = np.tile(S_vector,S.shape)
print(S_matrix, '\n')
print(np.transpose(S_matrix))

And the output of this (again, yours will have different numbers due to the random initialisation):

[[0.16205871 0.16205871 0.16205871 0.16205871 0.16205871]
[0.15308274 0.15308274 0.15308274 0.15308274 0.15308274]
[0.22661096 0.22661096 0.22661096 0.22661096 0.22661096]
[0.22634251 0.22634251 0.22634251 0.22634251 0.22634251]
[0.23190508 0.23190508 0.23190508 0.23190508 0.23190508]]

[[0.16205871 0.15308274 0.22661096 0.22634251 0.23190508]
[0.16205871 0.15308274 0.22661096 0.22634251 0.23190508]
[0.16205871 0.15308274 0.22661096 0.22634251 0.23190508]
[0.16205871 0.15308274 0.22661096 0.22634251 0.23190508]
[0.16205871 0.15308274 0.22661096 0.22634251 0.23190508]]

Finally, let’s bring that all together into the single formula to calculate the Jacobian derivative of the Softmax function is:

np.diag(S) - (S_matrix * np.transpose(S_matrix))

Example output:

array([[ 0.13579568, -0.02480839, -0.03672428, -0.03668077, -0.03758224],
[-0.02480839,  0.12964841, -0.03469023, -0.03464913, -0.03550067],
[-0.03672428, -0.03469023,  0.17525843, -0.05129169, -0.05255223],
[-0.03668077, -0.03464913, -0.05129169,  0.17511158, -0.05248998],
[-0.03758224, -0.03550067, -0.05255223, -0.05248998,  0.17812512]])

So that’s the Softmax function and it’s derivative.

In the next part, we’ll look at how to combine this derivative with the summation function, common in artificial neural networks.

For several years now, few articles about artificial intelligence in the popular press are published without being accompanied by a picture of a Terminator robot. The point is clear: artificial intelligence is coming and it is terrifying.

Having sown the seeds of fear, the headline writers are now subtly reinforcing that view.

Take TechCrunch, which claims on its Editorial page to be “delivering top-notch reporting on the business of the tech industry”. This week it covered a story about Google using machine learning algorithms developed by its sibling company, DeepMind, to improve the efficiency of it’s data centres. These algorithms will look after the cooling systems and should deliver energy savings of 30%. This is a really great use of AI, making an expensive process cheaper and being good for the environment too.

But the headline is pure click-bait. Instead of focusing on the positives, the headline reads “Google gives its AI the reins over its data center cooling systems”. It invokes mental images of Skynet, HAL 9000 and even VIKI from I, Robot taking over. Yes, Google has an AI. It’s giving more control to it every day. You should be frightened.

Except it’s not true. And it’s really irritating.

Google doesn’t have an AI. It does have complicated decision making software running its cooling centres, one of the many complex software systems that keep the company alive every day. Most of this decision making is automated using traditional software development techniques. Lately, more of it is using machine learning models to make decisions faster than people could do.

What is irritating about this is that it distracts people from the real problems that AI is causing. Hard social problems such as the potential loss of jobs to automation, the bias inherent in any machine learning algorithms, and the concentration of this immense power in corporate hands with no oversight, are demanding attention.

These problems are complex and require lots of thinking and discussion by people to enable society to address the effects of powerful technology. We are poorly served by click-bait headlines in preparing for the artificial intelligence future.

# Neural Networks and the generalisation problem

Over the last few weeks, a robust debate has been taking place online about the prospects that Deep Learning neural networks would lead to advances in the quest for Artificial General Intelligence.

All current AI is what is known as Artificial Narrow Intelligence. This means that the models work well (sometimes extremely well) on specific problems that are well defined. Unfortunately, they are also quite brittle and do not generalise to other problems, or even variants of the problem they are trained on. By contrast, a long-term goal of the field is to get to AIs that can generalise and extrapolate, amongst other things. This is called Artificial General Intelligence.

The debate started back in September when Geoffrey Hinton proposed that researchers should start looking at alternatives to the default back propagation algorithms that are currently quite successful. This was followed up by a more detailed critical review published by Gary Marcus earlier this month outlining many of the problems with neural networks and deep learning. There has been quite a bit of debate about the merits of Marcus’ points on social media, so much so that he published a defence on Medium, responding to the various criticisms raised.

One of the most serious points is that artificial neural networks don’t generalise and cannot extrapolate from what they have been trained on to new instances with different characteristics. This is quite simple to demonstrate using Keras and TensorFlow. In this example, we have a neural network that tries to learn to reverse the digits in a binary number:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.optimizers import SGD

x = np.array([[0,0,0,0], [0,0,0,1], [0,0,1,0],
[0,1,0,1], [0,1,1,0], [0,1,1,1]])
y = np.array([[0,0,0,0], [1,0,0,0], [0,1,0,0],
[1,0,1,0], [0,1,1,0], [1,1,1,0]])

model = Sequential()
model.compile(loss='mean_squared_error', optimizer=SGD(lr=0.1))

model.fit(x,y, epochs=10000, batch_size=8, verbose=0)

The above model  trains the network on examples where the first digit is always 0. If we test the trained network, it will perform as expected, reversing the order of the digits:

np.around(model.predict(x))

generates the following output:

array([[0., 0., 0., 0.],
[1., 0., 0., 0.],
[0., 1., 0., 0.],
[1., 0., 1., 0.],
[0., 1., 1., 0.],
[1., 1., 1., 0.]], dtype=float32)

Which is exactly what we want. However, when we change the inputs to begin with ‘1’ then the network gets confused:

offdim = np.array([[1,0,0,0], [1,0,0,1], [1,0,1,0],
[1,1,0,1], [1,1,1,0], [1,1,1,1]])
np.around(model.predict(offdim))

as we can see in the following output:

array([[0., 0., 0., 0.],
[1., 0., 0., 0.],
[0., 1., 0., 0.],
[1., 0., 1., 0.],
[0., 1., 1., 0.],
[1., 1., 1., 0.]], dtype=float32)

Clearly the network isn’t learning anything about the concept of “reversing” even though it looks like it initially. It has completely ignored the new number.

This lack of generalising demonstrates clearly that neural networks, as currently architected, don’t operate on any level of abstraction. This is one of the fundamental problems with neural networks that must be solved if we have any hope of them advancing our efforts towards Artificial General Intelligence.

As Marcus points out in his defence article, there may be a need for a blend of techniques, including old-school symbolic AI to help deep learning networks move forward towards solving the generalisation problem.

Sources:

Deep Learning, A Critical Appraisal – Gary Marcus https://arxiv.org/abs/1801.00631 Deep Learning is at the core of much of modern Artificial Intelligence. It has had some spectacular recent successes, not least being a major part of the system that beat the world champion at Go.

Key to its success is the Back-Propagation algorithm, usually shortened to “Backprop”. I’ve written elsewhere about how this algorithm works, but essentially, it takes an error in the output of a neural network and propagates it backwards through the network, adjusting the network’s configuration as it goes to reduce that output error.

This algorithm has been so successful that deep learning neural networks are finding applications in a broad range of industries. Much of the recent popularisation of artificial intelligence and machine learning is due to this very success.

Now one of the longest proponents of this algorithm, Geoffrey Hinton from the University of Toronto has suggested that if progress is to be made on AI, then focus must shift away from Backprop. His view is based on the observation that the brain doesn’t learn that way and if intelligent machines are to be developed, new techniques are required.

In particular, he suggests that Unsupervised Learning will prove to be a fertile area. The method of learning does not rely on having fully labelled or classified datasets the way Supervised Learning and Backprop need. Instead it tries to understand patterns within the data itself to be able to make predictions and classifications.

Deep Learning does still have a lot to offer. However, given it’s requirements for large amounts of data and computing power, there is an increasing awareness that alternative machine learning techniques may prove to be equally fruitful.

# XOR Revisited: Keras and TensorFlow

A few weeks ago, it was announced that Keras would be getting official Google support and would become part of the TensorFlow machine learning library. Keras is a collection of high-level APIs in Python for creating and training neural networks, using either Theano or TensorFlow as the underlying engine.

Given my previous posts on implementing an XOR-solving neural network in a variety of different languages and tools, I thought it was time to see what it would look like in Keras.

XOR can be expressed as a classification problem that is best illustrated in a diagram. The goal is to create a neural network that will correctly predict the values 0 or 1, depending on the inputs x1 and x2 as shown. The neural network that is capable of being trained to solve that problem looks like this: If you’d like to understand why this is the case, have a look at the detailed explanation in the posts implementing the solution in Octave.

So how does this look in Keras? Well it’s rather simple. Assuming you’ve already installed Keras, we’ll start with setting up the classification problem and the expected outputs:

import numpy as np

x = np.array([[0,0], [0,1], [1,0], [1,1]])
y = np.array([, , , ])

So far, so good. We’re using numpy arrays to store our inputs (x) and outputs (y). Now for the neural network definition:

from keras.models import Sequential
from keras.layers import Dense, Activation

model = Sequential()
model.add(Activation('sigmoid'))

The Sequential model is simply a sequence of layers making up the network. Our diagram above has a set of inputs being fed into two processing layers. We’ve already defined the inputs, so all we need to do is add the other two layers.

In Keras, we’ll use Dense layers, which simply means they are is fully connected. The parameters indicate that the first layer has two nodes and the second layer has one node, corresponding to the diagram above.

The first layer also has the shape of the inputs which in this case is a one-dimensional vector with 2 elements. The second layer’s inputs will be inferred from the first layer.

We then add an Activation of type ‘sigmoid’ to each layer, again matching our neural network definition.

Note that Keras looks after the bias input without us having to explicitly code for it. In addition, Keras also looks after the weights (Θ1 and Θ2). This makes our neural network definition really straightforward and shows the benefits of using a high-level abstraction.

Finally, we apply a loss function and learning mode for Keras to be able to adjust the neural network:

model.compile(loss='mean_squared_error', optimizer='sgd', metrics=['accuracy'])

In this example, we’ll use the standard Mean Squared Error loss function and Stochastic Gradient Descent optimiser. And that’s it for the network definition.

If you want to see that the network looks like, use:

model.summary()

The network should look like this:

>> model.summary()
_________________________________________________________________
Layer (type)                  Output Shape         Param #
=================================================================
dense_1 (Dense)               (None, 2)            6
_________________________________________________________________
activation_1 (Activation)     (None, 2)            0
_________________________________________________________________
dense_2 (Dense)               (None, 1)            3
_________________________________________________________________
activation_2 (Activation)     (None, 1)            0
=================================================================
Total params: 9
Trainable params: 9
Non-trainable params: 0
_________________________________________________________________
>>>

Now we just need to kick off the training of the network.

model.fit(x,y, epochs=100000, batch_size=4)

All going well, the network weights will converge on a solution that can correctly classify the inputs (if not, you may need to up the number of epochs):

>>> model.predict(x, verbose=1)

4/4 [==============================] - 0s

array([[ 0.07856689],

[ 0.91362464],

[ 0.92543262],

[ 0.06886736]], dtype=float32)

>>>

Clearly this network is on it’s way to converging on the original expected outputs we defined above (y).

So that’s all there is to a Keras version of the XOR-solving neural network. The fact that it is using TensorFlow as the engine is completely hidden and that makes implementing the network a lot simpler.

# Artificial Intelligence to replace staff at O2 AI will eliminate jobs. Will new ones be created in time?

Studies have previously claimed that Artificial Intelligence is likely to replace jobs in the medium term.

This may have been a little optimistic in estimating the timeframe that it would likely occur.

Earlier this year, Fukoku Mutual Life Insurance announced that it would be replacing 34 employees, whose jobs involved calculating payouts to policyholders.

This week, the mobile phone company O2 announced that it would be launching a voice recognition AI next year that would be able to do the same job as customer service staff.

At Mobile World Congress in Barcelona, O2’s parent Telefonica presented the system. The Independent’s coverage of it pedicts:

It’s expected to launch in the UK next year, and will enable the company to cut customer service costs.

Cutting customer service costs can only mean a reduction in jobs. Coupled with their idea to sell the customer’s data, it looks like the AI system will help Telefonica turn what is a business cost into a revenue stream, all with less employees to worry about.

# AI ‘judge’ doesn’t explain why it reaches certain decisions

The Guardian reports on a recent paper by University College London researchers that are using artificial intelligence to predict the outcome of trials at the European Court of Human Rights.

Their approach employs natural language processing (NLP) to build a machine learning model, using the text records from previous trials. As such, it demonstrates the power of modern NLP techniques. Given enough relevant text in a particular area, NLP can discover complex underlying patterns. These patterns are then used to predict an outcome using new case texts.

However, the biggest obstacle to it being used in courts is that it is totally unable to explain why it has made the prediction it has. This problem plagues many machine learning implementations. The underlying mathematics of machine learning models is understood, but not well enough to be able to say for certain why a given input determines a given output. Unlike a human judge.

So for the moment, an AI won’t be determining if you really are innocent or guilty in a court of law.