# The Softmax function derivative

### Introduction

This post demonstrates the calculations behind the evaluation of the Softmax Derivative using Python. It is based on the excellent article by Eli Bendersky which can be found here.

### The Softmax Function

The softmax function simply takes a vector of N dimensions and returns a probability distribution also of N dimensions. Each element of the output is in the range (0,1) and the sum of the elements of N is 1.0.

Each element of the output is given by the formula: $\sigma(z_{i})=\frac{e^{z_{i}}} {\sum_{j=1}^K e^{z_{j}}}$

See https://en.wikipedia.org/wiki/Softmax_function for more details.

import numpy as np
x = np.random.random()

def softmax_basic(z):
exps = np.exp(z)
sums = np.sum(exps)
return np.divide(exps, sums)

softmax_basic(x)

This should generate an output that looks something like this:

array([0.97337094, 0.85251098, 0.62495691, 0.63957056, 0.6969253 ])

We expect that the sum of those will be (close to) 1.0:

np.sum(softmax_basic(x))

And so it is:

1.0

### Calculating the derivative

We need to calculate the partial derivative of the probability outputs $S_{i}$ with respect to each of the inputs $x_{j}$. For example: $\frac{\delta{S_{i}}}{\delta{x_{j}}}=\frac{\delta{\frac{e^{x_{i}}} {\sum_{k=1}^N e^{x_{k}}}}}{\delta{x_{j}}}$

Following Bendersky’s derivation, we need to use the quotient rule for derivatives: $\text{If }f(x)=\frac{g(x)}{h(x)} \text{ then:}$ $f'(x)=\frac{g'(x)h(x)-h'(x)g(x)}{[h(x)]^{2}}$

From the Softmax function: $\begin{array}{rcl} g_{i} & = & e^{x_{i}}\\ h_{i} & = & \sum_{k=1}^N e^{x_{k}} \end{array}$

The derivatives of these functions with respect to $x_{j}$ are: \frac{\delta{g_{i}}}{\delta{x_{j}}} = \left \{ \begin{aligned} &e^{x_{j}}, && \text{if}\ i=j \\ &0, && \text{otherwise} \end{aligned} \right.

and $\frac{\delta{h_{i}}}{\delta{x_{j}}}=\frac{\delta(e^{x_{1}}+e^{x_{2}}+...+e^{x_{N}})}{\delta{x_{j}}}=e^{x_{j}}$

Now we have to evalutate the quotient rule for the two seperate cases where $i=j$ and where $i\neq{j}$:

Starting with $i=j$: $\frac{\delta{\frac{e^{x_{i}}} {\sum_{k=1}^N e^{x_{k}}}}}{\delta{x_{j}}}=\frac{e^{x_{j}}\sum_{k=1}^N e^{x_{k}}-e^{x_{j}}e^{x_{i}}}{\left[\sum_{k=1}^N e^{x_{k}}\right]^{2}}$

Now we’ll just simplify this a bit: $\begin{array}{rcl} \frac{\delta{\frac{e^{x_{i}}} {\sum_{k=1}^N e^{x_{k}}}}}{\delta{x_{j}}} & = & \frac{e^{x_{j}}\left(\sum_{k=1}^N e^{x_{k}}-e^{x_{i}}\right)}{\left[\sum_{k=1}^N e^{x_{k}}\right]^{2}}\\ &=&\frac{e^{x_{j}}}{\sum_{k=1}^N e^{x_{k}}} \dot{}\frac{\sum_{k=1}^N e^{x_{k}}-e^{x_{i}}}{\sum_{k=1}^N e^{x_{k}}}\\ &=&\frac{e^{x_{j}}}{\sum_{k=1}^N e^{x_{k}}} \left(\frac{\sum_{k=1}^N e^{x_{k}}}{\sum_{k=1}^N e^{x_{k}}}-\frac{e^{x_{i}}}{\sum_{k=1}^N e^{x_{k}}}\right)\\ &=&\frac{e^{x_{j}}}{\sum_{k=1}^N e^{x_{k}}} \left(1-\frac{e^{x_{i}}}{\sum_{k=1}^N e^{x_{k}}}\right)\\ &=&\sigma(x_{j})(1-\sigma(x_{i})) \end{array}$

For the case where $i\neq j$: $\begin{array}{rcl} \frac{\delta{\frac{e^{x_{i}}} {\sum_{k=1}^N e^{x_{k}}}}}{\delta{x_{j}}}&=&\frac{0-e^{x_{j}}e^{x_{i}}}{\left[\sum_{k=1}^N e^{x_{k}}\right]^{2}}\\ &=&0-\frac{e^{x_{j}}}{\sum_{k=1}^N e^{x_{k}}}\dot{}\frac{e^{x_{i}}}{\sum_{k=1}^N e^{x_{k}}}\\ &=&0-\sigma(x_{j})\sigma(x_{i}) \end{array}$

Here are some examples of those function:

S = softmax_basic(x)

# where i = j
div = S * (1 - S)
print("S :", S, "div :", div)

div = S * (1 - S)
print("S :", S, "div :", div)

div = S * (1 - S)
print("S :", S, "div :", div)

# where i != j
div = 0 - (S * S)
print("S :", S, "S :", S, "div :", div)

div = 0 - (S * S)
print("S :", S, "S :", S, "div :", div)

div = 0 - (S * S)
print("S :", S, "S :", S, "div :", div)

Here’s the output (yours may have different numbers):

S : 0.16205870737014744 div : 0.13579568273566436
S : 0.1530827392322831 div : 0.12964841418142392
S : 0.22661095862407418 div : 0.1752584320555523
S : 0.16205870737014744 S : 0.1530827392322831 div : -0.024808390840665155
S : 0.1530827392322831 S : 0.22661095862407418 div : -0.034690226286226845
S : 0.22661095862407418 S : 0.226342511074585 div : -0.051291693411991836

Let’s tidy up the formula a little so we can come up with a decent implementation in the code.

Looking at the case where $i=j$: $\frac{\delta{S_{i}}}{\delta{x_{j}}}=\sigma(x_{j})(1-\sigma(x_{i}))\\ =\sigma(x_{j}) - \sigma(x_{j})\sigma(x_{i})$

And where $i\neq j$ $\frac{\delta{S_{i}}}{\delta{x_{j}}}=0-\sigma(x_{j})\sigma(x_{i})$

We can generalise that formula by calculating the Jacobian matrix. This matrix will look like this: $\frac{\delta{S}}{\delta{x}} = \left [ \begin{array}{ccc} \frac{\delta{S_{1}}}{\delta{x_{1}}} & \ldots & \frac{\delta{S_{1}}}{\delta{x_{N}}} \\ \ldots & \frac{\delta{S_{i}}}{\delta{x_{j}}} & \ldots \\ \frac{\delta{S_{N}}}{\delta{x_{1}}} & \ldots & \frac{\delta{S_{N}}}{\delta{x_{N}}} \end{array} \right ]$ $\frac{\delta{S}}{\delta{x}} = \left [ \begin{array}{ccc} \sigma(x_{1}) - \sigma(x_{1})\sigma(x_{1}) & \ldots & 0-\sigma(x_{1})\sigma(x_{N}) \\ \ldots & \sigma(x_{j}) - \sigma(x_{j})\sigma(x_{i}) & \ldots \\ 0-\sigma(x_{N})\sigma(x_{1}) & \ldots & \sigma(x_{N}) - \sigma(x_{N})\sigma(x_{N}) \end{array} \right ]$ $\frac{\delta{S}}{\delta{x}} = \left [ \begin{array}{ccc} \sigma(x_{1}) & \ldots & 0 \\ \ldots & \sigma(x_{j}) & \ldots \\ 0 & \ldots & \sigma(x_{N})\end{array} \right ] - \left[ \begin{array}{ccc} \sigma(x_{1})\sigma(x_{1}) & \ldots & \sigma(x_{1})\sigma(x_{N}) \\ \ldots & \sigma(x_{j})\sigma(x_{i}) & \ldots \\ \sigma(x_{N})\sigma(x_{1}) & \ldots & \sigma(x_{N})\sigma(x_{N}) \end{array} \right ]$

The matrix on the left is simply the vector S laid out along a diagonal. Numpy provides a diag function to do this for us:

np.diag(S)

And that generates a matrix for us:

array([[0.16205871, 0.        , 0.        , 0.        , 0.        ],
[0.        , 0.15308274, 0.        , 0.        , 0.        ],
[0.        , 0.        , 0.22661096, 0.        , 0.        ],
[0.        , 0.        , 0.        , 0.22634251, 0.        ],
[0.        , 0.        , 0.        , 0.        , 0.23190508]])

The matrix on the right is comprised of every element in S multiplied by all the elements in S. So we can create the first matrix by repeating S in rows, create the second matrix by repeating S in columns (i.e. transposing the first matrix) and finally, multiplying them together element-wise. Numpy comes to our help again by providing a tile function:

S_vector = S.reshape(S.shape,1)
S_matrix = np.tile(S_vector,S.shape)
print(S_matrix, '\n')
print(np.transpose(S_matrix))

And the output of this (again, yours will have different numbers due to the random initialisation):

[[0.16205871 0.16205871 0.16205871 0.16205871 0.16205871]
[0.15308274 0.15308274 0.15308274 0.15308274 0.15308274]
[0.22661096 0.22661096 0.22661096 0.22661096 0.22661096]
[0.22634251 0.22634251 0.22634251 0.22634251 0.22634251]
[0.23190508 0.23190508 0.23190508 0.23190508 0.23190508]]

[[0.16205871 0.15308274 0.22661096 0.22634251 0.23190508]
[0.16205871 0.15308274 0.22661096 0.22634251 0.23190508]
[0.16205871 0.15308274 0.22661096 0.22634251 0.23190508]
[0.16205871 0.15308274 0.22661096 0.22634251 0.23190508]
[0.16205871 0.15308274 0.22661096 0.22634251 0.23190508]]

Finally, let’s bring that all together into the single formula to calculate the Jacobian derivative of the Softmax function is:

np.diag(S) - (S_matrix * np.transpose(S_matrix))

Example output:

array([[ 0.13579568, -0.02480839, -0.03672428, -0.03668077, -0.03758224],
[-0.02480839,  0.12964841, -0.03469023, -0.03464913, -0.03550067],
[-0.03672428, -0.03469023,  0.17525843, -0.05129169, -0.05255223],
[-0.03668077, -0.03464913, -0.05129169,  0.17511158, -0.05248998],
[-0.03758224, -0.03550067, -0.05255223, -0.05248998,  0.17812512]])

So that’s the Softmax function and it’s derivative. In the next post, we’ll be looking at what happens when Softmax is combined with other functions, such as happens in an output layer in an artificial neural network.

# Bad headlines distract from real AI problems

For several years now, few articles about artificial intelligence in the popular press are published without being accompanied by a picture of a Terminator robot. The point is clear: artificial intelligence is coming and it is terrifying.

Having sown the seeds of fear, the headline writers are now subtly reinforcing that view.

Take TechCrunch, which claims on its Editorial page to be “delivering top-notch reporting on the business of the tech industry”. This week it covered a story about Google using machine learning algorithms developed by its sibling company, DeepMind, to improve the efficiency of it’s data centres. These algorithms will look after the cooling systems and should deliver energy savings of 30%. This is a really great use of AI, making an expensive process cheaper and being good for the environment too.

But the headline is pure click-bait. Instead of focusing on the positives, the headline reads “Google gives its AI the reins over its data center cooling systems”. It invokes mental images of Skynet, HAL 9000 and even VIKI from I, Robot taking over. Yes, Google has an AI. It’s giving more control to it every day. You should be frightened.

Except it’s not true. And it’s really irritating.

Google doesn’t have an AI. It does have complicated decision making software running its cooling centres, one of the many complex software systems that keep the company alive every day. Most of this decision making is automated using traditional software development techniques. Lately, more of it is using machine learning models to make decisions faster than people could do.

What is irritating about this is that it distracts people from the real problems that AI is causing. Hard social problems such as the potential loss of jobs to automation, the bias inherent in any machine learning algorithms, and the concentration of this immense power in corporate hands with no oversight, are demanding attention.

These problems are complex and require lots of thinking and discussion by people to enable society to address the effects of powerful technology. We are poorly served by click-bait headlines in preparing for the artificial intelligence future.

# Neural Networks and the generalisation problem

Over the last few weeks, a robust debate has been taking place online about the prospects that Deep Learning neural networks would lead to advances in the quest for Artificial General Intelligence.

All current AI is what is known as Artificial Narrow Intelligence. This means that the models work well (sometimes extremely well) on specific problems that are well defined. Unfortunately, they are also quite brittle and do not generalise to other problems, or even variants of the problem they are trained on. By contrast, a long-term goal of the field is to get to AIs that can generalise and extrapolate, amongst other things. This is called Artificial General Intelligence.

The debate started back in September when Geoffrey Hinton proposed that researchers should start looking at alternatives to the default back propagation algorithms that are currently quite successful. This was followed up by a more detailed critical review published by Gary Marcus earlier this month outlining many of the problems with neural networks and deep learning. There has been quite a bit of debate about the merits of Marcus’ points on social media, so much so that he published a defence on Medium, responding to the various criticisms raised.

One of the most serious points is that artificial neural networks don’t generalise and cannot extrapolate from what they have been trained on to new instances with different characteristics. This is quite simple to demonstrate using Keras and TensorFlow. In this example, we have a neural network that tries to learn to reverse the digits in a binary number:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.optimizers import SGD

x = np.array([[0,0,0,0], [0,0,0,1], [0,0,1,0],
[0,1,0,1], [0,1,1,0], [0,1,1,1]])
y = np.array([[0,0,0,0], [1,0,0,0], [0,1,0,0],
[1,0,1,0], [0,1,1,0], [1,1,1,0]])

model = Sequential()
model.compile(loss='mean_squared_error', optimizer=SGD(lr=0.1))

model.fit(x,y, epochs=10000, batch_size=8, verbose=0)

The above model  trains the network on examples where the first digit is always 0. If we test the trained network, it will perform as expected, reversing the order of the digits:

np.around(model.predict(x))

generates the following output:

array([[0., 0., 0., 0.],
[1., 0., 0., 0.],
[0., 1., 0., 0.],
[1., 0., 1., 0.],
[0., 1., 1., 0.],
[1., 1., 1., 0.]], dtype=float32)

Which is exactly what we want. However, when we change the inputs to begin with ‘1’ then the network gets confused:

offdim = np.array([[1,0,0,0], [1,0,0,1], [1,0,1,0],
[1,1,0,1], [1,1,1,0], [1,1,1,1]])
np.around(model.predict(offdim))

as we can see in the following output:

array([[0., 0., 0., 0.],
[1., 0., 0., 0.],
[0., 1., 0., 0.],
[1., 0., 1., 0.],
[0., 1., 1., 0.],
[1., 1., 1., 0.]], dtype=float32)

Clearly the network isn’t learning anything about the concept of “reversing” even though it looks like it initially. It has completely ignored the new number.

This lack of generalising demonstrates clearly that neural networks, as currently architected, don’t operate on any level of abstraction. This is one of the fundamental problems with neural networks that must be solved if we have any hope of them advancing our efforts towards Artificial General Intelligence.

As Marcus points out in his defence article, there may be a need for a blend of techniques, including old-school symbolic AI to help deep learning networks move forward towards solving the generalisation problem.

Sources:

Deep Learning, A Critical Appraisal – Gary Marcus https://arxiv.org/abs/1801.00631

In defense of skepticism about deep learning – Gary Marcus https://medium.com/@GaryMarcus/in-defense-of-skepticism-about-deep-learning-6e8bfd5ae0f1

# Deep Learning Dead-End? Deep Learning is at the core of much of modern Artificial Intelligence. It has had some spectacular recent successes, not least being a major part of the system that beat the world champion at Go.

Key to its success is the Back-Propagation algorithm, usually shortened to “Backprop”. I’ve written elsewhere about how this algorithm works, but essentially, it takes an error in the output of a neural network and propagates it backwards through the network, adjusting the network’s configuration as it goes to reduce that output error.

This algorithm has been so successful that deep learning neural networks are finding applications in a broad range of industries. Much of the recent popularisation of artificial intelligence and machine learning is due to this very success.

Now one of the longest proponents of this algorithm, Geoffrey Hinton from the University of Toronto has suggested that if progress is to be made on AI, then focus must shift away from Backprop. His view is based on the observation that the brain doesn’t learn that way and if intelligent machines are to be developed, new techniques are required.

In particular, he suggests that Unsupervised Learning will prove to be a fertile area. The method of learning does not rely on having fully labelled or classified datasets the way Supervised Learning and Backprop need. Instead it tries to understand patterns within the data itself to be able to make predictions and classifications.

Deep Learning does still have a lot to offer. However, given it’s requirements for large amounts of data and computing power, there is an increasing awareness that alternative machine learning techniques may prove to be equally fruitful.

# XOR Revisited: Keras and TensorFlow

A few weeks ago, it was announced that Keras would be getting official Google support and would become part of the TensorFlow machine learning library. Keras is a collection of high-level APIs in Python for creating and training neural networks, using either Theano or TensorFlow as the underlying engine.

Given my previous posts on implementing an XOR-solving neural network in a variety of different languages and tools, I thought it was time to see what it would look like in Keras.

XOR can be expressed as a classification problem that is best illustrated in a diagram. The goal is to create a neural network that will correctly predict the values 0 or 1, depending on the inputs x1 and x2 as shown. The neural network that is capable of being trained to solve that problem looks like this: If you’d like to understand why this is the case, have a look at the detailed explanation in the posts implementing the solution in Octave.

So how does this look in Keras? Well it’s rather simple. Assuming you’ve already installed Keras, we’ll start with setting up the classification problem and the expected outputs:

import numpy as np

x = np.array([[0,0], [0,1], [1,0], [1,1]])
y = np.array([, , , ])

So far, so good. We’re using numpy arrays to store our inputs (x) and outputs (y). Now for the neural network definition:

from keras.models import Sequential
from keras.layers import Dense, Activation

model = Sequential()
model.add(Activation('sigmoid'))

The Sequential model is simply a sequence of layers making up the network. Our diagram above has a set of inputs being fed into two processing layers. We’ve already defined the inputs, so all we need to do is add the other two layers.

In Keras, we’ll use Dense layers, which simply means they are is fully connected. The parameters indicate that the first layer has two nodes and the second layer has one node, corresponding to the diagram above.

The first layer also has the shape of the inputs which in this case is a one-dimensional vector with 2 elements. The second layer’s inputs will be inferred from the first layer.

We then add an Activation of type ‘sigmoid’ to each layer, again matching our neural network definition.

Note that Keras looks after the bias input without us having to explicitly code for it. In addition, Keras also looks after the weights (Θ1 and Θ2). This makes our neural network definition really straightforward and shows the benefits of using a high-level abstraction.

Finally, we apply a loss function and learning mode for Keras to be able to adjust the neural network:

model.compile(loss='mean_squared_error', optimizer='sgd', metrics=['accuracy'])

In this example, we’ll use the standard Mean Squared Error loss function and Stochastic Gradient Descent optimiser. And that’s it for the network definition.

If you want to see that the network looks like, use:

model.summary()

The network should look like this:

>> model.summary()
_________________________________________________________________
Layer (type)                  Output Shape         Param #
=================================================================
dense_1 (Dense)               (None, 2)            6
_________________________________________________________________
activation_1 (Activation)     (None, 2)            0
_________________________________________________________________
dense_2 (Dense)               (None, 1)            3
_________________________________________________________________
activation_2 (Activation)     (None, 1)            0
=================================================================
Total params: 9
Trainable params: 9
Non-trainable params: 0
_________________________________________________________________
>>>

Now we just need to kick off the training of the network.

model.fit(x,y, epochs=100000, batch_size=4)

All going well, the network weights will converge on a solution that can correctly classify the inputs (if not, you may need to up the number of epochs):

>>> model.predict(x, verbose=1)

4/4 [==============================] - 0s

array([[ 0.07856689],

[ 0.91362464],

[ 0.92543262],

[ 0.06886736]], dtype=float32)

>>>

Clearly this network is on it’s way to converging on the original expected outputs we defined above (y).

So that’s all there is to a Keras version of the XOR-solving neural network. The fact that it is using TensorFlow as the engine is completely hidden and that makes implementing the network a lot simpler.

# Artificial Intelligence to replace staff at O2 AI will eliminate jobs. Will new ones be created in time?

Studies have previously claimed that Artificial Intelligence is likely to replace jobs in the medium term.

This may have been a little optimistic in estimating the timeframe that it would likely occur.

Earlier this year, Fukoku Mutual Life Insurance announced that it would be replacing 34 employees, whose jobs involved calculating payouts to policyholders.

This week, the mobile phone company O2 announced that it would be launching a voice recognition AI next year that would be able to do the same job as customer service staff.

At Mobile World Congress in Barcelona, O2’s parent Telefonica presented the system. The Independent’s coverage of it pedicts:

It’s expected to launch in the UK next year, and will enable the company to cut customer service costs.

Cutting customer service costs can only mean a reduction in jobs. Coupled with their idea to sell the customer’s data, it looks like the AI system will help Telefonica turn what is a business cost into a revenue stream, all with less employees to worry about.

# AI ‘judge’ doesn’t explain why it reaches certain decisions

The Guardian reports on a recent paper by University College London researchers that are using artificial intelligence to predict the outcome of trials at the European Court of Human Rights.

Their approach employs natural language processing (NLP) to build a machine learning model, using the text records from previous trials. As such, it demonstrates the power of modern NLP techniques. Given enough relevant text in a particular area, NLP can discover complex underlying patterns. These patterns are then used to predict an outcome using new case texts.

However, the biggest obstacle to it being used in courts is that it is totally unable to explain why it has made the prediction it has. This problem plagues many machine learning implementations. The underlying mathematics of machine learning models is understood, but not well enough to be able to say for certain why a given input determines a given output. Unlike a human judge.

So for the moment, an AI won’t be determining if you really are innocent or guilty in a court of law.

# Quick test

Just looking to see what an aside post looks like on WordPress. Right. Looks very similar to a normal post.

# In Praise Of Reinventing The Wheel The original Ferris Wheel at the 1893 World Columbian Exposition in Chicago (Source: Public Domain via Wikipedia)

There is a strong principle in software engineering of reusing code wherever possible. It’s considered so important, that it’s got its own TLA: DRY (“Don’t Repeat Yourself”). In other words, don’t reinvent the wheel.

There are obvious benefits to this. Nobody wants to type the same piece of code over and over. So write it once as a function and reuse it wherever you need to, freeing yourself up to build other stuff. Better still, reuse someone else’s functions.

Indeed, the ready availability of open-source libraries and frameworks for almost every popular requirement under the sun, means that you almost never have to develop your own low level software. Thank goodness. Software development would get really tedious if you had to start out writing basic I/O routines with each new project.

Even new developers can get sophisticated projects up and running fairly quickly. Github and StackOverflow have seen to that.

The principle is taken to an extreme by some web developers, who include the entire jQuery library on their webpage in order to implement some simple form validation. I’m all in favour of simplifying JavaScript development, second only to abolishing it altogether. But including the whole library is a tad overkill.

And this highlights a problem with treating DRY as dogma.

When we rely on existing libraries and frameworks, we don’t get a good idea of what’s really happening within our code base. Yes, we can read the library code to get the gist of what’s going on. But we never really do.

There’s a really good reason to try to implement your own routines, if only in your test projects. You’ll get an appreciation of the lower level algorithms that your applications depend on. You may even discover how difficult it is to get some of these algorithms to run quickly or to use a small amount of memory.

This will lead you to the core principles of algorithmic analysis that are fundamental to computer science. Along the way, you’ll pick up a really good sense of what works well and what doesn’t, what’s fast and what’s slow, what’s memory efficient and what’s a complete hog. Hopefully, you can reflect those lessons in your own production application code.

So the next time you have a chance, try to implement your own I/O routines or other low level algorithms and see how you compare to other published libraries.

Remember, DRY is a principle, not a dogma.

Sometimes there is merit in repeating yourself (or someone else).

# Lethal Autonomous Weapon Systems are on the way Long Range Anti-Ship Missile (LRASM)

Chinatopix reports that a new missile with on-board Artificial Intelligence will be deployed by both the U.S. Navy and U.S. Air Force by 2018. The AI will be able to pick out the correct ship to target with a fleet. In addition, the article states that multiple LRASMs can share information, and attack as a swarm.

While not completely autonomous, this nevertheless represents a serious step towards ceding control of ordinance to a machine. Given the current poor understanding of how a lot of machine-learning actually works, this is a dangerous step.

Recently, the United Nations debated such Lethal Autonomous Weapon Systems (LAWS), with many countries pushing for an outright ban. With AI-based missiles in development, the UN and the international community will have to speed up their deliberations in order to prevent such weapons ever being deployed.