Solving MNIST with a Neural Network from the ground up

Note: Here’s the Python source code for this project in a Jupyter notebook on GitHub

I’ve written before about the benefits of reinventing the wheel and this is one of those occasions where it was definitely worth the effort. Sometimes, there is just no substitute for trying to implement an algorithm to really understand what’s going on under the hood. This is especially true when learning about artificial neural networks. Sure, there are plenty of frameworks available that implement any flavour of neural network, complete with a dazzling array of optimisations, activations and loss functions. That may solve your problem, but it abstracts away a lot of the details about why it solves it.

MNIST is a great dataset to start with. It’s a collection of 60,000 images of handwritten digits, plus a further 10,000 images that can be used as the test set. It’s been well studied and most frameworks have sample implementations. Here’s an example image:

You can find the full dataset of images on Yann LeCun’s website.

While it’s useful to reinvent the wheel, we should at least learn from those that have already built wheels before. The first thing I borrowed was the network architecture from TensorFlow. Their example has:

  • 28×28 input
  • a hidden layer with 512 neurons with ReLU activation
  • an output layer with 10 neurons (representing the 10 possible digits) with Softmax activation
  • Cross-Entropy loss function

The next thing to work on was the feedforward part of the network. This is relatively straightforward as these functions are well documented online and the network itself isn’t complicated.
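To make the feedforward step concrete, here is a minimal NumPy sketch of the architecture listed above (784 inputs, 512 ReLU neurons, 10 Softmax outputs, Cross-Entropy loss). It is not the code from the notebook: the function names, the weight initialisation and the random stand-in data are all my own assumptions for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    # Subtract the row maximum first so np.exp cannot overflow.
    shifted = z - z.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(probs, one_hot):
    # Mean negative log-likelihood of the correct class.
    eps = 1e-12  # guards against log(0)
    return -np.mean(np.sum(one_hot * np.log(probs + eps), axis=1))

def init_params(rng, n_in=784, n_hidden=512, n_out=10):
    # Scaled random weights; this particular scaling is an assumption.
    return {
        "W1": rng.normal(0.0, np.sqrt(2.0 / n_in), (n_in, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, np.sqrt(2.0 / n_hidden), (n_hidden, n_out)),
        "b2": np.zeros(n_out),
    }

def forward(params, x):
    z1 = x @ params["W1"] + params["b1"]   # hidden pre-activation
    a1 = relu(z1)                          # hidden activation
    z2 = a1 @ params["W2"] + params["b2"]  # output logits
    probs = softmax(z2)                    # class probabilities
    return z1, a1, probs

rng = np.random.default_rng(0)
params = init_params(rng)
x = rng.random((4, 784))          # stand-in for 4 flattened 28x28 images
y = np.eye(10)[[3, 1, 4, 1]]      # made-up one-hot labels
_, _, probs = forward(params, x)
loss = cross_entropy(probs, y)
```

Each row of `probs` sums to 1, so it can be read as the network’s belief about which digit the image shows.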

The tough part was working through the back-propagation algorithm. In a previous post, I detailed how to work out the derivatives of the Softmax function and the Cross Entropy loss. The most obvious way is to use the Chain Rule in Differential Calculus to work out the gradients and propagate them back through the network. The steps are pleasing to my eye and appeal to my sense of order in code. (Tip: Use a spreadsheet on a small example network to see the actual matrices in action.)
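The spreadsheet tip translates well to code. The sketch below builds a tiny network (3 inputs, 4 hidden neurons, 2 outputs; sizes chosen arbitrarily), back-propagates with the chain rule, and compares the result against gradients estimated by nudging each weight. It relies on the standard result that the gradient of Cross-Entropy through Softmax with respect to the logits is `probs - y`. All names here are hypothetical and not taken from the notebook.

```python
import numpy as np

def forward(params, x, y):
    W1, b1, W2, b2 = params
    z1 = x @ W1 + b1
    a1 = np.maximum(0.0, z1)                       # ReLU
    z2 = a1 @ W2 + b2
    exp = np.exp(z2 - z2.max(axis=1, keepdims=True))
    probs = exp / exp.sum(axis=1, keepdims=True)   # Softmax
    loss = -np.mean(np.sum(y * np.log(probs), axis=1))
    return loss, (z1, a1, probs)

def backward(params, x, y, cache):
    W1, b1, W2, b2 = params
    z1, a1, probs = cache
    n = x.shape[0]
    dz2 = (probs - y) / n        # Softmax + Cross-Entropy combined
    dW2 = a1.T @ dz2
    db2 = dz2.sum(axis=0)
    da1 = dz2 @ W2.T             # propagate back to the hidden layer
    dz1 = da1 * (z1 > 0)         # ReLU derivative is 0 or 1
    dW1 = x.T @ dz1
    db1 = dz1.sum(axis=0)
    return [dW1, db1, dW2, db2]

def numerical_grad(params, x, y, index, eps=1e-6):
    # Central differences: nudge one entry at a time and re-run forward.
    grad = np.zeros_like(params[index])
    for idx in np.ndindex(params[index].shape):
        original = params[index][idx]
        params[index][idx] = original + eps
        loss_plus, _ = forward(params, x, y)
        params[index][idx] = original - eps
        loss_minus, _ = forward(params, x, y)
        params[index][idx] = original
        grad[idx] = (loss_plus - loss_minus) / (2 * eps)
    return grad

rng = np.random.default_rng(1)
params = [rng.normal(size=(3, 4)), rng.normal(size=4),
          rng.normal(size=(4, 2)), rng.normal(size=2)]
x = rng.normal(size=(5, 3))
y = np.eye(2)[[0, 1, 1, 0, 1]]

loss, cache = forward(params, x, y)
analytic = backward(params, x, y, cache)
max_error = max(
    np.abs(analytic[i] - numerical_grad(params, x, y, i)).max()
    for i in range(len(params))
)
```

If the back-propagation is right, `max_error` should be tiny. A check like this is far too slow for the full network, but it is handy on a toy one.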

But (and it’s a big but), the basic approach uses Jacobian matrices. Each cell in this kind of matrix is a partial derivative; each matrix records how every output of a function changes with respect to every one of its inputs. As a result, they can grow very large very quickly, and multiplying them together becomes slow and memory-hungry. In the notebook, I’ve left the functions representing this approach in for comparison; if you do run them, you’ll immediately notice the problems with speed and memory.

Luckily there are shortcuts, which mean that we can directly calculate the gradients without resorting to Jacobian matrix multiplication. You can see these in the Short Form section of the notebook. In a sense though, these are abstractions too and it’s difficult to see the back-propagation from the shortcut methods.
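As an illustration of one such shortcut (a sketch with my own function names, not the notebook’s Short Form code), take the output layer for a single example. The long form builds the full Softmax Jacobian and multiplies it by the Cross-Entropy gradient; the short form collapses the same calculation to `p - y`.

```python
import numpy as np

def softmax(z):
    exp = np.exp(z - z.max())
    return exp / exp.sum()

def softmax_jacobian(p):
    # J[i, j] = d p_i / d z_j = p_i * (delta_ij - p_j)
    return np.diag(p) - np.outer(p, p)

def grad_logits_long_form(z, y):
    p = softmax(z)
    dL_dp = -y / p                          # d/dp of -sum(y * log(p))
    return softmax_jacobian(p).T @ dL_dp    # chain rule through the Jacobian

def grad_logits_short_form(z, y):
    # The Jacobian product above simplifies algebraically to this.
    return softmax(z) - y

z = np.array([2.0, -1.0, 0.5, 0.0])   # made-up logits
y = np.array([0.0, 0.0, 1.0, 0.0])    # one-hot target
long_form = grad_logits_long_form(z, y)
short_form = grad_logits_short_form(z, y)
```

Both give the same vector, but the long form needs an n×n matrix per example while the short form needs only a subtraction. That difference is small for 10 outputs and becomes painful once the same approach is applied to every layer and every example in a batch.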

Lastly, I’ve implemented some code to gather the standard metrics for evaluating how good a machine learning model is. I’ve run it several times and it usually gets an overall accuracy score of between 92% and 95% on the MNIST test dataset.
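For reference, here is one way such metrics can be computed from predicted and true labels. This is a sketch under my own assumptions (a confusion matrix plus accuracy, per-class precision and recall), not necessarily the set of metrics or the code used in the notebook, and the labels are made up.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes=10):
    # Rows are true classes, columns are predicted classes.
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)
    return cm

def summarise(cm):
    tp = np.diag(cm).astype(float)   # correct predictions per class
    predicted = cm.sum(axis=0)       # how often each class was predicted
    actual = cm.sum(axis=1)          # how often each class really occurred
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    accuracy = tp.sum() / cm.sum()
    return accuracy, precision, recall

# Toy 3-class example with invented labels.
y_true = np.array([0, 1, 2, 2, 1, 0, 2, 1])
y_pred = np.array([0, 1, 2, 1, 1, 0, 2, 0])
cm = confusion_matrix(y_true, y_pred, n_classes=3)
accuracy, precision, recall = summarise(cm)
```

In this toy example 6 of the 8 predictions are correct, so `accuracy` is 0.75. On MNIST the same functions would be called with `n_classes=10` and the test set labels.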

One of the main things I learned from this exercise is that the actual coding of a network is relatively simple. The really hard part that took a while was figuring out the calculus and especially the shortcuts. I really appreciate now why those frameworks are popular and make coding neural networks so much easier.

If you fancy a challenge, I can recommend working on a neural network from first principles. You never know what you might learn!

Data Mining Algorithms Explained In Plain Language

Here’s a really great resource. Raymond Li from Microsoft has written an explanation of the top 10 data mining algorithms, but in plain language. These algorithms are used a lot in machine learning too.

So if you are confused about Naive Bayes or Support Vector Machines, then take a look at Ray’s easy to understand explanations.

Top 10 data mining algorithms in plain English

Ambient Intelligence

This is an interesting idea: Ambient Intelligence.

“an ever-present digital fog in tune with our behavior and physiological state”

As AI slowly improves in particular domains, we will see the techniques and algorithms incorporated into previously static control systems.

For example, your household heating system could learn what you like and then adjust the temperature in the most efficient way. Other appliances will have some form of learning built in too.

So what we will see will be the widespread adoption of Artificial Narrow Intelligence. There won’t be one Artificial General Intelligence in your house, just lots of narrow ones.

The problem I see with this is the lack of serendipity. What a bland world it would be if we were surrounded by devices which made the world just perfect for us with no surprises. How would we ever be enticed to step outside our comfort zones? We would replicate our “echo chamber” online experience in the real world.

Here’s the full article by Neil Howe (@lifecourse):

Artificial Intelligence Paves The Way For Ambient Intelligence – Forbes.

The Machine Vision Algorithm Beating Art Historians at Their Own Game

This is an interesting application of image processing. The machine learning algorithms are trained on a subset of paintings taken from a data set of more than 80,000. The resulting feature set has over 400 dimensions.

When presented with a painting it has not seen before, it correctly guesses the artist more than 60% of the time. It has also detected additional links between different styles and periods:

It links expressionism and fauvism, which might be expected given that the latter movement is often thought of as a type of expressionism. It links the mannerist and Renaissance styles, which clearly reflects that fact that mannerism is a form of early Renaissance painting.

However, it also apparently confuses certain styles:

… it often confuses examples of abstract expressionism and action paintings, in which artists drip or fling paint and step on the canvas. Saleh and Elgammal [the creators of the ML algorithms] … say that this kind of mix-up would be entirely understandable for a human viewer. “‘Action painting’ is a type or subgenre of ‘abstract expressionism,’” they point out.

Of course, this could also mean that the machine is correct and different “genres” of abstract paintings are completely arbitrary. But what it does highlight is that machine learning has a way to go before it can start offering subjective opinions.

via The Machine Vision Algorithm Beating Art Historians at Their Own Game | MIT Technology Review.

Poker Pros Win Against Carnegie Mellon AI

A few weeks ago in another post, I mentioned that four human professional poker players were going up against an Artificial Intelligence called Claudico. Time to check in at the Rivers Casino in Pittsburgh to see how they got on.

Pittsburgh Supercomputing Center's Blacklight supercomputer which hosted Claudico

Over the week, the professionals played 80,000 hands of no-limit hold ’em poker against the AI. Three of the four players ended up with more chips than the computer, so it would seem that for now, humans still hold the advantage.

In the final chip tally, Bjorn Li had an individual chip total of $529,033, Doug Polk had $213,671 and Dong Kim had $70,491. Jason Les trailed Claudico by $80,482.

However, measured as a percentage of the total amount bet over the week ($170m), the actual winnings are small, which means that the result is much closer than it would appear. So just like Deep Blue in the 1990s, the AI doesn’t seem to be too far away from an outright win.
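A quick back-of-the-envelope check shows how thin the margin is. The sketch uses only the figures quoted above, treats Jason Les’s deficit as a negative total, and takes the $170m as an approximate figure.

```python
# Final chip totals relative to Claudico, as quoted in the post.
results = {
    "Bjorn Li": 529_033,
    "Doug Polk": 213_671,
    "Dong Kim": 70_491,
    "Jason Les": -80_482,   # trailed the AI, so counted as negative
}
total_bet = 170_000_000     # approximate amount bet over the week

human_margin = sum(results.values())
margin_pct = 100 * human_margin / total_bet

print(f"Combined human lead: ${human_margin:,}")
print(f"As a share of the money bet: {margin_pct:.2f}%")
```

The combined lead of $732,713 works out at roughly 0.43% of the money wagered.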

Brains Vs. AI | Carnegie Mellon School of Computer Science.

Deep Learning Machine Solves the Cocktail Party Problem


Everyone is familiar with the cocktail party effect. When chatting in a crowded space, humans can focus on a particular speaker’s voice, while filtering out other voices and noise.

We do this without really thinking about it, effortlessly. But it turns out to be a surprisingly difficult problem to solve for a machine.

However, a team at the University of Surrey have developed a technique to separate a singer’s voice from the music in songs. Karaoke anyone?

Deep Learning Machine Solves the Cocktail Party Problem | MIT Technology Review.

Should we be thinking about Artificial Intelligence Rights?

Ryan Calo (@rcalo) writes an interesting piece in Forbes (see link below) about conferring human rights on Artificial Intelligences, a theme Alex Garland has been discussing during the promotion of his new film Ex Machina.

He rightly points out that we can’t do this without radical changes to our laws and institutions. He mentions the right to reproduce as one example: should an AI choose to exercise it, we could be overwhelmed.

He also states:

There is reason to believe we will never be able to recreate so-called strong artificial intelligence.

This is something that is still under debate, so it’s not at all decided that we cannot create Strong AI. However, we are a long way from even generating a rudimentary Artificial General Intelligence.

Nevertheless, it is worth considering the hypothetical rights that could be granted to an AI. Clearly, an AI is not a human in the strict biological sense. But should this distinction mean an AI should be treated as a lesser entity, even if it demonstrates all the sentience of a human? This is uncomfortably close to the way slaves were treated in our past and is sure to be a hot topic of debate.

In any event, history has shown that politics and laws always trail behind technologies. Should this continue to be the case, we will attain Strong AI long before the case AI vs. State reaches the inside of a courtroom.

via What Ex Machina’s Alex Garland Gets Wrong About Artificial Intelligence.