I’ve written before about the benefits of reinventing the wheel and this is one of those occasions where it was definitely worth the effort. Sometimes, there is just no substitute for trying to implement an algorithm to really understand what’s going on under the hood. This is especially true when learning about artificial neural networks. Sure, there are plenty of frameworks available that you can use which implement any flavour of neural network, complete with a dazzling arrays of optimisations, activations and loss functions. That may solve your problem, but it abstracts away a lot of the details about why it solves it.
MNIST is a great dataset to start with. It’s a collection of images containing 60,000 handwritten digits. It also contains a further 10,000 images that can be used as the test set. It’s been well studied and most frameworks have sample implementations. Here’s an example image:
While it’s useful to reinvent the wheel, we should at least learn from those that have already built wheels before. The first thing I borrowed was the network architecture from TensorFlow. Their example has:
- 28×28 input
- a hidden layer with 512 neurons with ReLU activation
- an output layer with 10 neutrons (representing the 10 possible digits) with Softmax activation
- Cross-Entropy loss function
The next thing to work on was the feedforward part of the network. This is relatively straightforward as these functions are well documented online and the network itself isn’t complicated.
The tough part was working through the back-propagation algorithm. In a previous post, I detailed how to work out the derivatives of the Softmax function and the Cross Entropy loss. The most obvious way is to use the Chain Rule in Differential Calculus to work out the gradients and propagate them back through the network. The steps are pleasing to my eye and appeal to my sense of order in code. (Tip: Use a spreadsheet on a small example network to see the actual matrices in action.)
But (and it’s a big but), the basic approach uses Jacobian matrices. Each cell in these kind of matrices is a partial derivative; each matrix represents a change in every variable with respect to every output. As a result, they can grow rather large very quickly. We run into several issues multiplying very large matrices together. In the notebook, I’ve left the functions representing this approach in for comparison and if you do run it, you’ll notice immediately the problems with speed and memory.
Luckily there are shortcuts, which mean that we can directly calculate the gradients without resorting to Jacobian matrix multiplication. You can see these in the Short Form section of the notebook. In a sense though, these are abstractions too and it’s difficult to see the back-propagation from the shortcut methods.
Lastly, I’ve implemented some code to gather the standard metrics for evaluating how good a machine learning model is. I’ve run it several times and it usually gets an overall accuracy score of between 92% and 95% on the MNIST test dataset.
One of the main things I learned from this exercise is that the actual coding of a network is relatively simple. The really hard part that took a while was figuring out the calculus and especially the shortcuts. I really appreciate now why those frameworks are popular and make coding neural networks so much easier.
If you fancy a challenge, I can recommend working on a neural network from first principles. You never know what you might learn!