Originally published on 09/10/2019, updated on 10/21/2022.
In this guide, we provide an overview of a class of deep learning models commonly applied to image data: convolutional neural networks (CNNs).
A convolutional neural network is a sub-class of the deep learning family that uses a variation of multilayer perceptrons.
Before defining how convolutional neural networks work, let's review a few of the applications of CNNs.
1. Applications of Convolutional Neural Networks
CNNs achieve state-of-the-art performance in a variety of areas, including:
- Image classification and object detection
- Voice-user interfaces
- Natural language processing
- Computer vision
1. Text Classification
You can read a great article on implementing a CNN for text classification using TensorFlow here.
2. Machine Translation
One example of CNNs for machine translation comes from Facebook, which:
published research results using a novel convolutional neural network (CNN) approach for language translation that achieves state-of-the-art accuracy at nine times the speed of recurrent neural systems.
3. Playing Video Games
Another application of CNNs is playing Atari games by combining them with one of our favorite topics at MLQ: Reinforcement Learning. DeepMind published a paper that describes a:
DeepRL system which combines Deep Neural Networks with Reinforcement Learning at scale for the first time, and is able to master a diverse range of Atari 2600 games to superhuman level with only the raw pixels and score as inputs.
In this example, DeepMind used CNNs to teach artificially intelligent agents to play video games such as Atari Breakout.
You can watch a great video from Two Minute Papers here that demonstrates the agent learning to play the game and discovering a game hack that even the algorithm developers didn't know existed.
4. Playing Board Games
The game of Go has long been viewed as the most challenging of classic games for artificial intelligence owing to its enormous search space and the difficulty of evaluating board positions and moves.
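To win against Lee Sedol, one of the world's top Go players, DeepMind's AlphaGo first trained on 160,000 games recorded from top Go players, and then on 30 million more games it played against itself. AlphaGo's networks are convolutional and treat the 19 x 19 board as an image.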
2. What are Convolutional Neural Networks?
In general, CNNs look at images and learn to identify spatial patterns such as shapes and colors. The shapes, colors, and anything else that distinguishes an image are called features.
A CNN can learn to identify these features and then can be used for image classification.
How do computers interpret images?
Images are interpreted by computers as an array.
This array consists of a grid of values, where each grid cell is called a pixel, and each pixel has a numerical value.
One of the most famous datasets in the deep learning world is the MNIST dataset, which contains grey-scale hand-written digits. Each image in this dataset is 28 by 28 pixels, so it is interpreted as a 28 x 28 array.
In a typical grey-scale image, white pixels have a value of 255, black pixels have a value of 0, and grey pixels fall somewhere in between. Color images have similar numerical representations for each pixel color.
Normalization, which here means rescaling the pixel values from the 0-255 range to the 0-1 range, is an important step in pre-processing our data.
The reason that we want normalized pixel values is that neural networks rely on gradient calculations.
Neural networks are essentially trying to determine how important, or how weighted, a certain pixel should be in determining the class of an image.
Normalizing the pixel values helps the gradient calculations stay consistent.
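As a minimal sketch of this step, assuming normalization means dividing 8-bit pixel values by 255 so they fall between 0 and 1, the example below uses NumPy. The array `raw_patch` is a made-up patch of pixels, not real MNIST data.

```python
import numpy as np

# A hypothetical 2 x 3 grey-scale patch with raw 8-bit values
# (0 = black, 255 = white).
raw_patch = np.array([[0, 128, 255],
                      [64, 192, 32]], dtype=np.uint8)

# Convert to float first, then rescale from [0, 255] to [0, 1].
normalized_patch = raw_patch.astype(np.float32) / 255.0

print(normalized_patch.min(), normalized_patch.max())  # 0.0 1.0
```

Converting to a floating-point type before dividing keeps the fractional values that an integer array could not store.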
3. Image Classification
Now that we have normalized our data, how can we go about classifying the images?
One way to do this is with a Multilayer perceptron (MLP).
MLPs only take vectors as inputs, so to use them we have to first convert our image array into a vector. This process is referred to as flattening.
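Flattening can be sketched in a few lines of NumPy. The image here is filled with random values as a stand-in for a real digit, and the row-by-row ordering is simply NumPy's default, one reasonable choice among several.

```python
import numpy as np

# Stand-in for one 28 x 28 image (random values, not real MNIST data).
rng = np.random.default_rng(seed=0)
image = rng.random((28, 28))

# Flatten the 2-D grid row by row into a 1-D vector of 28 * 28 = 784 values.
vector = image.reshape(-1)

print(image.shape, vector.shape)  # (28, 28) (784,)

# The pixel at row r, column c ends up at index r * 28 + c.
assert vector[3 * 28 + 5] == image[3, 5]
```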
MLP Structure & Class Scores
After normalizing our data, we create a neural network for discovering patterns in the data.
After training, our model should be able to look at new images (i.e. unseen data) and classify the images. The unseen data is referred to as our test data.
As we mentioned, the images are 28 x 28, so after flattening there should be 784 entries in the input layer of our neural network.
We want the number of nodes in our output layer to equal the number of classes being predicted. In the case of the MNIST digits dataset this equals 10 nodes, one for each possible class 0-9.
The output values are often referred to as Class Scores, which indicate how confident the network is that a given input belongs to a specific class.
A higher value means the network is more certain the image belongs to a certain class.
The Class Scores are often represented as a vector of values indicating the relative strength of the scores.
The part of the MLP structure that is up to the Machine Learning Engineer to decide is the number of hidden layers, or the number of layers in between the input and output layer.
We must decide how many hidden layers to use, and how many nodes should be in each one.
This is a question we encounter often as we define neural networks for a variety of tasks, not just image classification.
To help solve this, it is recommended to read papers on the specific task at hand.
For example, searching for "MLP for MNIST hidden layers" leads to an example in the Keras repository, which suggests that 2 hidden layers should do the trick.
Both layers use 512 nodes with a ReLU activation function.
We know that the more hidden layers we add the more complex patterns this network will be able to detect, but we want to avoid adding unnecessary complexity.
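To make the structure concrete, here is a rough NumPy sketch of a forward pass through an MLP, assuming 784 inputs, two hidden layers of 512 ReLU nodes each, and 10 outputs. The weights are random and untrained, and the names `forward` and `layer_sizes` are illustrative only, so the scores it produces are meaningless until the network is trained.

```python
import numpy as np

layer_sizes = [784, 512, 512, 10]  # input, two hidden layers, output
rng = np.random.default_rng(seed=0)

# Random, untrained weights and zero biases; a real model would learn these.
weights = [rng.normal(0.0, 0.01, size=(n_in, n_out))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

def relu(x):
    """ReLU activation: negative values become zero."""
    return np.maximum(x, 0.0)

def forward(x):
    """Return the 10 raw class scores for one flattened image."""
    for w, b in zip(weights[:-1], biases[:-1]):
        x = relu(x @ w + b)              # hidden layers use ReLU
    return x @ weights[-1] + biases[-1]  # output layer: raw class scores

class_scores = forward(rng.random(784))
n_params = sum(w.size + b.size for w, b in zip(weights, biases))

print(class_scores.shape, n_params)  # (10,) 669706
```

Even this small network has roughly 670,000 trainable parameters, which is one reason to avoid adding hidden layers that the task does not need.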
4. Loss & Optimization
Now we have our MLP structure, how does the network actually learn from our dataset?
As an example, let's say we take an image of a handwritten 3 and feed it into the network.
If the Class Scores show that the network incorrectly thinks it is an 8, we can tell our network to learn from the mistake.
To measure any mistakes the network is making we use a loss function, whose job it is to measure the difference between the predicted and true class labels.
Using backpropagation we can calculate the gradient of the loss with respect to the model's weights. In other words, we quantify how bad a particular weight is in order to determine which weights in the network are responsible for the errors.
Using this calculation we can choose an optimization function, which gives us a way to calculate a better weight value.
Before measuring the loss, a common step (as is shown in the Keras repo) is to apply a softmax activation function to the output layer to convert the Class Scores into probabilities.
Now, the output is the model's predicted probability of each image class.
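A minimal sketch of softmax in NumPy follows. The class scores are invented for illustration, and subtracting the maximum score before exponentiating is a common numerical-stability trick rather than something the article prescribes.

```python
import numpy as np

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

# Hypothetical class scores for the digits 0-9; the largest is at index 8.
class_scores = np.array([0.1, 0.3, 0.2, 2.0, 0.1, 0.4, 0.2, 0.1, 3.0, 0.5])
probabilities = softmax(class_scores)

print(probabilities.argmax())  # 8
```

Softmax preserves the ranking of the scores, so the class with the highest score is also the class with the highest probability.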
In order to get the model's prediction closer to the truth, we need to define a measure of exactly how far off the model currently is from perfection.
Since we're creating a multi-class image classifier, we use the categorical cross-entropy loss.
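The sketch below computes categorical cross-entropy for the handwritten 3 from the earlier example. It assumes the true label is one-hot encoded, and both probability vectors are made up to contrast a confident wrong answer with a mostly right one.

```python
import numpy as np

def categorical_cross_entropy(true_one_hot, predicted_probs, eps=1e-12):
    """Loss = -sum(true * log(predicted)); eps guards against log(0)."""
    return -np.sum(true_one_hot * np.log(predicted_probs + eps))

# One-hot label: the image is really a 3.
true_label = np.zeros(10)
true_label[3] = 1.0

# Hypothetical prediction that is confidently wrong (favours 8).
wrong = np.full(10, 0.02)
wrong[8] = 0.80
wrong[3] = 0.04

# Hypothetical prediction that is mostly right (favours 3).
right = np.full(10, 0.02)
right[3] = 0.82

loss_wrong = categorical_cross_entropy(true_label, wrong)
loss_right = categorical_cross_entropy(true_label, right)

print(round(loss_wrong, 3), round(loss_right, 3))
```

Only the probability assigned to the true class contributes to the loss, so the loss shrinks as that probability approaches 1.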
As the model trains, the goal will be to find the weights that minimize this loss function and output the most accurate predictions.
The standard method for minimizing the loss and optimizing for the best weight values is called Gradient Descent.
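Gradient Descent is easiest to see on a toy problem. The sketch below minimizes a one-weight loss, (w - 3)^2, which stands in for the real network loss. The learning rate and step count are arbitrary choices for illustration.

```python
def loss(w):
    """Toy loss with its minimum at w = 3."""
    return (w - 3.0) ** 2

def gradient(w):
    """Derivative of the toy loss with respect to w."""
    return 2.0 * (w - 3.0)

w = 0.0              # initial guess for the weight
learning_rate = 0.1  # step size (hypothetical value)

for step in range(100):
    # Move the weight a small step against the gradient.
    w = w - learning_rate * gradient(w)

print(round(w, 4), round(loss(w), 4))
```

In a real network the same update is applied to every weight, with the gradients supplied by backpropagation.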
5. Model Validation
So how do we determine the exact number of epochs to train our model for?
We of course want the model to be accurate, but not overfit the training data.
One method that's used in practice is breaking the dataset into 3 sets:
- Training Set
- Validation Set
- Test Set
Each set is treated separately by the model.
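One simple way to produce the three sets is to shuffle the sample indices and slice them. The 60/20/20 proportions and the dataset size below are assumptions for illustration, not a recommendation from the article.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

n_samples = 1000                      # hypothetical dataset size
indices = rng.permutation(n_samples)  # shuffle before splitting

n_test = n_samples // 5               # 20% held out for testing
n_val = n_samples // 5                # 20% held out for validation

test_idx = indices[:n_test]
val_idx = indices[n_test:n_test + n_val]
train_idx = indices[n_test + n_val:]  # remaining 60% for training

print(len(train_idx), len(val_idx), len(test_idx))  # 600 200 200
```

Because each index appears in exactly one slice, no sample is shared between the training, validation, and test sets.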
The model looks at the training set only when it's training and deciding how to modify its weights.
After each training epoch, we check how the model is doing by looking at the training loss and the loss on the validation set.
It's important to note, however, that the model does not use any of the validation set during backpropagation.
Since the model doesn't use the validation set for deciding its weights it can tell us if we're overfitting the training set.
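The idea of using the validation loss to decide when to stop can be sketched as follows. The loss values are invented to show a typical pattern: training loss keeps falling while validation loss starts rising once the model overfits.

```python
# Hypothetical per-epoch losses (not from a real training run).
train_losses = [0.90, 0.60, 0.45, 0.35, 0.28, 0.22, 0.18]
val_losses = [0.95, 0.70, 0.55, 0.50, 0.52, 0.57, 0.63]

best_epoch = None
best_val_loss = float("inf")

for epoch, val_loss in enumerate(val_losses, start=1):
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        best_epoch = epoch  # a real training loop would save the model here

print(best_epoch, best_val_loss)  # 4 0.5
```

After epoch 4 the training loss still improves but the validation loss gets worse, which suggests the extra epochs only fit the training set more closely.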
To summarize model validation:
- We use the training set to update the model weights.
- We use the validation set to check how well the model generalizes to a dataset separate from the training set.
- The test set is used to check the accuracy of the trained model.
Image Classification: Step-by-Step Review
So far we've discussed how to approach the task of image classification. Let's review this process step-by-step.
- Load the dataset and visualize our data
- Preprocess the data by normalizing it and converting it to a tensor
- Define a model architecture - in this step it's useful to research how others have approached this task or a similar one before
- Train the model - we do this by defining loss and optimization functions and then proceed with training the model
- Save the best model - consider using a validation set to select and save the best model during training
- Test the trained model on unseen data
6. Convolutional Neural Networks for Image Classification
So far we have used MLPs for image classification.
For the MNIST database, which is very clean and pre-processed, an MLP will do fine.
For most image classification tasks, however, Convolutional Neural Networks are far superior to MLPs.
In the case of real-world messy image data, CNNs are the way to go.
To understand why, remember that with MLPs we first convert the image into a vector. The MLP treats this converted image as a simple vector of numbers with no special structure. It doesn't take into account that these numbers were originally spatially arranged in a grid.
CNNs, on the other hand, are built for the exact purpose of working with the patterns of multidimensional data.
CNNs understand that image pixels that are closer together are more related than pixels that are far apart.
Before moving forward let's review a few differences between CNNs and MLPs.
MLPs:
- Only use fully connected layers
- Only accept vectors as input
CNNs:
- Also use sparsely connected layers
- Also accept matrices as input
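A rough parameter count shows what the difference between fully connected and sparsely connected layers means in practice. The layer sizes (512 nodes, 16 filters of 3 x 3) are hypothetical choices made only for this comparison.

```python
# Fully connected layer: each of the 784 input pixels connects to every node.
n_inputs, n_nodes = 28 * 28, 512
dense_params = n_inputs * n_nodes + n_nodes  # weights + biases

# Convolutional layer: each filter has only a small 3 x 3 window of weights,
# reused at every position in the image.
kernel_size, n_filters, in_channels = 3, 16, 1
conv_params = n_filters * (kernel_size * kernel_size * in_channels) + n_filters

print(dense_params, conv_params)  # 401920 160
```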
7. Convolutional Layers
A convolutional neural network is a special kind of neural network in that it can preserve spatial information.
The networks we've looked at so far for the image classification task only look at individual inputs.
Convolutional neural networks, on the other hand, can look at the image as a whole (or in regions) and analyze groups of pixels at the same time.
The key to preserving this spatial information is something called the convolutional layer.
The convolutional layer is produced by applying a series of many different image filters, also known as convolutional kernels, to an input image.
The resulting filtered images have different appearances, which may have extracted certain features like the edges of an object in the image.
For example, in the case of handwritten digits and the MNIST database, the CNN should learn to identify spatial patterns such as the curves that make up the number 8 as opposed to the lines that make up the number 7.
Subsequent layers in the neural network will learn how to combine different color and spatial features to produce an output of our class labels.
Filters that Define a Convolutional Layer
When we talk about spatial patterns in an image, we're often talking about either colors or shapes.
Shapes can be thought of as patterns of intensity in an image. Intensity is a measure of light and dark, similar to brightness. We can use this knowledge to detect the shape of objects in an image.
You can often determine the edges of shapes by looking for abrupt changes in intensity.
In image processing, filters are used to filter out unwanted or irrelevant information in an image.
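To illustrate how a filter picks out abrupt changes in intensity, the sketch below slides a Sobel-style kernel over a toy image. The helper `convolve2d` is written here for illustration and, like most deep learning libraries, it does not flip the kernel (strictly speaking this is cross-correlation). In a CNN the kernel values would be learned rather than hand-picked.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image (stride 1, no padding)."""
    k_h, k_w = kernel.shape
    out_h = image.shape[0] - k_h + 1
    out_w = image.shape[1] - k_w + 1
    output = np.zeros((out_h, out_w))
    for row in range(out_h):
        for col in range(out_w):
            window = image[row:row + k_h, col:col + k_w]
            output[row, col] = np.sum(window * kernel)
    return output

# Toy 5 x 6 image: dark left half (0), bright right half (1).
image = np.zeros((5, 6))
image[:, 3:] = 1.0

# Sobel-style kernel that responds to left-to-right changes in intensity.
vertical_edge_kernel = np.array([[-1, 0, 1],
                                 [-2, 0, 2],
                                 [-1, 0, 1]])

feature_map = convolve2d(image, vertical_edge_kernel)
print(feature_map)  # every row is [0. 4. 4. 0.]
```

The output is zero over the flat dark and bright regions and large only where the window straddles the edge between them.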
The final type of layer we need before building a convolutional neural network is the pooling layer.
Pooling layers often take convolutional layers as input.
As we mentioned, a convolutional layer is a stack of feature maps, with one feature map for each filter.
A complicated dataset with many object categories requires a large number of filters, but each filter we add increases the dimensionality of our convolutional layers.
Higher dimensionality means we need to use more parameters, which can lead to overfitting.
The role of pooling layers in a convolutional neural network is thus to reduce this dimensionality.
One type of pooling is called a Max Pooling Layer, which takes a stack of feature maps as input.
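As with convolutional layers, we define a window size and stride. The layer slides this window over each feature map and keeps only the maximum value in each window, which produces a smaller feature map.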
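A minimal sketch of max pooling on a single feature map follows, assuming a 2 x 2 window and a stride of 2. Each window is replaced by its largest value. The function name `max_pool2d` and the input values are illustrative.

```python
import numpy as np

def max_pool2d(feature_map, window=2, stride=2):
    """Keep the maximum value inside each window of the feature map."""
    out_h = (feature_map.shape[0] - window) // stride + 1
    out_w = (feature_map.shape[1] - window) // stride + 1
    pooled = np.zeros((out_h, out_w))
    for row in range(out_h):
        for col in range(out_w):
            r, c = row * stride, col * stride
            pooled[row, col] = feature_map[r:r + window, c:c + window].max()
    return pooled

feature_map = np.array([[1, 9, 2, 4],
                        [5, 6, 2, 1],
                        [4, 3, 1, 0],
                        [2, 5, 6, 3]], dtype=float)

pooled = max_pool2d(feature_map)
print(pooled)  # [[9. 4.] [5. 6.]] as a 2 x 2 array
```

The 4 x 4 map shrinks to 2 x 2, so the layers that follow have a quarter as many values to process.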
Convolutional neural networks discover patterns contained in an image, and a sequence of layers is responsible for this discovery.
The layers of a CNN convert an input array into a representation, which is often called the feature-level representation of an image, or a feature vector.
When we are classifying objects in images there is a lot of irrelevant information that we have to deal with.
All we want our algorithm to do is determine if the object is present in the image or not. In other words, we want our algorithm to learn an invariant representation of the image.
- We don't care about the size of the object, which is referred to as scale invariance.
- We don't care about the angle of the object, which is referred to as rotation invariance.
- We don't care about whether it is on the left or right side, which is referred to as translation invariance.
Convolutional neural networks have some built-in translation invariance.
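Pooling is one commonly cited source of this partial invariance. The toy example below, which assumes 2 x 2 max pooling, shows that moving a detected feature by one pixel inside a pooling window leaves the pooled output unchanged, while a larger shift does change it.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """2 x 2 max pooling with stride 2 (height and width must be even)."""
    h, w = feature_map.shape
    return feature_map.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def single_activation(row, col, size=4):
    """Toy feature map that is zero except for one detected feature."""
    feature_map = np.zeros((size, size))
    feature_map[row, col] = 1.0
    return feature_map

original = max_pool_2x2(single_activation(0, 0))
small_shift = max_pool_2x2(single_activation(1, 1))  # moved by one pixel
large_shift = max_pool_2x2(single_activation(2, 2))  # moved by two pixels

print(np.array_equal(original, small_shift))  # True
print(np.array_equal(original, large_shift))  # False
```

This is why the invariance is only partial: small shifts are absorbed, but larger ones still change the representation.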
8. CNN Architectures
Three groundbreaking CNN architectures that won or placed highly in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) include:
- AlexNet in 2012 - read the paper here
- VGGNet in 2014 - read the paper here
- ResNet in 2015 - read the paper here
If you want to see the code for CNNs you can find our tutorial on How to Build a Convolutional Neural Network in Python with Keras here, where we look at both the MNIST and Fashion MNIST datasets.
9. Summary: What are Convolutional Neural Networks
Convolutional neural networks are a sub-class of the deep learning family that achieve state-of-the-art performance in image analysis, as well as many other tasks, by following a series of steps.
- A CNN first takes in an input image and puts it through several convolutional and pooling layers
- The result is a set of feature maps reduced in size from the original image
- Through a training process, the feature maps have learned to distill information about the content in the original image
- We then flatten these maps to create a feature vector that we can pass to a series of fully-connected linear layers to produce a probability distribution of class scores
- From this, we extract the predicted class for the input image
In short, we pass an image as input through the CNN and a predicted class label comes out.
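The way feature maps shrink through the network can be traced with simple arithmetic. The sketch below assumes a hypothetical architecture for a 28 x 28 grey-scale image: two stages of 3 x 3 convolution (stride 1, no padding) followed by 2 x 2 max pooling, with 16 and then 32 filters.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Width (or height) of a feature map after a convolution."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_output_size(size, window=2, stride=2):
    """Width (or height) of a feature map after pooling."""
    return (size - window) // stride + 1

size, channels = 28, 1  # 28 x 28 grey-scale input

# Hypothetical architecture: two conv + pool stages with 16 then 32 filters.
for n_filters in (16, 32):
    size = conv_output_size(size, kernel=3)
    size = pool_output_size(size)
    channels = n_filters
    print(size, size, channels)

# Flattening the final stack of feature maps gives the feature vector.
feature_vector_length = size * size * channels
print(feature_vector_length)  # 800
```

The 28 x 28 image becomes a stack of 32 feature maps of 5 x 5 each, which flattens into an 800-value feature vector for the fully-connected layers.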
It's important to note that CNNs are not restricted to the image classification task. Convolutional neural networks can be applied to any task that has a fixed number of outputs.