In this guide we provide an overview of a class of deep learning commonly applied to image data: convolutional neural networks.

A convolutional neural network (CNN) is a sub-class of the deep learning family that's commonly applied to analyzing visual images. As we'll see, CNNs use a variation of multilayer perceptrons.

Before defining how convolutional neural networks work, let's review a few of the applications of CNNs.

1. Applications of Convolutional Neural Networks

CNNs achieve state-of-the-art performance in a variety of areas including:

  • Image classification and object detection
  • Voice-user interfaces
  • Natural language processing
  • Computer vision

An example of voice-user interfaces is Google's WaveNet model, which can take text as input and output computer-generated audio of a human reading the text.

In terms of natural language processing, Recurrent Neural Networks (RNNs) are used more commonly than CNNs since NLP is a sequential problem. CNNs can be used in this area, however, for tasks such as extracting information from sentences.

1. Text Classification

You can read a great article on implementing a CNN for text classification using TensorFlow here.

2. Machine Translation

An example of CNNs for machine translation comes from Facebook, which:

published research results using a novel convolutional neural network (CNN) approach for language translation that achieves state-of-the-art accuracy at nine times the speed of recurrent neural systems.

3. Playing Video Games

A third example is playing Atari games using one of our favorite topics at MLQ - Reinforcement Learning and CNNs. DeepMind published a paper that describes a:

DeepRL system which combines Deep Neural Networks with Reinforcement Learning at scale for the first time, and is able to master a diverse range of Atari 2600 games to superhuman level with only the raw pixels and score as inputs.

In this example, DeepMind used CNNs to teach artificially intelligent agents to play video games such as Atari Breakout.

You can watch a great video from Two Minute Papers here that demonstrates the agent learning to play the game and discovering a game hack that even the algorithm developers didn't know existed.

4. Playing Board Games

Finally, another example of CNNs from DeepMind is the famous AlphaGo. From the paper they released in Nature:

The game of Go has long been viewed as the most challenging of classic games for artificial intelligence owing to its enormous search space and the difficulty of evaluating board positions and moves.

To win against Lee Sedol, one of the world's best Go players, AlphaGo trained on 160,000 games recorded from top Go players, and then on 30 million more games it played against itself.

2. What are Convolutional Neural Networks?

In general, CNNs can look at images and learn to identify spatial patterns such as shapes and colors.

The shapes, colors, and anything else that distinguishes an image are often called features.

A CNN can learn to identify these features and then can be used for image classification.

How do computers interpret images?

Images are interpreted by computers as an array.

This array consists of a grid of values, where each grid cell is called a pixel, and each pixel has a numerical value.

One of the most famous datasets in the deep learning world is the MNIST dataset, which contains grey-scale hand-written digits. Each image in this dataset is 28 pixels high by 28 pixels wide, so it is interpreted by a computer as a 28 x 28 array.

In a typical grey-scale image, white pixels have the value of 255, black pixels have the value 0, and grey pixels fall somewhere in between.

Color images have similar numerical representations for each pixel color.


Normalization, which here means rescaling pixel values from the 0-255 range to a range of 0 to 1, is an important step in pre-processing our data.

We want normalized pixel values because neural networks rely on gradient calculations.

Neural networks are essentially trying to determine how important, or how weighted, a certain pixel should be in determining the class of an image.

Normalizing the pixel values helps the gradient calculations stay consistent.
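As a rough illustration of this step, the sketch below rescales 8-bit grey-scale values to the 0-1 range by dividing by 255. The tiny 2 x 3 image and the `normalize` helper are made up for illustration; other normalization schemes (such as subtracting a mean) are also used in practice.

```python
# Rescale 8-bit grey-scale pixel values from the 0-255 range to the 0-1 range.
# The tiny 2 x 3 "image" below is made up for illustration.
image = [
    [0, 128, 255],
    [64, 192, 32],
]

def normalize(img, max_value=255.0):
    """Divide every pixel by the maximum possible intensity."""
    return [[pixel / max_value for pixel in row] for row in img]

normalized = normalize(image)
print(normalized[0])  # first row after rescaling
```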

3. Image Classification

Now that we have normalized our data, how can we go about classifying the images?

One way to do this is with a multilayer perceptron (MLP).

MLPs only take vectors as inputs, so to use them we have to first convert our image array into a vector. This process is referred to as flattening.
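Flattening can be sketched in a few lines. The `flatten` helper below is a hypothetical name, and the blank 28 x 28 image simply stands in for an MNIST digit; it joins the rows of the grid end to end.

```python
def flatten(img):
    """Concatenate the rows of a 2-D image into one 1-D list (row-major order)."""
    return [pixel for row in img for pixel in row]

# A blank 28 x 28 image standing in for an MNIST digit.
blank_digit = [[0.0] * 28 for _ in range(28)]
vector = flatten(blank_digit)
print(len(vector))  # 784
```

Note that after flattening, two pixels that were vertical neighbours in the grid end up 28 positions apart in the vector.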

MLP Structure & Class Scores

After normalizing our data, we create a neural network for discovering patterns in the data.

After training, our model should be able to look at new images (i.e. unseen data) and classify the images. The unseen data is referred to as our test data.

As we mentioned, the images are 28 x 28, so there should be 784 entries in the input layer of our neural network.

We want the number of nodes in our output layer to equal the number of classes being predicted - in the case of the MNIST digits dataset this means 10 nodes, one for each possible class 0-9.

The values of the output layer are often referred to as Class Scores, which indicate how sure the network is that a given input belongs to a specific class.

A higher value means the network is more certain the image belongs to a certain class.

The Class Scores are often represented as a vector of values indicating the relative strength of the scores.

The part of the MLP structure that is up to the Machine Learning Engineer to decide is the number of hidden layers, or the number of layers in between the input and output layer.

We must decide how many hidden layers to use, and how many nodes should be in each one.

This is a question we encounter often as we define neural networks for a variety of tasks, not just image classification.

To help solve this, it is recommended to read papers on the specific task at hand.

For example, if we search for "MLP for MNIST hidden layers" we see from the Keras repository that 2 hidden layers should do the trick.

Both hidden layers use 512 nodes with a ReLU activation function.

We know that the more hidden layers we add the more complex patterns this network will be able to detect, but we want to avoid adding unnecessary complexity.
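One way to get a feel for this complexity is to count trainable parameters. The sketch below assumes an MLP with a 784-node input, two hidden layers of 512 nodes each, and 10 output nodes; `count_parameters` is a hypothetical helper, not part of any library.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the number of nodes per layer, input first.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # one weight per connection
        total += n_out         # one bias per node in the receiving layer
    return total

mlp = [784, 512, 512, 10]
print(count_parameters(mlp))  # 669706
```

Adding a third 512-node hidden layer to this assumed network would add another 262,656 parameters, which is why extra layers should be justified by the task.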

4. Loss & Optimization

Now we have our MLP structure, how does the network actually learn from our dataset?

As an example, suppose we take an image of a handwritten 3 and feed it into the network.

If the Class Scores show that the network incorrectly thinks it is actually an 8, we can tell our network to learn from the mistake.

To measure any mistakes the network is making we use a loss function, whose job it is to measure the difference between the predicted and true class labels.

Using backpropagation we can calculate the gradient of the loss with respect to the model's weights. In other words, we quantify how bad a particular weight is in order to determine which weights in the network are responsible for the errors.

Using this calculation we can choose an optimization function, which gives us a way to calculate a better weight value.

A common way to do this (as is shown in the Keras repo) is to apply a softmax activation function to the output layer to convert the Class Scores into probabilities.

Now, the output is the model's predicted probability of each image class.
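A minimal sketch of softmax, using only the standard library, might look like the following. The class scores are made up for illustration; in a real network they would come from the output layer.

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    # Subtracting the largest score avoids overflow in exp() without
    # changing the result.
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical class scores for the digits 0-9; the highest is at index 8.
class_scores = [1.0, 0.5, 0.2, 2.5, 0.1, 0.3, 0.0, 0.4, 3.0, 0.6]
probabilities = softmax(class_scores)
predicted_class = probabilities.index(max(probabilities))
print(predicted_class)  # 8
```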

In order to get the model's prediction closer to the truth, we need to define a measure of exactly how far off the model currently is from perfection.

Since we're creating a multi-class image classifier, we use the categorical cross-entropy loss.
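For a single example, categorical cross-entropy can be sketched as the negative log of the probability the model assigned to the true class. The function name, the 3-class setup, and the probabilities below are illustrative assumptions.

```python
import math

def categorical_cross_entropy(true_one_hot, predicted_probs, eps=1e-12):
    """Return -sum(y_true * log(y_pred)) for a single example."""
    return -sum(
        t * math.log(max(p, eps))  # eps guards against log(0)
        for t, p in zip(true_one_hot, predicted_probs)
    )

# True label is class 1 of 3 (made-up example).
target = [0.0, 1.0, 0.0]
confident_right = [0.05, 0.90, 0.05]
confident_wrong = [0.90, 0.05, 0.05]

loss_right = categorical_cross_entropy(target, confident_right)
loss_wrong = categorical_cross_entropy(target, confident_wrong)
print(loss_right < loss_wrong)  # True
```

The loss is small when the model puts high probability on the correct class and grows quickly as that probability approaches zero.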

As the model trains, the goal will be to find the weights that minimize this loss function and output the most accurate predictions.

The standard method for minimizing the loss and optimizing for the best weight values is called Gradient Descent.
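The core update rule can be shown on a toy problem. The sketch below minimizes a one-weight loss, L(w) = (w - 3)^2, rather than a real network loss; the function names and the learning rate are assumptions chosen for illustration.

```python
def gradient_descent(grad_fn, w, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient: w <- w - lr * dL/dw."""
    for _ in range(steps):
        w = w - learning_rate * grad_fn(w)
    return w

# Toy loss L(w) = (w - 3)^2 with gradient dL/dw = 2 * (w - 3).
# Its minimum is at w = 3.
def toy_gradient(w):
    return 2.0 * (w - 3.0)

best_w = gradient_descent(toy_gradient, w=0.0)
print(round(best_w, 4))  # 3.0
```

In a real network the same rule is applied to every weight, with the gradients supplied by backpropagation.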

5. Model Validation

So how do we determine the exact number of epochs to train our model for?

We of course want the model to be accurate, but not overfit the training data.

One method that's used in practice is breaking the dataset into 3 sets:

  1. Training Set
  2. Validation Set
  3. Test Set

Each set is treated separately by the model.

The model looks at the training set only when it's training and deciding how to modify its weights.

After each training epoch we check how the model is doing by looking at the training loss and the loss on the validation set.

It's important to note, however, that the model does not use any of the validation set during backpropagation.

Since the model doesn't use the validation set for deciding its weights, it can tell us if we're overfitting the training set.

To summarize model validation:

  • We use the training set to update the model weights.
  • We use the validation set to check how well the model generalizes to a dataset separate from the training set.
  • The test set is used to check the accuracy of the trained model.
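A three-way split could be sketched as follows. The 60/20/20 proportions, the fixed seed, and the `split_dataset` helper are assumptions for illustration; the right proportions depend on how much data is available.

```python
import random

def split_dataset(samples, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle the samples and cut them into train, validation and test lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

train_set, val_set, test_set = split_dataset(range(100))
print(len(train_set), len(val_set), len(test_set))  # 60 20 20
```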

Image Classification: Step-by-Step Review

So far we've discussed how to approach the task of image classification. Let's review this process step-by-step.

  1. Load the dataset and visualize our data
  2. Preprocess the data by normalizing it and converting it to a tensor so that it's prepped to be processed by the layers of a neural network
  3. Define a model architecture - in this step we need to research how others have approached this task or a similar one before
  4. Train the model - we do this by defining loss and optimization functions and then proceed with training the model
  5. Save the best model - consider using a validation set to select and save the best model during training
  6. Test the trained model on unseen data
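Step 5 can be illustrated by picking the epoch with the lowest validation loss. The loss values below are made up, and `best_epoch` is a hypothetical helper; in practice a training framework would typically save a checkpoint whenever the validation loss improves.

```python
def best_epoch(validation_losses):
    """Return the index of the epoch with the lowest validation loss."""
    return min(range(len(validation_losses)), key=validation_losses.__getitem__)

# Made-up losses: training loss keeps falling while validation loss
# starts rising after epoch 3, a typical sign of overfitting.
train_losses = [0.90, 0.60, 0.40, 0.30, 0.22, 0.15]
val_losses = [0.95, 0.70, 0.55, 0.50, 0.58, 0.66]

epoch_to_keep = best_epoch(val_losses)
print(epoch_to_keep)  # 3
```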

6. Convolutional Neural Networks for Image Classification

So far we have used MLPs for image classification.

For the MNIST database, which is very clean and pre-processed, an MLP will do fine for this task.

For most image classification tasks, however, Convolutional Neural Networks are far superior to MLPs.

In the case of real-world messy image data, CNNs are the way to go.

To understand why, remember that with MLPs we first convert the image into a vector.  The MLP treats this converted image as a simple vector of numbers with no special structure. It doesn't take into account that these numbers were originally spatially arranged in a grid.

CNNs, on the other hand, are built for the exact purpose of working with the patterns of multidimensional data.

CNNs understand that image pixels that are closer together are more related than pixels that are far apart.

Before moving forward let's review a few differences between CNNs and MLPs.

MLPs:

  • Only use fully connected layers
  • Only accept vectors as input

CNNs:

  • Also use sparsely connected layers
  • Also accept matrices as input

7. Convolutional Layers

A convolutional neural network is a special kind of neural network in that it can remember spatial information.

The networks we've looked at so far for the image classification task only look at individual inputs.

Convolutional neural networks, on the other hand, can look at the image as a whole (or in regions) and analyze groups of pixels at the same time.

The key to preserving this spatial information is something called the convolutional layer.

The convolutional layer is produced by applying a series of many different image filters, also known as convolutional kernels, to an input image.

The resulting filtered images have different appearances, which may have extracted certain features like the edges of an object in the image.

For example, in the case of handwritten digits and the MNIST database, the CNN should learn to identify spatial patterns such as the curves that make up the number 8 as opposed to the lines that make up the number 7.

Subsequent layers in the neural network will learn how to combine different color and spatial features to produce an output of our class labels.

Filters that Define a Convolutional Layer

When we talk about spatial patterns in an image, we're often talking about either colors or shapes.

Shapes can be thought of as patterns of intensity in an image. Intensity is a measure of light and dark, similar to brightness. We can use this knowledge to detect the shape of objects in an image.

You can often determine the edges of shapes by looking for abrupt changes in intensity.

In image processing, filters are used to filter out unwanted or irrelevant information in an image.
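To make the idea of a filter concrete, the sketch below slides a 3 x 3 Sobel-style kernel over a tiny image that is dark on the left and bright on the right. The image, the kernel choice, and the `convolve2d` helper are illustrative assumptions; in a CNN the kernel values are learned during training rather than set by hand.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (stride 1, no padding).

    Like most deep learning libraries, this computes cross-correlation:
    the kernel is not flipped before it is applied.
    """
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Weighted sum of the patch under the kernel.
            row.append(sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(k_h)
                for dj in range(k_w)
            ))
        output.append(row)
    return output

# A 4 x 5 image that is dark (0) on the left and bright (1) on the right.
image = [[0, 0, 0, 1, 1] for _ in range(4)]

# A Sobel-style filter that responds to left-to-right changes in intensity.
vertical_edge_filter = [
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
]

feature_map = convolve2d(image, vertical_edge_filter)
print(feature_map)  # [[0, 4, 4], [0, 4, 4]]
```

The output is zero where the patch is uniformly dark and large where the patch straddles the abrupt change in intensity, which is exactly where the edge is.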

Pooling Layers

The final type of layer we need before building a convolutional neural network is the pooling layer.

Pooling layers often take convolutional layers as input.

As we mentioned, a convolutional layer is a stack of feature maps where we have one feature map for each filter.

A complicated dataset with many object categories requires a large number of filters, but each filter we add increases the dimensionality of our convolutional layers.

Higher dimensionality means we need to use more parameters, which can lead to overfitting.

The role of pooling layers in a convolutional neural network is thus to reduce this dimensionality.

One type of pooling is called a Max Pooling Layer, which takes a stack of feature maps as input.

As with convolutional layers we define a window size and stride.
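Max pooling keeps only the largest value inside each window. The sketch below applies it to a single made-up 4 x 4 feature map with a 2 x 2 window and a stride of 2; `max_pool2d` is a hypothetical helper, and a real layer would apply the same operation to every feature map in the stack.

```python
def max_pool2d(feature_map, window=2, stride=2):
    """Keep only the largest value in each window of the feature map."""
    out_h = (len(feature_map) - window) // stride + 1
    out_w = (len(feature_map[0]) - window) // stride + 1
    return [
        [
            max(
                feature_map[i * stride + di][j * stride + dj]
                for di in range(window)
                for dj in range(window)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

feature_map = [
    [1, 9, 6, 4],
    [5, 4, 7, 8],
    [5, 1, 2, 9],
    [6, 7, 6, 0],
]
pooled = max_pool2d(feature_map)
print(pooled)  # [[9, 8], [7, 9]]
```

With these settings the width and height are both halved, so the pooled map holds a quarter of the original values.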

Feature Vectors

Convolutional neural networks discover patterns contained in an image, and a sequence of layers is responsible for this discovery.

The layers of a CNN convert an input array into a representation, which is often called the feature level representation of an image, or a feature vector.

Image Augmentation

When we are classifying objects in images there is a lot of irrelevant information that we have to deal with.

All we want our algorithm to do is determine if the object is present in the image or not.  In other words, we want our algorithm to learn an invariant representation of the image.

  • We don't care about the size of the object, which is referred to as scale invariance.
  • We don't care about the angle of the object, which is referred to as rotation invariance.
  • We don't care about whether it is on the left or right side, which is referred to as translation invariance.

Convolutional neural networks have some built-in translation invariance.
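Image augmentation typically encourages the remaining invariances by adding transformed copies of the training images. The sketch below shows two simple transformations on a made-up 2 x 3 image; the helper names are hypothetical, and which transformations are appropriate depends on the data (mirroring handwritten digits, for instance, would usually not make sense).

```python
def flip_horizontal(img):
    """Mirror the image left to right."""
    return [list(reversed(row)) for row in img]

def shift_right(img, pixels, fill=0):
    """Move the content right by `pixels` columns, padding with `fill`."""
    width = len(img[0])
    return [([fill] * pixels + list(row))[:width] for row in img]

original = [
    [1, 2, 0],
    [3, 4, 0],
]
augmented = [original, flip_horizontal(original), shift_right(original, 1)]
print(augmented[1])  # [[0, 2, 1], [0, 4, 3]]
print(augmented[2])  # [[0, 1, 2], [0, 3, 4]]
```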

8. CNN Architectures

Three groundbreaking CNN architectures that won the ImageNet Large Scale Visual Recognition Competition (ILSVRC) include:

If you want to see the code for CNNs you can find our tutorial on How to Build a Convolutional Neural Network in Python with Keras here, where we look at both the MNIST and Fashion MNIST datasets.

9. Summary: What are Convolutional Neural Networks

Convolutional neural networks are a sub-class of the deep learning family that achieve state-of-the-art performance in image analysis, as well as many other tasks, by following a series of steps.

  • A CNN first takes in an input image and puts it through several convolutional and pooling layers
  • The result is a set of feature maps reduced in size from the original image
  • Through a training process, the feature maps have learned to distill information about the content in the original image
  • We then flatten these maps to create a feature vector that we can pass to a series of fully-connected linear layers to produce a probability distribution of class scores
  • From this, we extract the predicted class for the input image
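The size reduction described in these steps can be traced with simple arithmetic. The sketch below assumes a hypothetical network with two rounds of 3 x 3 convolution (stride 1, no padding) followed by 2 x 2 max pooling (stride 2), and 64 feature maps in the last layer; none of these choices come from a specific published architecture.

```python
def output_size(input_size, window, stride=1):
    """Side length after a convolution or pooling step with no padding."""
    return (input_size - window) // stride + 1

size = 28                                     # 28 x 28 input image
size = output_size(size, window=3)            # 3 x 3 convolution -> 26
size = output_size(size, window=2, stride=2)  # 2 x 2 max pooling -> 13
size = output_size(size, window=3)            # 3 x 3 convolution -> 11
size = output_size(size, window=2, stride=2)  # 2 x 2 max pooling -> 5

num_filters = 64  # hypothetical number of feature maps in the last layer
feature_vector_length = size * size * num_filters
print(size, feature_vector_length)  # 5 1600
```

Under these assumptions, flattening the final stack of feature maps gives a 1,600-element feature vector for the fully-connected layers.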

In short, we pass an image as input through the CNN and a predicted class label comes out.

It's important to note that CNNs are not restricted to the image classification task. Convolutional neural networks can be applied to any task that has a fixed number of outputs.

Further Resources