The demand for deep learning and neural networks has grown significantly over the past few years.

All of Google's core products use deep learning, including Search, Translate, Photos, Gmail, and the Google Assistant.

As shown in this live stream, TPU (Tensor Processing Unit) development started at Google in 2013, and production use began in 2015.


What is a Tensor Processing Unit (TPU)?

A Tensor Processing Unit (TPU) is a custom computer chip designed by Google specifically for deep learning.

There are many machine learning models such as random forests, support vector machines, and neural networks.

One of the main reasons deep learning has become so popular is that, given enough data, neural networks have been shown to outperform most other machine learning models.

So why did Google decide to make their own chips rather than just use CPUs or GPUs?

One of the features that makes neural networks so powerful is that their performance does not plateau as quickly as that of traditional algorithms as you feed them more data.

If you give neural networks more and more data, they tend to pull ahead of other models.

What are neural networks?

We won't go into detail here, but neural networks are models built out of linear-algebra operations.

Neural networks are just a chain of matrix operations applied to input data.

We can think of our input data as a matrix of numbers, which we feed to a black-box neural network; the network performs a series of operations and produces an output.

For example, if we feed the neural network an image, it is first broken down into RGB pixel values between 0 and 255.
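This "chain of matrix operations" can be sketched with NumPy. The layer sizes and weights below are made up purely for illustration:

```python
import numpy as np

# A toy "image": 4x4 grayscale pixels with values in [0, 255],
# flattened into a 1x16 input matrix and scaled to [0, 1].
image = np.random.randint(0, 256, size=(4, 4))
x = image.reshape(1, -1) / 255.0

# Two made-up weight matrices: the "black box" is just
# matrix multiplications with a nonlinearity in between.
w1 = np.random.randn(16, 8)
w2 = np.random.randn(8, 3)

hidden = np.maximum(0, x @ w1)   # matrix multiply + ReLU
output = hidden @ w2             # another matrix multiply

print(output.shape)  # (1, 3): one value per hypothetical output class
```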

We mentioned that neural networks outperform other models when given a lot of data, and more data translates to a lot of matrix operations to compute.

Most of the math is just multiplying a bunch of numbers, and adding the results.

We can actually combine these two operations into a single operation, called a multiply-accumulate (MAC).
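As a minimal sketch, the dot product at the heart of a matrix multiply is nothing more than a chain of multiply-accumulate steps:

```python
a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]

acc = 0.0
for x, y in zip(a, b):
    acc += x * y  # one multiply-accumulate (MAC) step

print(acc)  # 32.0
```

Every entry of a matrix product is one such accumulation, so hardware that makes MACs cheap makes neural networks fast.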

Moore's Law

Moore's Law, named after Gordon Moore, predicted that the number of transistors on a chip would double roughly every two years.

For decades this prediction held roughly true, although progress has slowed in recent years.

As Steve Blank says:

For 40 years the chip industry managed to live up to that prediction. The first integrated circuits in 1960 had ~10 transistors. Today the most complex silicon chips have 10 billion. Think about it. Silicon chips can now hold a billion times more transistors.
source
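A quick back-of-the-envelope check of the numbers in that quote, assuming one doubling every two years from 1960 onward:

```python
transistors = 10   # ~10 transistors on the first integrated circuits in 1960
year = 1960
while year < 2020:
    transistors *= 2   # one doubling...
    year += 2          # ...every two years

# 30 doublings later, we land in the ballpark of 10 billion transistors.
print(f"{transistors:,}")
```

The compounding is the whole story: thirty doublings turn ten transistors into roughly ten billion.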

Since pretty much all of Google's products now use deep learning, the death of Moore's law is, of course, not ideal.

So how do we get past the limits of Moore's Law and what hardware can do for deep learning?

To answer this let's review CPUs, GPUs, and TPUs.

Central Processing Units (CPUs)

A CPU is a scalar machine: it accepts single values and processes instructions one step at a time.

  • CPUs can perform matrix operations, but they carry them out sequentially rather than in parallel.
  • A CPU has a handful of cores; each core contains Arithmetic Logic Units (ALUs), control logic, and cache, and connects to off-chip DRAM.
  • CPUs have low compute density and complex control logic, and are optimized for serial operations.
  • CPUs are good for rapid prototyping that requires flexibility.
  • CPUs work well for building small models.

Graphical Processing Units (GPUs)

GPUs are similar to CPUs, but they have hundreds or thousands of cores, as opposed to the handful found in most CPUs.

  • GPUs were designed for 3D game rendering, which involves a lot of parallel processing.
  • GPUs have high compute density, a high ratio of computation to memory access, and are optimized for performing many operations on numbers in parallel.
  • We can think of a GPU as a vector machine, where a vector is a one-dimensional array of numbers that can all be operated on at the same time.
  • GPUs are general-purpose chips, meaning they can do any kind of computation, not just matrix operations.
  • GPUs are good for machine learning models not written in TensorFlow.
  • GPUs are good for medium-to-large models.
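The scalar-vs-vector distinction can be sketched with NumPy: the explicit loop mimics a scalar machine touching one value per step, while the vectorized line operates on the whole array at once, which is the kind of parallelism a GPU exploits in hardware:

```python
import numpy as np

a = np.arange(100_000, dtype=np.float64)
b = np.arange(100_000, dtype=np.float64)

# Scalar style: one element per step, like a CPU inner loop.
out_scalar = np.empty_like(a)
for i in range(len(a)):
    out_scalar[i] = a[i] * b[i]

# Vector style: the whole array in a single operation.
out_vector = a * b

assert np.array_equal(out_scalar, out_vector)
```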

So how do we improve on GPUs, which already do well with vector operations?

Tensor Processing Units (TPUs)

A TPU is an ASIC, or Application Specific Integrated Circuit (another popular use of ASICs is Bitcoin mining).

  • There have been 3 generations of TPUs since 2015.
  • If you want to see a presentation by Google about TPUs, check out this demo site

So why doesn't Bitcoin mining use GPUs anymore? Because ASICs designed specifically to mine Bitcoin vastly outperform general-purpose chips at that one task.

The question is then can we apply the same logic to deep learning?

Yes we can.

The TPU was built as an ASIC for the specific matrix operations that deep learning requires. The downside of this is that TPUs are inflexible.

As Google notes in this blog post:

When Google designed the TPU, we built a domain-specific architecture. That means, instead of designing a general purpose processor, we designed it as a matrix processor specialized for neural network workloads.

A few more points to note about TPUs include:

  • TPUs are great for TensorFlow models that are dominated by matrix computations (i.e. neural networks).
  • TPUs are particularly useful for giant models, i.e. ones that train for weeks or months.

A great example of where TPUs are useful is in machine translation, which requires a huge amount of data to train the models.
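To get a feel for the scale, here is a rough count of multiply-accumulate operations for a single dense layer on one batch (the sizes are made up for illustration, but are in the range of real translation models):

```python
batch_size = 64   # examples processed together (illustrative)
inputs = 1024     # input features per example (illustrative)
outputs = 4096    # neurons in the layer (illustrative)

# Each output value needs one MAC per input feature.
macs = batch_size * inputs * outputs
print(f"{macs:,} MACs for one layer, one batch")  # 268,435,456
```

That is over a quarter of a billion MACs for a single layer of a single batch; real models stack dozens of layers and run millions of batches, which is why a chip built around cheap MACs pays off.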

Case Study: Predict Shakespeare with Cloud TPUs & Keras

Here's the example Google provides for using cloud TPUs in Google Colab.

This example uses tf.keras to build a language model and train it on a Cloud TPU. This language model predicts the next character of text given the text so far. The trained model can generate new snippets of text that read in a style similar to the training data.

Using a cloud TPU, the model trains for 10 epochs in approximately 5 minutes.

We first download The Complete Works of William Shakespeare from Project Gutenberg.

We're then going to train the model on the combined works of Shakespeare, and use the model to compose a play in the style of The Great Bard:

The model:

  • The model is defined as a two-layer forward LSTM using Keras
  • The tf.contrib.tpu.keras_to_tpu_model function converts a tf.keras model to an equivalent TPU version
  • We then use the standard Keras methods to train: fit, predict, and evaluate

Predictions:

  • We're then going to use the trained model to make predictions and generate our own Shakespeare-esque play.
  • We start the model off with a seed sentence, then generate 250 characters from it.
  • The model makes five predictions from the initial seed.
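The generation loop itself can be sketched in plain Python. The `predict_next_char` function below is a random stand-in for the trained LSTM (which would return a probability distribution over characters), so only the seed-and-extend sampling logic is real here:

```python
import random
import string

random.seed(0)

def predict_next_char(context):
    # Stand-in for the trained model: picks a random character.
    # The real model would condition on the context so far.
    return random.choice(string.ascii_lowercase + " ")

seed = "The quick brown fox jumps over the lazy dog"
generated = seed
for _ in range(250):
    generated += predict_next_char(generated)

print(generated[len(seed):])  # the 250 newly generated characters
```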


Summary: What are Tensor Processing Units (TPUs)?

To summarize, TPUs are Google's custom-made ASICs used to accelerate machine learning workloads.

Cloud TPU is designed to run cutting-edge machine learning models with AI services on Google Cloud.

If you want to learn more about TPUs check out the resources below:

Resources