In the field of deep learning, variational autoencoders and generative adversarial networks (GANs) have been two of the most interesting developments in the past few years.
In this article we will focus on variational autoencoders, and in the next article we will discuss GANs. The following article is based on notes from this course on Deep Learning: GANs and Variational Autoencoders and is organized as follows:
- Introduction to Variational Autoencoders
- Review of Generative Modeling
- What is a Variational Autoencoder?
- Variational Autoencoder Architecture
- Latent Space, Predictive Distributions, and Samples
- Training an Autoencoder
- Implementing a Variational Autoencoder in TensorFlow
- Summary of Variational Autoencoders
Introduction to Variational Autoencoders
As we will discuss, variational autoencoders are a combination of two big ideas:
- Bayesian machine learning
- Deep learning
We essentially take a problem that is formulated using a Bayesian paradigm and transform it into a deep learning problem that uses a neural network and is trained with gradient descent.
What is Unsupervised Learning?
Unsupervised learning means we're not trying to map inputs to targets, but rather we're trying to learn the structure of our inputs.
There are several things we can do once we've learned this structure, a few of which include generating text, generating art, and generating music.
Before we get into variational autoencoders, let's first review generative modeling.
Review of Generative Modeling
Sampling with a Bayes Classifier
Let's first review what it means to "sample" from a learned model. To do this, we'll first build a Bayes Classifier.
A Bayes Classifier is a generative model, which means that for each class $y$ we model the distribution $p(x | y)$ rather than directly modeling $p(y | x)$.
As we'll see, we're going to use this to generate samples.
There are many ways to learn such a distribution $p(x | y)$; however, a standard method is to fit a Gaussian to the data.
Here are the steps to do this:
- To find $p(x | y)$, all we need to do for each $y$ is first find all the $x$'s that belong to class $y$
- Then we take the mean and covariance of these $x$'s and save them
- We can then make a classification decision based on Bayes rule by taking the $\arg\max_y$ of $p(y | x)$, which is equal to the $\arg\max_y$ of $p(x | y)\,p(y) / p(x)$. Since $p(x)$ does not depend on $y$, it can be dropped from the comparison
There are a few ways to sample from our Bayes classifier, but one way is that we could pick a class, for example we choose $y = 1$.
We know that $p(x | y = 1)$ is a Gaussian so we can sample from this using SciPy.
Here is the code for this from LazyProgrammer using the MNIST dataset:
As we can see these are all quite blurry, which is what we will be working to fix. That said, the images are still recognizable as digits, so it's a good starting point.
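The fit, classify, and sample steps described above can be sketched as follows. This is not the LazyProgrammer listing: it is a minimal illustration that assumes a synthetic 2-D dataset in place of MNIST and uses NumPy rather than SciPy, and the function names are our own.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for MNIST: two classes of 2-D points (assumption for illustration).
X0 = rng.normal(loc=[-2.0, 0.0], scale=1.0, size=(200, 2))
X1 = rng.normal(loc=[2.0, 1.0], scale=1.0, size=(200, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 200 + [1] * 200)


def fit_bayes_classifier(X, y):
    """For each class, store the mean, covariance, and prior p(y)."""
    params = {}
    for k in np.unique(y):
        Xk = X[y == k]
        params[int(k)] = {
            "mean": Xk.mean(axis=0),
            "cov": np.cov(Xk.T),
            "prior": len(Xk) / len(X),
        }
    return params


def log_gaussian(X, mean, cov):
    """Log-density of a multivariate Gaussian evaluated at each row of X."""
    d = X.shape[1]
    diff = X - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = np.sum(diff @ np.linalg.inv(cov) * diff, axis=1)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)


def predict(params, X):
    """argmax over y of log p(x | y) + log p(y); p(x) is the same for every y."""
    classes = sorted(params)
    scores = np.column_stack([
        log_gaussian(X, params[k]["mean"], params[k]["cov"]) + np.log(params[k]["prior"])
        for k in classes
    ])
    return np.array(classes)[np.argmax(scores, axis=1)]


def sample_given_y(params, k, n, rng):
    """Draw n samples from the fitted Gaussian p(x | y = k)."""
    return rng.multivariate_normal(params[k]["mean"], params[k]["cov"], size=n)


params = fit_bayes_classifier(X, y)
accuracy = np.mean(predict(params, X) == y)
samples = sample_given_y(params, 1, 5, rng)
print(f"training accuracy: {accuracy:.3f}")
```

With image data, each row of `X` would be a flattened image and each sample could be reshaped back into an image for plotting.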
Gaussian Mixture Model
So why was our single Gaussian model blurry?
The reason is that we tried to force a single Gaussian to fit a multi-modal distribution.
What do we mean by multi-modal distribution?
Recall that a mode is the most common value of a random variable. For a continuous random variable with a probability density function (PDF), a mode is any local maximum of the PDF.
When we say multi-modal distribution we just mean a distribution with multiple peaks, as we can see in this figure from Wikimedia:
This makes sense for images of digits since not everyone writes digits the same way, but you can imagine there are a finite number of ways to write a digit.
For example, if 50 people write a 7 one way, and 50 people write it another, we can model this as a bi-modal distribution.
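To make the problem concrete, here is a small sketch with made-up 1-D data (the cluster locations are assumptions chosen for illustration). A single Gaussian fitted to bi-modal data puts its mean between the two peaks, where almost none of the real data lives, which is one way to see why averaged samples look blurry.

```python
import numpy as np

rng = np.random.default_rng(0)

# Bi-modal 1-D data: half the points cluster near -3, half near +3.
x = np.concatenate([rng.normal(-3.0, 0.5, 500), rng.normal(3.0, 0.5, 500)])

# Fitting a single Gaussian amounts to taking the sample mean and std.
mu, sigma = x.mean(), x.std()

# Fraction of the real data that lies within 1 unit of the fitted mean.
frac_near_mean = np.mean(np.abs(x - mu) < 1.0)

print(f"fitted mean={mu:.2f}, std={sigma:.2f}, data near mean={frac_near_mean:.3f}")
```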
How can we model a multi-modal distribution in a generative Bayes classifier?
To do this we can make use of Gaussian Mixture Models.
What makes GMMs useful is that they can fit multiple Gaussians in different proportions to approximate a multi-modal distribution.
Here are a few more important points about GMMs:
- GMM is a latent variable model - unsupervised learning is all about latent variables
- We call the latent variable $z$ and it represents which cluster $x$ belongs to
The final PDF of a GMM looks like our Bayes classifier, for example for a model with 2 clusters:
p(x) = p(z=1) p(x | z=1) + p(z=2) p(x | z=2)
We can think of $p(z)$ as the prior probability that any $x$ belongs to a certain cluster.
$p(z)$ is a categorical (discrete) distribution, and it tells us which cluster an $x$ is likely to belong to before we look at $x$ itself.
Let's say we're given an $x$ and we want to know which cluster it belongs to. We can find $p(z | x)$ using Bayes rule.
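As a minimal sketch of the two formulas above, the following computes the mixture density $p(x)$ and the posterior $p(z | x)$ for a hypothetical 1-D model with 2 clusters. The weights, means, and standard deviations are made up for illustration.

```python
import numpy as np


def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))


# Hypothetical 2-cluster model: these numbers are made up for illustration.
prior = np.array([0.3, 0.7])      # p(z=1), p(z=2)
means = np.array([-2.0, 3.0])
stds = np.array([1.0, 1.5])


def gmm_pdf(x):
    """p(x) = sum over k of p(z=k) p(x | z=k)."""
    return sum(prior[k] * gaussian_pdf(x, means[k], stds[k]) for k in range(2))


def posterior(x):
    """p(z=k | x) = p(z=k) p(x | z=k) / p(x), via Bayes rule."""
    joint = np.array([prior[k] * gaussian_pdf(x, means[k], stds[k]) for k in range(2)])
    return joint / joint.sum()


resp = posterior(-2.0)
print("p(z | x=-2):", resp)
```

A point sitting on the first cluster's mean is assigned to that cluster with high probability, even though the first cluster has the smaller prior.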
Here are a few points about how GMMs are trained:
- GMM is trained using the expectation-maximization (EM) algorithm
- We use EM for latent variables because we can't find a maximum-likelihood solution of the parameters in closed-form
- EM is iterative, but just does maximum-likelihood
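To illustrate the iterative nature of EM, here is a minimal sketch for a 1-D GMM with 2 clusters. The data, the initial guesses, and the fixed number of iterations are all assumptions for illustration; a real implementation would also check for convergence.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-3.0, 1.0, 300), rng.normal(3.0, 1.0, 700)])

# Initial guesses (arbitrary).
pi = np.array([0.5, 0.5])     # cluster proportions p(z)
mu = np.array([-1.0, 1.0])    # cluster means
var = np.array([1.0, 1.0])    # cluster variances

for _ in range(50):
    # E-step: responsibilities r[n, k] = p(z=k | x_n) under the current parameters.
    dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    joint = pi * dens
    r = joint / joint.sum(axis=1, keepdims=True)

    # M-step: responsibility-weighted maximum-likelihood updates.
    nk = r.sum(axis=0)
    pi = nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / nk
    var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk

print("proportions:", pi, "means:", mu, "variances:", var)
```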
Why is this important?
One key component of variational autoencoders is variational inference.
Variational inference is like a Bayesian extension of the expectation-maximization (EM) algorithm.
One of the weaknesses of GMMs is that we have to choose $K$, the number of clusters, and if we choose wrong our model doesn't perform well.
The variational inference version of GMM (VI-GMM), on the other hand, can in principle use an unlimited number of clusters (in practice we set an upper bound). Most of these clusters remain empty, so the VI-GMM effectively finds the number of clusters for you.
In this article we'll make use of scikit-learn's built-in VI-GMM.
To implement this we're just going to take our previous Bayes classifier and replace the single Gaussian with a Variational GMM:
from sklearn.mixture import BayesianGaussianMixture
Here's what we get when we run the Bayes classifier GMM file:
We can see from these images that they are a bit less blurry than the single Gaussian version.
So why should we care about generative modeling in the first place?
If we want to master machine learning, we need to go beyond supervised learning and into unsupervised learning.
Some unsupervised learning algorithms directly support supervised learning and help improve its results.
Similarly, generative modeling may not necessarily solve our problem directly, but it can support our overall efforts.
So what are the applications of generative modeling?
One application is reinforcement learning.
As discussed in our Guide to Reinforcement Learning, in RL an agent must learn by interacting with its environment.
Thus, a key part of reinforcement learning is the environment.
For example, we don't want to have to test a new self-driving car algorithm on the road; instead, we want to simulate the driving experience.
Being able to generate these simulations with generative models is very useful.
If you want to read more about generative models in the context of reinforcement learning, check out this article from OpenAI that showcases four projects.
There are many other applications, a few of which include:
- Image augmentation
- Image-to-Image translation
- Natural language generation
Generative models have been around for a while, although the quality of their samples wasn't that good. What makes variational autoencoders and GANs so interesting is that the samples they generate can be exceptionally good.
Before we get to variational autoencoders, let's quickly review what an autoencoder is:
- An autoencoder is the simplest type of unsupervised neural network we can build
- It is a neural network that predicts (reconstructs) its own input
What is a Variational Autoencoder?
A variational autoencoder (VAE) is a type of neural network that learns to reproduce its input, and also to map data to a latent space.
A VAE can generate samples by first sampling from the latent space.
We will go into much more detail about what that actually means in the remainder of the article.
Let's break this into each term: "variational" and "autoencoder":
As defined earlier, an autoencoder is just a neural network that learns to reproduce its input.
An important feature of autoencoders is that typically we choose a number of hidden units that is less than the number of inputs.
This feature creates a bottleneck as it forces the neural network to learn a compact representation of the data.
What does this bottleneck actually mean?
Suppose we teach a neural network to reproduce its input. Let's say we have an image input of 784 dimensions and there are 100 hidden units.
This means we've learned to represent the image as a much smaller code of 100 numbers. In other words, many of the 784 numbers were redundant.
The "true" amount of information from the image must have then been less than 784 numbers.
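The bottleneck idea can be sketched with a tiny linear autoencoder. To keep the example fast, it assumes synthetic data with 20 observed numbers generated from only 3 underlying factors (instead of 784 pixels and 100 hidden units), and it omits biases and nonlinearities. The sizes, learning rate, and step count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 20 observed numbers per sample, generated from only 3
# underlying factors, so most of the 20 numbers are redundant.
N, D, H = 200, 20, 3
X = rng.normal(size=(N, H)) @ rng.normal(size=(H, D))

# Linear autoencoder with a bottleneck of H hidden units.
W_enc = 0.1 * rng.normal(size=(D, H))
W_dec = 0.1 * rng.normal(size=(H, D))

losses = []
lr = 0.02
for _ in range(3000):
    code = X @ W_enc            # compressed representation, shape (N, H)
    X_hat = code @ W_dec        # reconstruction, shape (N, D)
    err = X_hat - X
    losses.append(np.mean(err ** 2))

    # Gradients of the mean squared reconstruction error.
    grad_dec = code.T @ err * (2 / err.size)
    grad_enc = X.T @ (err @ W_dec.T) * (2 / err.size)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

print(f"reconstruction MSE: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

The network never sees the 3 underlying factors, only the 20 observed numbers, yet a 3-unit code is enough to reduce the reconstruction error.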
Now what does "variational" mean?
Variational refers to variational inference or variational Bayes.
These techniques fall into the category of Bayesian machine learning.
One way to think about variational inference is that it's an extension of the expectation-maximization (EM) algorithm that we saw earlier.
The EM algorithm is used when we have a latent variable model, in which we can't maximize $p(x)$ directly. An example of this is the Gaussian Mixture Model we saw earlier.
EM gives us a point estimate of the parameters; in other words, it can be seen as a frequentist statistical method.
What variational inference does is extend this idea to the Bayesian realm where, instead of learning point estimates of parameters, we learn distributions over the parameters.
While it is recommended to learn more about variational inference, it is not actually required to understand the implementation of variational autoencoders.
To summarize, variational autoencoders combine autoencoders with variational inference.
Let's now look at the architecture of variational autoencoders.
Variational Autoencoder Architecture
As we know, a VAE is a neural network that comes in two parts: the encoder and the decoder.
These are split at the middle layer, which as discussed is typically smaller than the input size.
We typically call the values at this hidden layer $z$, and they represent the latent variable representation of the input data.
So how does data flow through this neural network?
One key point is that it does not work like a traditional autoencoder, which works like a feedforward neural network (with one hidden layer).
With variational autoencoders something different happens at the end of the encoder. In particular, we don't get a value but instead get a distribution. More precisely, we get the parameters of a distribution.
Recall that Bayesian machine learning is all about learning distributions instead of learning point estimates.
So instead of finding $z$, we are finding $q(z)$, which tells us the PDF of $z$.
That's the first half of the VAE. Now we have to go through the decoder.
We now have a distribution $q(z)$, and from this we need actual numbers to pass through the rest of the neural network.
To do this we draw a sample from $q(z)$.
Now that we have a sample z vector we can pass it through the decoder as usual: we multiply it by the weights, add a bias, and apply an activation function.
When we get to the output of the decoder we once again have a distribution.
Let's assume our input is a binary variable, so our output is also a binary variable - in other words they only take the values 0 and 1.
This is the Bernoulli distribution, which is represented by one parameter $p$ that tells us the probability of getting a 1.
As we know, a sigmoid gives us a value between 0 and 1, so sigmoid is the appropriate activation function here: it lets the output of the decoder represent Bernoulli distributions.
To summarize, the output of the decoder represents a probability distribution.
Since the output is a distribution, this affects how we use it.
With binary classification in regular neural networks we just round the probability to get a prediction.
With variational autoencoders, the paradigm is different.
At the output of the decoder we have a distribution - and from this distribution we can generate samples.
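The difference between rounding and sampling can be sketched in a few lines. The decoder activations below are made-up numbers for a 6-pixel "image", not the output of a trained network.

```python
import numpy as np

rng = np.random.default_rng(0)


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


# Hypothetical final-layer activations of a decoder for a 6-pixel "image".
logits = np.array([-3.0, -0.5, 0.0, 0.5, 2.0, 4.0])
p = sigmoid(logits)                    # Bernoulli parameter for each pixel

prediction = (p > 0.5).astype(int)     # classifier-style rounding: deterministic
sample_a = rng.binomial(1, p)          # one random draw from Bernoulli(p)
sample_b = rng.binomial(1, p)          # another draw, which may differ

print("p:         ", np.round(p, 2))
print("prediction:", prediction)
print("samples:   ", sample_a, sample_b)
```

Rounding gives the same answer every time, while sampling produces a different binary image on each draw, with pixels near $p = 0.5$ varying the most.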
To summarize the forward pass of a variational autoencoder:
- A VAE is made up of 2 parts: an encoder and a decoder
- The end of the encoder is a bottleneck, meaning the dimensionality is typically smaller than the input
- The output of the encoder $q(z)$ is a Gaussian that represents a compressed version of the input
- We draw a sample from $q(z)$ to get the input of the decoder
- We then pass this through the neural network as usual
- At this point we're going to treat the input and outputs as binary variables - meaning the output distribution is Bernoulli - and we use a sigmoid activation function to get the parameters
- Our use of the distribution is to generate samples, instead of rounding for a prediction like in normal binary classification
Parameterizing a Gaussian
As discussed the output of the encoder is going to be a distribution, rather than a value.
Before we continue, let's demonstrate this point.
Let's say our encoder has the sizes (4, 3, 2), which means:
- An input dimensionality of 4
- A hidden layer of size 3
- A latent vector of size 2
This means at the end $q(z)$ is a 2-D Gaussian.
So how can we parameterize this Gaussian?
To do this we can make the final layer size double what it's specified as - in our case we need a final layer size of 4 for a 2-D Gaussian.
The first 2 components represent the mean, and the last 2 represent the standard deviation.
There is one more important problem - we know that the standard deviation σ must be a strictly positive number, but a neural network can output any number.
A simple solution to fix this is to use the softplus activation function - the reason is that it is smooth, continuous, differentiable, and always greater than 0.
Here is what we get when we run parameterize_gassian.py:
mean: [0.02381251 1.01605719]
Here is a summary of this code:
- We're doing a forward pass through a neural network as usual
- The output of the neural network is not a single value, but the parameters of a distribution
- From this distribution $q(z)$ we can generate samples, which is what we'll do in the variational autoencoder
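A minimal NumPy sketch of this parameterization is shown below. It is our own illustration rather than the script named above: the weights are random and untrained, the tanh hidden activation is an arbitrary choice, and the sizes follow the (4, 3, 2) example.

```python
import numpy as np

rng = np.random.default_rng(0)


def softplus(a):
    """log(1 + exp(a)), computed in a numerically stable way."""
    return np.logaddexp(0.0, a)


# Layer sizes (4, 3, 2): 4 inputs, 3 hidden units, 2 latent dimensions.
# The final layer has 2 * 2 = 4 outputs: 2 means followed by 2 standard deviations.
W1, b1 = rng.normal(size=(4, 3)), np.zeros(3)
W2, b2 = rng.normal(size=(3, 4)), np.zeros(4)

x = rng.normal(size=(1, 4))            # one random input vector
hidden = np.tanh(x @ W1 + b1)
out = hidden @ W2 + b2                 # shape (1, 4), any real numbers

mean = out[:, :2]
std = softplus(out[:, 2:])             # softplus maps the raw outputs to positive values

# Draw 1000 samples of z from q(z) = N(mean, diag(std**2)).
z = rng.normal(loc=mean, scale=std, size=(1000, 2))

print("mean:", mean[0], "std:", std[0])
print("sample mean:", z.mean(axis=0), "sample std:", z.std(axis=0))
```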
Latent Space, Predictive Distributions, and Samples
An important concept that applies to variational autoencoders and regular autoencoders is latent space.
Recall that our neural network is made up of two steps: encoder + decoder.
- The encoder is responsible for turning the input data $x$ into a different vector $z$, or a "coded" version of $x$
- This coded version of $x$ lives in "latent space"
- Taking a vector in this latent space we can pass it through the decoder and get back x_hat - which is an image that looks like the original
We can think of this as a compression and a decompression operation.
To compress an image we transfer it into some code, and we decompress it by converting it back into an image.
Why is this important?
This idea is again borrowed from Bayesian machine learning, in particular the:
- Posterior predictive distribution
- Prior predictive distribution
From a distribution we can generate samples:
- Posterior predictive sample
- Prior predictive sample
For the posterior predictive sample we follow the steps already described:
- We pass an input image into an encoder, which gets mapped to a Gaussian distribution $q(z | x)$
- We sample from $q(z | x)$ to get $z$
- We pass $z$ through the decoder to get a distribution p(x_hat | x) for the reconstructed version of the image
- We can then sample an image from this output distribution
Our other option is prior predictive sampling, in which:
- Instead of getting $z$ from the input $x$, we can sample $z$ from a standard normal
- Once we get $z$ we can pass it through the decoder as usual
By doing this we can get an image that looks like it's from the training data, called a prior predictive sample.
Because we're sampling from the standard normal distribution, we don't know which digit we're going to get.
The key for this method is that we build our optimization algorithm in a way that encourages the encoder to map the training data around the standard normal distribution.
So when we sample from the standard normal it should represent something from the training data.
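The two sampling procedures can be sketched side by side. The encoder and decoder below are untrained single-layer stand-ins with random weights, so the outputs are meaningless as images. The point is only the data flow, and all names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D, M = 8, 2   # input size and latent size (arbitrary small numbers)

# Untrained, randomly initialised single-layer encoder and decoder: stand-ins only.
W_enc = rng.normal(size=(D, 2 * M))
W_dec = rng.normal(size=(M, D))


def encode(x):
    """Return the mean and standard deviation of q(z | x)."""
    out = x @ W_enc
    return out[:M], np.logaddexp(0.0, out[M:])   # softplus gives a positive std


def decode(z):
    """Return the Bernoulli means of the output distribution."""
    return 1.0 / (1.0 + np.exp(-(z @ W_dec)))


def posterior_predictive_sample(x):
    mean, std = encode(x)                 # x -> q(z | x)
    z = rng.normal(mean, std)             # z ~ q(z | x)
    return rng.binomial(1, decode(z))     # sample an image from the output distribution


def prior_predictive_sample():
    z = rng.standard_normal(M)            # z ~ N(0, I), no input needed
    return rng.binomial(1, decode(z))


x = rng.binomial(1, 0.5, size=D)          # a fake binary "image"
x_post = posterior_predictive_sample(x)
x_prior = prior_predictive_sample()
print("input:               ", x)
print("posterior predictive:", x_post)
print("prior predictive:    ", x_prior)
```

The only difference between the two functions is where $z$ comes from: the encoder's distribution for a given input, or the standard normal prior.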
Training an Autoencoder
Machine learning models typically have 2 main functions that we're interested in: learning and inference.
In the context of deep learning, inference generally refers to the forward direction.
In other words, making predictions in the context of supervised learning, and transforming data into latent representations in the context of unsupervised learning.
In scikit-learn, learning is typically done with fit(X, y), and inference with methods such as predict(X) or transform(X).
The next step for understanding variational autoencoders is to discuss fitting and training.
What we want to do is define a cost function, and then try and minimize it.
The cost function for a variational autoencoder will look a little strange compared to what we're used to.
The objective function we want to optimize is called the "ELBO", or the evidence lower bound:
In statistics, the evidence lower bound (ELBO, also called the variational lower bound) is a lower bound on the log-likelihood of the observed data, $\log p(x)$, that remains computable when $\log p(x)$ itself is intractable.
If you want to learn the math behind ELBO, check out this great article on the subject.
In TensorFlow the optimizer only has a minimizer function, so we're going to minimize the negative of ELBO.
Implementing a Variational Autoencoder in TensorFlow
Here is the basic outline of how we're going to implement a variational autoencoder in TensorFlow:
- VAE is a neural network with a cost function, and we want to minimize it via gradient descent
- We will build a VariationalAutoencoder class with a fit(X) method
- The fit(X) function performs gradient descent on the cost function we've defined
- We will then plot the cost to confirm that it converges (decreases)
In order to build the cost function we need to define how to go from the input to the reconstruction.
We already saw how to write the encoder and go from an input $x$ to the Gaussian parameters of $q(z | x)$.
We know that the next step is to get a sample $z$ from the Gaussian and then feed that sample to the decoder to get the reconstruction p(x_reconstruction = 1 | x).
The tricky part is taking the sample from the distribution, because drawing a random sample is not a differentiable operation, so gradients cannot flow back through it to the encoder.
So how can we make the parameters of the encoder differentiable after drawing a sample?
To do this we can make use of a few special library functions in TensorFlow.
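One common way to achieve this is the reparameterization trick: sample noise that does not depend on the parameters, then transform it deterministically. Whether the TensorFlow functions referred to above do exactly this internally is an assumption on our part. The sketch below uses NumPy and made-up encoder outputs just to show the idea.

```python
import numpy as np

rng = np.random.default_rng(0)

mu, sigma = 1.5, 0.5          # hypothetical encoder outputs for one latent dimension

# Instead of sampling z ~ N(mu, sigma^2) directly, sample noise that does not
# depend on the parameters, then transform it deterministically.
eps = rng.standard_normal(100_000)
z = mu + sigma * eps

# z has the intended distribution...
print("sample mean:", z.mean(), "sample std:", z.std())

# ...and for a fixed eps it is a differentiable function of mu and sigma:
# dz/dmu = 1 and dz/dsigma = eps. A finite-difference check on one sample:
h = 1e-6
dz_dmu = ((mu + h) + sigma * eps[0] - z[0]) / h
dz_dsigma = (mu + (sigma + h) * eps[0] - z[0]) / h
print("dz/dmu:", dz_dmu, "dz/dsigma:", dz_dsigma, "eps:", eps[0])
```

Because the randomness lives entirely in `eps`, an automatic differentiation library can treat `eps` as a constant and backpropagate through `mu` and `sigma`.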
The next question is how do we build the cost function?
We didn't get into the math earlier, but there are 2 parts for the cost function:
- The expected log-likelihood
- The KL-divergence
Here are the important points about expected log-likelihood:
- This is binary cross-entropy, or more accurately negative binary cross-entropy, between the input data $x$ and the reconstruction
- We can make use of TensorFlow's binary cross-entropy function
Next we have the KL-divergence:
- TensorFlow provides a function that takes in two distribution objects as arguments and outputs the KL-divergence
Once we have the expected log-likelihood and the KL-divergence we can calculate the ELBO, and from there it is straightforward to create an optimizer to do gradient descent.
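As a library-independent sketch of how the two parts combine, the following computes the negative ELBO in NumPy. It assumes a Bernoulli output, a diagonal Gaussian $q(z | x)$, and a standard normal prior, for which the KL-divergence has a closed form. The expected log-likelihood is approximated with a single set of Bernoulli means, and the numbers are made up for illustration.

```python
import numpy as np


def expected_log_likelihood(x, p, eps=1e-7):
    """Negative binary cross-entropy between binary x and Bernoulli means p."""
    p = np.clip(p, eps, 1 - eps)
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))


def kl_to_standard_normal(mean, std):
    """KL( N(mean, diag(std^2)) || N(0, I) ), in closed form."""
    return 0.5 * np.sum(std ** 2 + mean ** 2 - 1.0 - 2.0 * np.log(std))


def negative_elbo(x, p, mean, std):
    """Cost to minimise: -(expected log-likelihood - KL)."""
    return -(expected_log_likelihood(x, p) - kl_to_standard_normal(mean, std))


# Made-up numbers for a 4-pixel input and a 2-D latent space.
x = np.array([1.0, 0.0, 1.0, 1.0])
p = np.array([0.9, 0.2, 0.7, 0.6])
mean = np.array([0.3, -0.1])
std = np.array([0.8, 1.1])

cost = negative_elbo(x, p, mean, std)
print("negative ELBO:", cost)
```

The KL term is zero only when $q(z | x)$ equals the standard normal, which is what pulls the encoder's outputs toward the prior.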
Finally, what functions do we want the VAE to compute?
We know that we want to get x_reconstructed from the input $x$ - recall that we call this the posterior predictive sample.
We also want to draw a sample from $p(z) = N(0, 1)$, which is a standard normal, and then generate a sample from that. We call this the prior predictive sample.
Finally, we will also want a function to map each part of the latent space of $z$ to an image, and for this we will be using the Bernoulli means.
- We need to build the forward operation using a stochastic tensor to sample
- From there we build the cost function using built-in TensorFlow functions and optimize it
- We also want our autoencoder to have some standard operations for posterior predictive samples and prior predictive samples
Let's implement a variational autoencoder in TensorFlow with vae_tf.py:
Summary of Variational Autoencoders
Variational autoencoders combine techniques from deep learning and Bayesian machine learning, specifically variational inference.
Variational autoencoders learn how to do two main things:
- Reconstruct the input data
- Learn a compact and efficient representation of the data, which the bottleneck forces the autoencoder to do
One of the major differences between variational autoencoders and regular autoencoders is that since VAEs are Bayesian, what we're representing at each layer of interest is a distribution.
At the end of the encoder we have a Gaussian distribution, and at the input and output we have Bernoulli distributions.
The forward operation of a VAE is a little different than a regular neural network - when we get to the end of the encoder we draw a sample from the Gaussian and pass the sample through the decoder.
The cost function of a VAE is the combination of two terms: the expected log likelihood and the KL-divergence.
Expected log-likelihood is responsible for the reconstruction penalty, and KL divergence is responsible for the regularization penalty.
Next we looked at the cost function from a probabilistic perspective, and defined that we want to accurately approximate $p(z | x)$ with $q(z | x)$.
To accomplish this we derived 2 expressions for the ELBO (evidence lower bound), each of which helps us in a different way:
- RHS is the cost function which we can calculate
- LHS shows us why ELBO is an appropriate objective - as it increases, the KL divergence between $q(z | x)$ and $p(z | x)$ decreases, which was our original goal
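Written out in the standard form, the two expressions for the ELBO are the two sides of the identity below. This is a reconstruction that assumes a prior $p(z)$ and a decoder likelihood $p(x | z)$, and the notation is ours rather than the course's.

```latex
\log p(x) - \mathrm{KL}\big(q(z \mid x) \,\|\, p(z \mid x)\big)
  = \mathbb{E}_{q(z \mid x)}\big[\log p(x \mid z)\big]
    - \mathrm{KL}\big(q(z \mid x) \,\|\, p(z)\big)
```

The right-hand side contains only quantities we can evaluate: the expected log-likelihood and the KL-divergence to the prior. On the left-hand side, $\log p(x)$ does not depend on $q$, so increasing the ELBO can only come from shrinking the KL-divergence between $q(z | x)$ and the true posterior $p(z | x)$.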
In summary, variational autoencoders let us build complex generative models of data, and yield state-of-the-art performance for image generation.