In this guide, we're going to discuss an exciting field of unsupervised learning: Generative Adversarial Networks (GANs). This article is based on notes from this course on Deep Learning: GANs and Variational Autoencoders.

GANs were originally proposed in 2014 by Ian Goodfellow when he was a student of Yoshua Bengio's at the University of Montreal.

There's been a lot of excitement around GANs, as Facebook’s AI research director Yann LeCun called adversarial training (GANs in particular) “the most interesting idea in the last 10 years in ML”.

The reason for this excitement is that GANs are extremely good at generating realistic samples, particularly of images.

Here are a few examples of the interesting applications of GANs:

If you want to see more examples of applications of GANs check out this article on the subject.

This guide to GANs is organized as follows:

  1. What are Generative Adversarial Networks (GANs)?
  2. GAN Cost Functions
  3. DCGANs
  4. Keras Implementation of a GAN
  5. Summary of GANs

Let's start with a high-level discussion of GANs.

1. What are Generative Adversarial Networks (GANs)?

A GAN is a collection of two different neural networks: one of which we call the "generator" network, and the other the "discriminator" network.

The idea is that these two neural networks are going to duel with each other.

The role of the generator is to try and fool the discriminator, which it can do by generating realistic samples.

The discriminator's job is to classify samples as either real images or images produced by the generator.

We can see that this is quite a simple concept and that GANs don't contain any components that we haven't already discussed.

As we saw in our earlier guide to Variational Autoencoders, both the Bayes Classifier and Variational Autoencoder deal with probability distributions.

With GANs, on the other hand, we aren't dealing with explicit distributions.

The goal when training a GAN is to reach a Nash equilibrium of a game.

Measuring Quality

Another important distinction with GANs is that we don't have a numerical measure for the quality of samples.

Usually, in machine learning we have a loss function like the negative log-likelihood that we want to optimize. We can then tell how good a model is by looking at its loss.

For example, in supervised learning, we can compare the $R^2$ or accuracy, which is objective.

Assessing the quality of a sample, on the other hand, is very subjective.

For example, asking "does this photo look real?" is subjective.

We know that GANs produce better samples than Variational Autoencoders, but we don't have a number for how much better. Instead, we can just tell by looking at them.

With GANs we have to get used to the idea that we need to use our senses to determine how well our model works.

2. GAN Cost Functions

So what should we use for the GAN cost function?

Since we have 2 neural networks, should we have 2 cost functions?

To answer these questions, let's start with the idea that the Generator and the Discriminator are trying to optimize the "opposite" thing.

Let's start with the Discriminator, which is doing classification with supervised learning.

What type of classification does the discriminator do?

The discriminator is going to receive two types of images: "real" and "fake".

These are 2 different labels, so we're going to be doing binary classification.

As we saw in the article on Variational Autoencoders, the proper cost function for binary classification is binary cross-entropy.

As we can see from this ML cheatsheet on loss functions, in binary classification, where the number of classes J equals 2, cross-entropy can be calculated as:

\[-{(y\log(p) + (1 - y)\log(1 - p))}\]
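As a rough illustration of the formula above, here is a minimal NumPy sketch of binary cross-entropy averaged over a small batch. The function name, the clipping constant `eps`, and the example probabilities are made up for illustration. The labelling (1 for real, 0 for fake) matches the Keras code later in this guide.

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    """Mean binary cross-entropy between labels y (0 or 1) and predicted probabilities p."""
    y = np.asarray(y, dtype=float)
    # Clip probabilities so that log(0) never occurs
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    return float(np.mean(-(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))))

# Two real images (label 1) and two fake images (label 0)
labels = [1, 1, 0, 0]
confident_and_right = [0.9, 0.8, 0.1, 0.2]
confident_and_wrong = [0.1, 0.2, 0.9, 0.8]

low_loss = binary_cross_entropy(labels, confident_and_right)
high_loss = binary_cross_entropy(labels, confident_and_wrong)
print(low_loss, high_loss)
```

A discriminator that assigns high probability to the correct label gets a small loss, while one that is confidently wrong gets a large one.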

Now let's move on to the generator.

We denote the generator by $G$ and its output for an input $z$ by $G(z)$.

Here $z$ represents a latent variable drawn from a prior, which is the same type of graphical structure that we had with VAEs.

There are two steps to sample from this:

  1. First, we sample $z$ from $p(z)$
  2. We then feed that into $G$ such that $\hat{x} = G(z)$
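The two sampling steps above can be sketched in NumPy as follows. This is only an illustration: the "generator" here is a hypothetical single random linear layer followed by tanh, not a trained network, and the sizes `latent_dim` and `image_dim` as well as the choice of a standard normal prior are assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

latent_dim = 100      # size of z, assumed for illustration
image_dim = 28 * 28   # flattened output size, also an assumption

# Hypothetical stand-in for a trained generator: one random linear layer plus tanh
W = rng.normal(scale=0.1, size=(latent_dim, image_dim))

def G(z):
    return np.tanh(z @ W)

# Step 1: sample z from the prior p(z), taken here to be a standard normal
z = rng.standard_normal(size=(5, latent_dim))

# Step 2: feed z through G to obtain the generated samples x_hat
x_hat = G(z)
print(x_hat.shape)  # (5, 784)
```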

If you want to learn more about the mathematics for GANs check out this article.

All we need to know is that there are two optimizations that need to happen:

  • The discriminator wants to discriminate between real and fake images so that it minimizes the negative log-likelihood (the binary cross-entropy)
  • The generator wants to fool the discriminator, and it does this by just trying to maximize the discriminator's cost

So we can say the cost function for the generator is the negative of the cost function for the discriminator: $J(G) = -J(D)$.

In game theory we call this a "zero-sum game" because the sum of all players' costs is always 0.

Zero-sum games are also called "minimax" games because the solution involves both a min and a max.
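To make the zero-sum relationship concrete, here is a small sketch that evaluates both costs on made-up discriminator outputs. It assumes the discriminator cost is the cross-entropy with target 1 for real samples and target 0 for generated ones (some references scale it by a constant factor). The function names are hypothetical.

```python
import numpy as np

def discriminator_cost(d_real, d_fake):
    """J(D): cross-entropy with target 1 for real samples and target 0 for generated ones."""
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    return float(-np.mean(np.log(d_real)) - np.mean(np.log(1.0 - d_fake)))

def generator_cost_minimax(d_real, d_fake):
    """J(G) = -J(D) in the zero-sum formulation."""
    return -discriminator_cost(d_real, d_fake)

# Made-up discriminator outputs: D(x) on real images and D(G(z)) on generated ones
d_real = [0.9, 0.8, 0.95]
d_fake = [0.2, 0.1, 0.3]

j_d = discriminator_cost(d_real, d_fake)
j_g = generator_cost_minimax(d_real, d_fake)
print(j_d, j_g, j_d + j_g)
```

Whatever values the discriminator outputs, the two costs cancel, which is what makes this a zero-sum game.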

Once we have the cost function, the next step is to minimize it using one of the gradient descent optimizers available in TensorFlow.

This is an interesting situation as we have 2 different neural networks and 2 different costs in the same script, so we need 2 different optimizers.

A Better Cost Function

Before moving onto the code, there is actually a problem with the cost function defined above.

The problem with the cost function comes from the perspective of the generator.

If the discriminator is very successful at telling the difference between real and fake images, this means that $D(G(z))$ is very close to 0.

As described in this lecture on GANs:

If we were to change the generator’s weights just a little bit, then $J(G)$ would still be close to 0. This means we’re in a plateau of the minimax cost function, i.e. the generator’s gradient is close to 0, and it will hardly get updated.

In other words, when the discriminator is too good, the generator improves less and less.

The solution to this is to use a different cost function for the generator - the main idea is to "flip the target".

Instead of the target being 0 for fake images, the generator wants the target to be 1.

Check out this lecture if you want to see the math for the new generator cost function.

Since $J(G) + J(D) \neq 0$, this is no longer a zero-sum game.

This new generator cost function is referred to as a "non-saturating heuristic".

This means the generator's gradient doesn't saturate: it doesn't shrink toward 0 when the discriminator confidently rejects the generated samples.

It is "heuristic" because it's just made up to solve a numerical issue, as opposed to deriving this cost theoretically.
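The difference between the two generator costs can be checked numerically. The sketch below assumes the discriminator ends in a sigmoid, so that $D(G(z)) = \sigma(a)$ for some logit $a$, and compares the gradient of each cost with respect to $a$. The function names are made up for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def grad_minimax(a):
    # Minimax generator cost: log(1 - sigmoid(a)); its derivative w.r.t. a is -sigmoid(a)
    return -sigmoid(a)

def grad_non_saturating(a):
    # Flipped-target generator cost: -log(sigmoid(a)); its derivative w.r.t. a is sigmoid(a) - 1
    return sigmoid(a) - 1.0

# A very negative logit means the discriminator is confident the sample is fake
a = -10.0
print("D(G(z))             :", sigmoid(a))
print("minimax gradient    :", grad_minimax(a))
print("non-saturating grad :", grad_non_saturating(a))
```

When $D(G(z))$ is close to 0, the minimax gradient is also close to 0, while the flipped-target gradient stays close to -1, so the generator still receives a useful learning signal.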


3. DCGANs

Let's discuss the architecture of GANs in more depth.

All we know so far is that we have 2 neural networks, but what kind of neural networks should we use?

In this article, we're focused on GANs for images, so both feedforward neural networks and convolutional neural networks are appropriate.

One very successful type of GAN was proposed in 2015: DCGAN.

Most GAN architectures today are based on DCGAN, so let's discuss it in more detail.

DCGANs are known for being able to produce high-quality, high-resolution images in a single pass.

Before DCGANs, LAPGANs could generate high-resolution images, but the process for doing so was more complex.

Here's an excerpt from the original DCGAN paper:

We introduce a class of CNNs called deep convolutional generative adversarial networks (DCGANs), that have certain architectural constraints, and demonstrate that they are a strong candidate for unsupervised learning. Training on various image datasets, we show convincing evidence that our deep convolutional adversarial pair learns a hierarchy of representations from object parts to scenes in both the generator and discriminator.

Here are a few of the features that differentiate DCGANs from other architectures:

  • Batch Normalization
  • All-Convolutional Network
  • Adam Optimizer
  • Leaky ReLU

Batch Normalization

DCGANs use batch normalization. Recall that before we input data into many machine learning algorithms, we want to normalize the data first.

Normalization means subtracting the mean and dividing by the standard deviation: $z = (x - \mu) / \sigma$.

In other words, we make sure the data has a mean of 0 and a variance of 1.

With batch normalization, instead of manually normalizing data first, we do normalization at every layer of the neural network.

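We can implement batch normalization in TensorFlow 1.x with tf.nn.batch_normalization or tf.contrib.layers.batch_norm. In TensorFlow 2.x, where tf.contrib was removed, tf.keras.layers.BatchNormalization plays the same role.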

To summarize, batch normalization is essentially doing preprocessing at every layer of the network.
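For intuition, here is a minimal NumPy sketch of the batch normalization forward pass at training time. It assumes statistics are computed over the batch axis only and omits the running averages used at inference. The scale `gamma`, shift `beta`, and the small constant `eps` are the usual learnable and stabilizing terms, named here by assumption.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then apply a learnable scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(seed=0)

# A batch of 64 activations with 4 features, deliberately off-center and spread out
activations = rng.normal(loc=5.0, scale=3.0, size=(64, 4))

normalized = batch_norm(activations, gamma=np.ones(4), beta=np.zeros(4))
print(normalized.mean(axis=0))  # close to 0 for every feature
print(normalized.std(axis=0))   # close to 1 for every feature
```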

All-Convolutional Network

Another unique feature of DCGANs is that they borrow their architecture from something called the All-Convolutional Network.

This type of network simplifies the traditional LeNet (a series of convolutional and pooling layers, followed by a few fully connected layers).

Adam Optimizer

Another feature of DCGAN is that it uses the Adam optimizer, which is an adaptive gradient descent algorithm like AdaGrad and RMSprop.
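As a rough sketch of what "adaptive" means here, a single Adam update for one parameter can be written as below. The defaults for `lr` and `beta1` mirror the `Adam(0.0002, 0.5)` call used in the Keras code later in this guide. The function name and the remaining defaults are assumptions based on the commonly used values.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=0.0002, beta1=0.5, beta2=0.999, eps=1e-8):
    """One Adam update. m and v are running averages of the gradient and its square."""
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    # Bias correction compensates for m and v starting at zero
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# On the first step the update size is roughly lr, regardless of the gradient's scale
param, m, v = adam_step(param=1.0, grad=3.0, m=0.0, v=0.0, t=1)
print(param)
```

Because the step is divided by a running estimate of the gradient's magnitude, each parameter gets its own effective learning rate.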

Check out this article from Machine Learning Mastery to learn more about the Adam optimization algorithm.

Leaky ReLU

It has been recommended to use the Leaky ReLU for the discriminator, and normal ReLU for the generator.

The Leaky ReLU solves the problem of having "dead neurons" - this happens when the output of the node is 0, so the gradient is 0 and gradient descent has no effect.
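A minimal NumPy sketch of the Leaky ReLU and its gradient is shown below. The slope `alpha=0.2` for negative inputs is an assumption here (it is a commonly used value), and the function names are made up for illustration.

```python
import numpy as np

def leaky_relu(x, alpha=0.2):
    """Identity for positive inputs, a small linear slope for negative inputs."""
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.2):
    """The gradient is 1 for positive inputs and alpha (not 0) for negative inputs."""
    return np.where(x > 0, 1.0, alpha)

x = np.array([-2.0, -0.5, 0.5, 3.0])
outputs = leaky_relu(x)
grads = leaky_relu_grad(x)
print(outputs)
print(grads)
```

Since the gradient for negative inputs is `alpha` rather than 0, a unit that lands in the negative region can still be updated by gradient descent.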

Visualizing a DCGAN

Let's look at a photo of the DCGAN from Gluon:

We can see the Generator starts with a 100-dimensional latent vector $z$, projects and reshapes the object, and then does a series of convolutions until we get to a 64 x 64 image.

For the Discriminator we start with a 64 x 64 image, we do a series of convolutions and then do a binary logistic regression at the end.

One thing you may note is that convolutions usually result in an image that's smaller than the original.

So how does the generator do convolutions and produce an image that's bigger?

Enter "deconvolution".

What is deconvolution?

Deconvolution is different from other types of neural network layers because it goes from smaller to larger images at each layer.

With regular convolution, we either get something that is the same size or smaller than the original.

If you've ever used Photoshop before you know you can't just enlarge an image without it getting blurrier, so how is the Generator capable of generating larger images at each layer?

The reason is that the tensor at each layer is very "fat": the closer you go to the beginning of the Generator, the more feature maps (channels) it has, meaning more data is contained in the feature maps.

These feature maps contain a lot of information.

At each layer, we're transferring data from feature maps into the spatial dimensions of the image.
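The trade between depth and spatial size can be traced with a few lines of Python. The start and end sizes follow the description above (a projection of $z$ that ends in a 64 x 64 image). The intermediate channel counts are an assumption, taken from the configuration commonly shown for the DCGAN generator.

```python
# (height, width, channels) at each generator stage; channel counts are assumptions
stages = [(4, 4, 1024), (8, 8, 512), (16, 16, 256), (32, 32, 128), (64, 64, 3)]

for h, w, c in stages:
    print(f"{h:>2} x {w:>2} x {c:>4}")

# Each stage doubles the height and width of the previous one
spatial_doubles = all(
    b[0] == 2 * a[0] and b[1] == 2 * a[1] for a, b in zip(stages, stages[1:])
)

# Every hidden stage also halves the number of feature maps
channels_halve = all(b[2] * 2 == a[2] for a, b in zip(stages[:-2], stages[1:-1]))

print(spatial_doubles, channels_halve)
```

The image gets wider and taller while it gets thinner, until the last layer maps to 3 color channels.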

So what kind of convolution do we need to obtain a result that is larger than the input?

To do this we use a fractionally-strided convolution.

If we do a convolution with a stride of 2, the result will be 1/2 the size of the original along each spatial dimension.

So if we do a convolution with a stride of 1/2, the result will be 2x the size of the original.
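To see how a "stride of 1/2" can enlarge its input, here is a one-dimensional NumPy sketch of a transposed convolution. It is a simplified illustration rather than TensorFlow's implementation: each input value is multiplied by the kernel and written into the output at positions spaced `stride` apart, with overlaps summed. The function name is hypothetical.

```python
import numpy as np

def conv_transpose_1d(x, kernel, stride=2):
    """1-D transposed convolution without padding."""
    n, k = len(x), len(kernel)
    out = np.zeros((n - 1) * stride + k)
    for i, value in enumerate(x):
        # Each input element "paints" a scaled copy of the kernel into the output
        out[i * stride : i * stride + k] += value * kernel
    return out

x = np.array([1.0, 2.0, 3.0])
kernel = np.array([1.0, 1.0])

upsampled = conv_transpose_1d(x, kernel, stride=2)
print(upsampled)  # [1. 1. 2. 2. 3. 3.]
```

With a stride of 2 and a kernel of length 2, an input of length 3 becomes an output of length 6, which is the doubling described above.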

To implement this in TensorFlow we can't use the existing convolutional functions with a fractional stride (a float), as we'll get an error saying that it only accepts integers.

Instead, we're going to use:

tf.nn.conv2d_transpose(value, filter, output_shape, strides, padding='SAME', name=None)

Let's run the TensorFlow implementation of DCGAN from Lazy Programmer:


4. Keras Implementation of a Generative Adversarial Network

Here's an example of a GAN implemented in Keras:

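# source:
# Imports assume the Keras API bundled with TensorFlow 2.x
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.datasets import mnist
from tensorflow.keras.layers import Input, Dense, Reshape, Flatten
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.optimizers import Adam


class GAN():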
    def __init__(self):
        self.img_rows = 28
        self.img_cols = 28
        self.channels = 1
        self.img_shape = (self.img_rows, self.img_cols, self.channels)

        optimizer = Adam(0.0002, 0.5)

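        # Build and compile the discriminator. It gets its own optimizer instance,
        # since recent Keras versions may reject one optimizer shared across models.
        # The accuracy metric is needed for the progress printout in train().
        self.discriminator = self.build_discriminator()
        self.discriminator.compile(loss='binary_crossentropy',
                                   optimizer=Adam(0.0002, 0.5),
                                   metrics=['accuracy'])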

        # Build and compile the generator
        self.generator = self.build_generator()
        self.generator.compile(loss='binary_crossentropy', optimizer=optimizer)

        # The generator takes noise as input and generates imgs
        z = Input(shape=(100,))
        img = self.generator(z)

        # For the combined model we will only train the generator
        self.discriminator.trainable = False

        # The discriminator takes generated images as input and determines validity
        valid = self.discriminator(img)

        # The combined model  (stacked generator and discriminator) takes
        # noise as input => generates images => determines validity
        self.combined = Model(z, valid)
        self.combined.compile(loss='binary_crossentropy', optimizer=optimizer)

    def build_generator(self):

        noise_shape = (100,)

        model = Sequential()

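        model.add(Dense(256, input_shape=noise_shape))
        # Unit count reconstructed from img_shape: one tanh output per pixel, in [-1, 1]
        model.add(Dense(int(np.prod(self.img_shape)), activation='tanh'))
        # Reshape the flat vector into an image so the discriminator can consume it
        model.add(Reshape(self.img_shape))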


        noise = Input(shape=noise_shape)
        img = model(noise)

        return Model(noise, img)

    def build_discriminator(self):

        img_shape = (self.img_rows, self.img_cols, self.channels)

        model = Sequential()

        # Flatten the image so the dense layer can consume it
        model.add(Flatten(input_shape=img_shape))
        model.add(Dense(1, activation='sigmoid'))

        img = Input(shape=img_shape)
        validity = model(img)

        return Model(img, validity)

    def train(self, epochs, batch_size=128, save_interval=50):

        # Load the dataset
        (X_train, _), (_, _) = mnist.load_data()

        # Rescale -1 to 1
        X_train = (X_train.astype(np.float32) - 127.5) / 127.5
        X_train = np.expand_dims(X_train, axis=3)

        half_batch = int(batch_size / 2)

        for epoch in range(epochs):

            # ---------------------
            #  Train Discriminator
            # ---------------------

            # Select a random half batch of images
            idx = np.random.randint(0, X_train.shape[0], half_batch)
            imgs = X_train[idx]

            noise = np.random.normal(0, 1, (half_batch, 100))

            # Generate a half batch of new images
            gen_imgs = self.generator.predict(noise)

            # Train the discriminator
            d_loss_real = self.discriminator.train_on_batch(imgs, np.ones((half_batch, 1)))
            d_loss_fake = self.discriminator.train_on_batch(gen_imgs, np.zeros((half_batch, 1)))
            d_loss = 0.5 * np.add(d_loss_real, d_loss_fake)

            # ---------------------
            #  Train Generator
            # ---------------------

            noise = np.random.normal(0, 1, (batch_size, 100))

            # The generator wants the discriminator to label the generated samples
            # as valid (ones)
            valid_y = np.array([1] * batch_size)

            # Train the generator
            g_loss = self.combined.train_on_batch(noise, valid_y)

            # Print the progress
            print("%d [D loss: %f, acc.: %.2f%%] [G loss: %f]" % (epoch, d_loss[0], 100*d_loss[1], g_loss))

            # If at save interval => save generated image samples
            if epoch % save_interval == 0:
                self.save_imgs(epoch)

    def save_imgs(self, epoch):
        r, c = 5, 5
        noise = np.random.normal(0, 1, (r * c, 100))
        gen_imgs = self.generator.predict(noise)

        # Rescale images 0 - 1
        gen_imgs = 0.5 * gen_imgs + 0.5

        fig, axs = plt.subplots(r, c)
        cnt = 0
        for i in range(r):
            for j in range(c):
                axs[i,j].imshow(gen_imgs[cnt, :,:,0], cmap='gray')
                cnt += 1
        fig.savefig("gan/images/mnist_%d.png" % epoch)

if __name__ == '__main__':
    gan = GAN()
    gan.train(epochs=30000, batch_size=32, save_interval=200)

5. Summary of Generative Adversarial Networks

The basic principle of GANs is that we have two neural networks, the generator and the discriminator, that both learn from each other.

The generator learns to produce good samples because the discriminator learns to tell the difference between a generated sample and a real image.

GAN Cost Functions

We looked at the GAN cost function and started with binary cross entropy for the discriminator, but saw that there is a problem if we set the generator to the negative of it.

To resolve this we created a new generator cost to be a non-saturating heuristic.

We then looked at a specific architecture that performs well: DCGAN.

A few notable features of the DCGAN include batch normalization, the Adam optimizer, and fractionally-strided convolutions.

You may have noticed that the cost doesn't converge when we train a GAN. Usually, in machine learning the cost function decays nicely if everything is working correctly.

The reason this doesn't happen with GANs is that each network is constantly trying to minimize its own cost, which ends up increasing the other's.
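This behaviour is not specific to neural networks. As a toy illustration (an assumed stand-in, not a GAN), consider two players sharing the cost $f(x, y) = xy$: one adjusts $x$ to minimize it, the other adjusts $y$ to maximize it. The equilibrium is at $(0, 0)$, yet simultaneous gradient steps circle around it instead of settling. The function name and step size below are hypothetical.

```python
import numpy as np

def simultaneous_gradient_play(x=1.0, y=1.0, lr=0.1, steps=100):
    """Player 1 minimizes f(x, y) = x * y over x; player 2 maximizes it over y."""
    history = []
    for _ in range(steps):
        grad_x, grad_y = y, x  # partial derivatives of x * y
        # Both players update at the same time, each using the other's old value
        x, y = x - lr * grad_x, y + lr * grad_y
        history.append((x, y))
    return np.array(history)

history = simultaneous_gradient_play()

# Distance from the equilibrium (0, 0) after each step
radii = np.hypot(history[:, 0], history[:, 1])
print(radii[0], radii[-1])
```

Each player's step undoes part of the other's progress, so the distance from the equilibrium grows rather than shrinks, and neither cost decays smoothly.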

So how do we interpret these results?

Interestingly, GANs can be thought of from the perspective of reinforcement learning.

In particular, the generator is like the agent, which is taking random actions in the beginning.

We then get back a reward from the discriminator $D(G(z))$ and its gradient.

The generator never actually sees what real images look like.

Instead, it just learns to generate realistic images based on rewards from the discriminator.

To conclude, GANs are powerful tools that perform extremely well at generating realistic samples of images.