*This article was originally posted on 04/20/2019 and updated on 11/09/2022.*

This guide is dedicated to understanding the application of neural networks to reinforcement learning.

Deep reinforcement learning is at the cutting edge of what we can do with AI.

From self-driving cars to superhuman video game players and robotics, deep reinforcement learning is at the core of many of the headline-making breakthroughs we see in the news.

Reinforcement learning has been around since the 1970's, but the true value of the field is only just being realized.

Before we get into deep reinforcement learning, let's first review supervised, unsupervised, and reinforcement learning.

## Supervised vs. Unsupervised vs. Reinforcement Learning

Here's a recap from our introductory Guide to Reinforcement Learning:

- Reinforcement learning is quite different from both supervised and unsupervised machine learning techniques
- Supervised and unsupervised learning algorithms make predictions **about data** - for example classifying an image, or identifying anomalies in data.
- Reinforcement learning is about training an **agent** to operate in an environment through interaction in order to **maximize reward**

In reinforcement learning, our agent is interacting with data, but the key difference is that its actions affect the environment, and the agent has a goal it wants to reach.

In this article we will be making use of OpenAI Gym.

This tool allows us to train reinforcement learning agents in simulated environments.

Using this tool we will look at examples of a few RL environment classics, such as:

- Cart-Pole
- Mountain Car
- Atari games like Breakout

Let's start by reviewing key concepts in reinforcement learning.

## Key Concepts in Reinforcement Learning

The concepts we will review include:

- Markov Decision Processes (MDPs)
- Dynamic Programming
- Monte Carlo Methods
- Temporal Difference Learning
- Approximation Methods for Reinforcement Learning
- Deep Learning
- Introduction to OpenAI: CartPole
- RBF Neural Networks
- TD Lambda
- Policy Gradient Methods
- Deep Q-Learning
- A3C: Asynchronous Advantage Actor-Critic
- Summary: Deep Reinforcement Learning

This guide is based on notes from this course: Advanced AI: Deep Reinforcement Learning in Python.

## 1. Markov Decision Processes (MDPs)

A Markov Decision Process is really just a collection of 5 things:

- A set of states
- A set of actions
- A set of rewards
- A state transition probability
- And a discount factor (gamma)

Let's review each one of these in a bit more detail.

#### States

- A state is the observation that an agent receives from the environment.
- For example, in the game of Go it would be your positions on the board.
- In a video game the state would be the pixels on the screen, and could include things like your agent's health, points, lives left, etc.

#### Actions

- The action set contains all possible actions that an agent can take in a given state.
- This could be moving up or down in a video game, moving a piece on a board, etc.

#### Rewards

- Rewards are provided to an agent at each step of the game by the environment.
- The goal of an agent is to maximize cumulative total future reward

#### State-Transition Probabilities

- There are a few reasons we need to assign a probability to transitioning from one state to another
- One is that not all environments are deterministic and can have a source of randomness
- The other is that the observation of the state can sometimes be imperfect

#### Discount Factor

- Since we are dealing with time-based environments in reinforcement learning, we need to discount rewards that are further in the future
- The reason is that the further we look into the future, the harder it is to predict
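As a rough illustration of how the discount factor weights rewards, here is a minimal Python sketch. The function name `discounted_return` and the example rewards are made up for illustration and are not taken from the course code.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward k steps ahead is weighted by gamma**k."""
    g = 0.0
    # Work backwards so that each step computes G_t = r_t + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Three rewards of 1 with gamma = 0.9 give 1 + 0.9 + 0.81
total = discounted_return([1.0, 1.0, 1.0], gamma=0.9)
```

With `gamma=1.0` every reward counts equally, while smaller values of `gamma` make the agent care less about distant rewards.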

This brings us to the next important concept.

#### Value Function

- Since rewards are probabilistic, and our total reward is the sum of our rewards, the total reward is also probabilistic
- The value function is the expected total reward from this point onwards

#### State-Value and Action-Value

- The state-value is denoted $V(s)$
- The action-value is denoted $Q(s, a)$
- We typically use $Q$ for learning the optimal policy, which we call the *control problem*
- Finding the state-value or action-value given a fixed policy is called the *prediction problem*
- A policy is denoted by $π$, and can be either deterministic or probabilistic

#### Episodes

- Another key concept is that of an episode.
- For example, this could be finishing a game of Go
- Tasks that have a defined end point - like video games - are called **episodic tasks**
- Tasks that have no end-point - like trading in the markets - are called **continuing tasks**
- We call the final state in which the episode ends the **terminal state**, and since the value function is the expected *future* reward, the value of the terminal state is always 0

## 2. Dynamic Programming

The first solution to Markov Decision Processes (MDPs) is called dynamic programming.

Dynamic programming was pioneered by Richard Bellman, and also gave us the Bellman Equation.

The Bellman equation allows us to define the value function recursively, and this is a great article if you want to learn more about the Bellman equation.
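For reference, one common way to write the Bellman equation for the state-value function under a policy $π$ is shown here. It uses the discount factor $γ$ and the transition probability $p(s', r | s, a)$ in the same notation this guide uses elsewhere:

$$V_π(s) = \sum_{a} π(a | s) \sum_{s', r} p(s', r | s, a) \left[ r + γ V_π(s') \right]$$

The value of a state is written in terms of the values of the states that can follow it, which is what makes the definition recursive.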

Several dynamic programming algorithms include:

#### Iterative Policy Evaluation

- This is called the prediction problem
- Given a policy π, it finds the value function

#### Policy Iteration

- We use this for solving the control problem
- The policy iteration is inefficient because the outer loop is iterative, which means we have to go back and forth from policy evaluation and policy improvement until the policy converges

#### Value Iteration

- The solution to Policy Iteration's inefficiency is Value Iteration
- Instead of doing many iterations of the policy evaluation to converge, we just do 1 iteration and the algorithm still converges
- We also don't need to do the policy improvement step at all - since it's the `argmax`, we just put a `max` into the value update step and we're effectively doing both policy evaluation and policy improvement at the same time
- This idea of taking a `max` when updating the value function is also part of Q-Learning
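To make the `max` in the value update concrete, here is a minimal sketch of Value Iteration in Python. The three-state chain, the action names, and the `transitions` layout are hypothetical and chosen only to keep the example small.

```python
# Hypothetical 3-state chain: states 0 and 1 are non-terminal, state 2 is terminal.
# transitions[s][a] is a list of (probability, next_state, reward) outcomes.
transitions = {
    0: {"left": [(1.0, 0, 0.0)], "right": [(1.0, 1, 0.0)]},
    1: {"left": [(1.0, 0, 0.0)], "right": [(1.0, 2, 1.0)]},
}
TERMINAL = 2


def value_iteration(transitions, gamma=0.9, tol=1e-8):
    V = {s: 0.0 for s in transitions}
    V[TERMINAL] = 0.0  # the value of the terminal state is always 0
    while True:
        biggest_change = 0.0
        for s, actions in transitions.items():
            # The max over actions replaces a separate policy improvement step
            best = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for outcomes in actions.values()
            )
            biggest_change = max(biggest_change, abs(best - V[s]))
            V[s] = best
        if biggest_change < tol:
            return V


V = value_iteration(transitions)
```

Note that the function needs the full `transitions` table, which is exactly the model of the environment that dynamic programming requires.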

To summarize, dynamic programming provides a foundation for reinforcement learning, but it is not very practical.

We need to loop through all the states on every iteration, and the state space can be very large or even infinite, since the number of states grows exponentially with the number of state variables.

Dynamic programming also requires a model of the environment, specifically knowing the state-transition probability - $p(s', r | s, a)$. As you can imagine, trying to estimate these over all states and actions is a difficult task.

Dynamic programming also doesn't require the agent to play the game, so it's not really learning from experience.

## 3. Monte Carlo Methods

Unlike Dynamic Programming, Monte Carlo methods are all about learning from experience.

Any expected value can be approximated by sample means - in other words, all we need to do is play a bunch of episodes, gather the returns, and average them.
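As a minimal sketch of this idea, the following Python function averages the returns observed after the first visit to each state. The episode format (a list of `(state, reward)` pairs) and the two toy episodes are assumptions made for this example.

```python
from collections import defaultdict


def mc_prediction(episodes, gamma=1.0):
    """First-visit Monte Carlo: average the return that follows each state's first visit.

    Each episode is a list of (state, reward) pairs, where reward is the
    reward received after leaving that state.
    """
    returns = defaultdict(list)
    for episode in episodes:
        g = 0.0
        first_visit_return = {}
        # Walk backwards so that g is the return from each time step onwards
        for state, reward in reversed(episode):
            g = reward + gamma * g
            first_visit_return[state] = g  # earlier visits overwrite later ones
        for state, g_first in first_visit_return.items():
            returns[state].append(g_first)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}


episodes = [
    [("A", 0.0), ("B", 1.0)],
    [("A", 0.0), ("B", 3.0)],
]
V = mc_prediction(episodes)
```

Any state that never appears in `episodes` is simply missing from `V`.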

It's important to note that Monte Carlo methods only give us a value for states we've encountered, and if we never encounter a state its value is unknown.

One disadvantage of Monte Carlo is that it doesn't always work.

Remember that it requires us to play a bunch of episodes...but what if we're not doing an episodic task?

Another problem with Monte Carlo is that it can leave many states unexplored.

## 4. Temporal Difference Learning

Temporal difference (TD) learning is unique to reinforcement learning.

With Monte Carlo we need to sample returns based on an episode, whereas with TD learning we estimate returns based on the estimated current value function.

To do this we look at `TD(0)` - instead of sampling the return `G`, we estimate `G` using the current reward and the next state value.

**This allows for true online machine learning.**

Instead of needing to wait for the entire episode to finish like in Monte Carlo, we only need to wait until `t+1` in order to update the state value at time `t`.
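A minimal sketch of a tabular `TD(0)` update might look like the following. The function name, the learning rate `alpha`, and the toy values are assumptions for illustration.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9, done=False):
    """Update V[s] in place after observing a single step (s, r, s_next)."""
    # Estimated return: the current reward plus the discounted value of the next state.
    # A terminal next state has value 0, so there is nothing to bootstrap from.
    target = r if done else r + gamma * V[s_next]
    V[s] += alpha * (target - V[s])
    return V


V = {"A": 0.0, "B": 1.0}
td0_update(V, "A", r=0.5, s_next="B")
```

The update only needs the transition that just happened, which is why it can run after every step instead of at the end of an episode.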

## 5. Approximation Methods for Reinforcement Learning

Dynamic programming, Monte Carlo, and Temporal Difference really only work well for the smallest of problems.

These methods don't work that well for games that get to billions, trillions, or an infinite number of states.

**Approximation methods solve this problem.**

Supervised learning can be used for function approximation, and we are interested in estimating V or Q.

To do this we convert each state `s` into a feature vector and treat the return `G` as the target.

Since reward is a real value, so too is our target variable - therefore we are going to use regression and the appropriate loss function is squared error.
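As a rough sketch, a linear approximator trained with squared error could look like this in plain Python. The feature vectors, target returns, and learning rate are hypothetical.

```python
def predict(w, x):
    """Linear value estimate: dot product of the weights and the feature vector."""
    return sum(wi * xi for wi, xi in zip(w, x))


def sgd_step(w, x, g, lr=0.1):
    """One gradient descent step on the squared error 0.5 * (g - w.x)**2."""
    error = g - predict(w, x)
    return [wi + lr * error * xi for wi, xi in zip(w, x)]


# Hypothetical feature vectors for two states and the returns observed from them
data = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0)]
w = [0.0, 0.0]
for _ in range(200):
    for x, g in data:
        w = sgd_step(w, x, g)
```

After training, `predict(w, x)` approximates the return for each feature vector, and the same loop works unchanged for states that were never stored in a table.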

In addition to linear models, all deep learning techniques can be used for approximation.

We can also use libraries like TensorFlow to do automatic differentiation.

## 6. Deep Learning

As we've discussed in other posts, the terms deep learning and neural networks are often used interchangeably.

Neural networks are really just multiple logistic regressions stacked together.

The layers in between our input data and output are referred to as hidden layers.

This network is referred to as a **feedforward neural network.**

What makes neural networks a nonlinear approximation method is that they have a nonlinear activation function in between each layer.

It is this nonlinearity that allows us to model complex functions.

One thing to note is that deep networks trained with gradient descent are sensitive to the hyperparameters we set, such as:

- The learning rate
- The # of hidden layers
- The # of hidden units
- The activation function
- The optimizer

Since deep neural networks are sensitive to hyperparameters we always need to experiment with these, and it can be more of an art than a science.

One of the nice things about neural networks is that they save us from having to do a lot of manual feature engineering.

The reason for this is that the nonlinear layers of the network have been shown to learn features automatically.

#### Convolutional Neural Networks (CNNs)

Since we are dealing with image data so much when teaching a robot how to interact in an environment, convolutional neural networks work well here.

We won't get into CNNs here, but you can check out these guides on the subject:

- What are Convolutional Neural Networks? A Complete Guide to CNNs
- How to Build a Convolutional Neural Network in Python with Keras

To summarize them - CNNs use a variation of multilayer perceptrons designed to require minimal preprocessing.

#### Recurrent Neural Networks (RNNs)

In reinforcement learning we are also dealing with sequential data, since each episode is made up of a sequence of states, actions, and rewards.

The main class of deep learning models for sequential data is recurrent neural networks (RNNs).

If you want to read more about RNNs check out our Complete Guide to Recurrent Neural Networks.

To summarize - any network in which a node loops back to an earlier node is a recurrent neural network.

#### Classification vs. Regression

Many introductory machine learning courses start with classification problems - like classifying handwritten digits with CNNs.

In reinforcement learning, however, we want to use __regression__ to predict a real value, which is the __value function__.

This means that we will have 1 output node and will use squared error instead of cross-entropy error.

## 7. OpenAI Gym Environments: CartPole with Bins

Let's start with a simple example of reinforcement learning: CartPole with Bins.

Before adding complexity, we're first going to solve it with a tabular method: Q-Learning.

CartPole has the following features:

- The state space consists of continuous variables, in other words the state space is infinite
- There are some states that are more likely than others

To solve the problem of infinite state spaces we can cut the relevant part of the state space into boxes (since some parts are unreachable) to create a discrete state space.

Now we can use the tabular method.

It's important to note that the idea of cutting up state spaces into boxes has some hidden complexities, for example:

- How do we choose the lower and upper limits of the box?
- What if we end up in a state outside the box?

### Implementation

To implement this in code we need to convert the state into a bin so that we can use it to index a dictionary or an array.
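One possible way to do this conversion is sketched here using only the Python standard library. The helper names and the lower and upper limits chosen for each CartPole variable are assumptions for illustration and would need tuning in practice.

```python
from bisect import bisect_right


def make_bins(low, high, n_bins):
    """Return the n_bins - 1 interior edges that split [low, high] evenly."""
    width = (high - low) / n_bins
    return [low + width * i for i in range(1, n_bins)]


def to_bin(value, edges):
    """Index of the bin containing value; out-of-range values fall into the end bins."""
    return bisect_right(edges, value)


def state_to_key(state, all_edges):
    """Convert a continuous state into a tuple of bin indices usable as a dict key."""
    return tuple(to_bin(v, edges) for v, edges in zip(state, all_edges))


# Hypothetical limits for the four CartPole variables
# (cart position, cart velocity, pole angle, pole angular velocity)
all_edges = [
    make_bins(-2.4, 2.4, 10),
    make_bins(-2.0, 2.0, 10),
    make_bins(-0.4, 0.4, 10),
    make_bins(-3.5, 3.5, 10),
]
key = state_to_key([0.1, 5.0, -1.0, 0.1], all_edges)
```

In this sketch, a value outside the chosen limits is clamped into the first or last bin, which is one simple answer to the question of what to do with states outside the box.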

The default reward is +1 at each time step - and this doesn't work very well.

What makes more sense is giving a large negative reward (-100) if the pole falls, which incentivizes the agent to avoid reaching this point.
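A minimal sketch of the modified reward combined with a tabular Q-Learning update follows. The function names, the 200-step episode limit, and the hyperparameter values are assumptions, and the linked course code may differ.

```python
from collections import defaultdict

N_ACTIONS = 2  # CartPole has two actions: push left or push right


def shaped_reward(reward, done, step, max_steps=200):
    """Replace the reward with -100 when the episode ends early (the pole fell)."""
    if done and step < max_steps - 1:
        return -100.0
    return reward


def q_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.9):
    """Tabular Q-Learning update on a dict that maps state keys to lists of action values."""
    best_next = 0.0 if done else max(Q[s_next])
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])


Q = defaultdict(lambda: [0.0] * N_ACTIONS)
r = shaped_reward(1.0, done=True, step=30)
q_update(Q, s=(5, 5, 5, 5), a=1, r=r, s_next=(5, 6, 5, 5), done=True)
```

Here the state keys are tuples of bin indices, so unseen states are created on demand with all action values set to 0.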

The code for CartPole with Bins using Q-Learning can be found here.

After 10000 episodes the average reward for the last 100 episodes was 181.46.

From the OpenAI Gym CartPole documentation:

CartPole-v0 defines "solving" as getting average reward of 195.0 over 100 consecutive trials.

We are close but not there yet.

## 8. RBF Neural Networks

RBF stands for Radial Basis Function, and they allow us to proceed to the next step of using function approximation.

There are two ways to think about an RBF Network...

- The first way is that it is a linear model with feature extraction, where the feature to be extracted is a RBF kernel
- The next way is to think of it as a 1-layer hidden neural network, with an RBF kernel as the activation function

#### So what is a Radial Basis Function?

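An RBF is an unnormalized Gaussian bell curve, centered at some center point that we call $c$. One common form is $φ(x) = \exp(-\lVert x - c \rVert^2 / (2σ^2))$, where $σ$ controls the width of the bell.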

- $x$ is an input vector
- $c$ is a vector that lives in the same vector space as the input vectors
- The function only depends on the distance between $x$ and $c$, hence the term "radial"
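To illustrate the "radial" property, here is a small Python sketch of a Gaussian RBF. The width parameter `sigma`, the function names, and the example centers are assumptions made for this example.

```python
import math


def rbf(x, c, sigma=1.0):
    """Gaussian RBF: depends only on the Euclidean distance between x and c."""
    sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    return math.exp(-sq_dist / (2 * sigma ** 2))


def rbf_features(x, centers, sigma=1.0):
    """One feature per center, i.e. one per hidden unit."""
    return [rbf(x, c, sigma) for c in centers]


centers = [[0.0, 0.0], [1.0, 1.0]]  # hypothetical center points
features = rbf_features([0.0, 0.0], centers)
```

An input that sits exactly on a center gets a feature value of 1, and the value decays towards 0 as the input moves away in any direction.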

**So how do you choose c and how many c's should we use?**

- The number of $c$ is the number of hidden units there will be in the hidden layer
- Each hidden unit corresponds to a different RBF with a different center

These centers are also called *exemplars*, and there are a few different ways to choose them...

- In Support Vector Machines (SVM), the number of exemplars is equal to the number of training points, and *they are* in fact the training points themselves.
- SVMs were once thought to be superior to neural networks, but this is why SVMs are not as popular today - they don't scale to the data sizes that we have today.

Another way to choose exemplars is to just sample a few points from the state space, which allows us to limit the number of exemplars we use. We can do this with `env.observation_space.sample()`.

At this point the number of exemplars we choose is similar to how many hidden units we use in a neural network, meaning it is a hyperparameter that must be tuned.

The capability to do the RBF kernel transformation is built into scikit-learn. The `RBFSampler` is a Monte Carlo algorithm that approximates the RBF kernel, which allows us to perform the computation much faster.

The standard scikit-learn interface is as follows:

- Create an instance
- Call the `fit` function
- And then we can call `transform`

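```
import numpy as np
from sklearn.kernel_approximation import RBFSampler

raw_data = np.random.randn(100, 2)  # placeholder data: 100 samples, 2 features
sampler = RBFSampler()
sampler.fit(raw_data)  # sets up the random projection for this input size
features = sampler.transform(raw_data)  # one column per component
```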

As mentioned, the other way to think about RBF networks is as a 1-hidden layer neural network.

One of the functions we'll be plotting below is called the **cost-to-go function.**

- This is the negative of the optimal value function V*(s)
- Because the state space is only 2 variables, this allows us to plot the cost-to-go function in 3D.

Alright enough theory, let's now code an implementation of Q-Learning with an RBF network.

**OpenAI Gym Environments: RBF Networks with Mountain Car**

This code is from LazyProgrammer Github and from this Deep Reinforcement Learning course.

Here are the results:

- The average reward for last 100 episodes was -141.98

## 9. TD Lambda

Let's now extend our knowledge of Temporal Difference methods.

In particular, there is a generalization of $TD(0)$ that is called TD Lambda.

As we'll see, TD Lambda gives us a trade-off between one-step learning and Monte Carlo: $λ = 0$ gives us $TD(0)$, and $λ = 1$ gives us Monte Carlo.

Any other λ is a trade-off between the two.

The first step to understanding TD(λ) is to understand the N-step method, which works as follows:

- We know that the value function can be described recursively with Bellman's equation
- We also know that we can estimate $V$ by taking the sample mean of returns (G's) for many episodes
- We also know from $TD(0)$ that we can estimate $G$ by using our current estimate of $V$

It's important to note that the most accurate thing in our estimate of $G$ is the reward $r$ because it's an actual sample we got from our experiment.

It is thus plausible to ask whether our estimate of $G$ becomes more accurate if we use more r's and less of $V$.

Since Monte Carlo means that our agent can only be updated after the episode is over, and $TD(0)$ means the agent can be updated after one step, then using N-steps of the reward means you have to wait N-steps before the agent can be updated.

It's key to note that the only change is to $G$ itself, and the same thing applies to tabular methods and function approximation methods.
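Since only the target $G$ changes, the N-step return can be sketched as a single small function. The name `n_step_return` and its arguments are assumptions for illustration.

```python
def n_step_return(rewards, v_next, gamma=0.9):
    """N-step return: n actual rewards, then bootstrap from the value estimate.

    rewards: the n rewards r_t, ..., r_{t+n-1} observed after time t
    v_next:  the current estimate V(s_{t+n}), or 0.0 if that state is terminal
    """
    g = v_next
    for r in reversed(rewards):
        g = r + gamma * g
    return g


one_step = n_step_return([1.0], v_next=2.0, gamma=0.5)       # the TD(0) target
two_step = n_step_return([1.0, 1.0], v_next=4.0, gamma=0.5)
```

Passing a single reward reproduces the `TD(0)` target, while passing every reward until the end of the episode with `v_next=0.0` reproduces the Monte Carlo return.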

Let's look at how the N-step method works with MountainCar, the relevant code can be found here.

The average rewards for the last 100 episodes was -131.46.

#### TD Lambda

Let's now discuss $TD(λ)$ as a generalization of the N-step method.

$λ$ is a parameter associated with a vector called the "eligibility trace", which essentially keeps track of old gradients, much like momentum in deep learning.

Since the code for the N-step method is complicated, $TD(λ)$ gives us a more elegant method for the trade-off between $TD(0)$ and Monte Carlo.

In particular, we won't have to keep track of the last N steps and we can update after just 1 step like in TD(0).
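For a tabular value function, one way to sketch this single-step update is with an accumulating eligibility trace. The function name, the dictionaries `V` and `e`, and the hyperparameter values are assumptions made for this example.

```python
def td_lambda_update(V, e, s, r, s_next, done, alpha=0.1, gamma=0.9, lam=0.5):
    """One online TD(lambda) step for a tabular V with accumulating traces e."""
    target = r if done else r + gamma * V[s_next]
    delta = target - V[s]  # the same TD error used by TD(0)
    # Decay every trace, then bump the trace of the state we just left
    for state in e:
        e[state] *= gamma * lam
    e[s] += 1.0
    # Every state is updated in proportion to its trace
    for state in V:
        V[state] += alpha * delta * e[state]


V = {"A": 0.0, "B": 0.0, "C": 0.0}
e = {state: 0.0 for state in V}
td_lambda_update(V, e, "A", 0.0, "B", done=False)
td_lambda_update(V, e, "B", 1.0, "C", done=True)
```

In the tabular case the gradient of `V[s]` is 1 for the visited state and 0 elsewhere, so the trace is a decaying record of recently visited states. With `lam=0` the traces vanish after one step and only the current state is updated, as in `TD(0)`.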

To recap:

- $λ = 0$ gives us $TD(0)$
- $λ = 1$ gives us Monte Carlo
- Any other $λ$ is a trade-off of the two

In the $TD(λ)$ method, each N-step return is given a specific weight - in particular, the weights decrease geometrically as $λ^{n-1}$.

Since these weights don't sum to 1 we need to normalize them, and we want to normalize for the case where n = ∞.

Since the sum of these weights is $1 / (1 - λ)$, we divide the weighted sum of the returns by this amount, which is the same as multiplying by $(1 - λ)$. We call the result the λ-return.
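Written out, and assuming the common notation where $G_t^{(n)}$ denotes the N-step return starting at time $t$, the λ-return is:

$$G_t^{λ} = (1 - λ) \sum_{n=1}^{\infty} λ^{n-1} G_t^{(n)}$$

The factor $(1 - λ)$ is the normalization, because $\sum_{n=1}^{\infty} λ^{n-1} = 1 / (1 - λ)$ for $0 \le λ < 1$.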

**Let's run some TD(λ) code on MountainCar, from this repo:**

**To summarize, TD(λ) can be applied to any situation where you would use Q-Learning, SARSA, or Monte Carlo.**

- The N-Step method provides a bridge between TD(0) and Monte Carlo
- We can combine all the N-step returns up to infinity and this gives us the λ-return

## 10. Policy Gradient Methods

Let's now look at another method of solving the control problem: Policy Gradient.

We have so far been parameterizing the value function, but we can also parameterize the policy.

In particular, the goal is to find the optimal policy: π*.

So what does a parameterized policy look like?

We know that the policy has to be a probability - in particular, we can score each action a using a linear model, or any other kind of model.

We can then use the `softmax` function to turn these scores into probabilities.
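As a minimal sketch, a linear scoring model followed by `softmax` could look like this. The weight layout in `theta` (one weight vector per action) and the example numbers are assumptions.

```python
import math


def softmax(scores):
    """Turn arbitrary real-valued scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def action_scores(theta, features):
    """Linear score per action: theta holds one weight vector per action."""
    return [sum(w * x for w, x in zip(weights, features)) for weights in theta]


theta = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # hypothetical weights for 3 actions
probs = softmax(action_scores(theta, [2.0, 0.0]))
```

The action with the highest score gets the highest probability, but every action keeps a non-zero chance of being selected.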

It's important to note that for a policy to be considered optimal, it must have some objective.

So the question is: what should the objective be?

We won't get into the math behind it in this article, but this article on the subject does a great job.

Instead, let's talk about why we would want to use the policy gradient method.

The policy gradient method yields a probabilistic policy. This is similar to epsilon-greedy, except policy gradient is much more *expressive*. With epsilon-greedy, all the sub-optimal actions have the same probability, even though some might be better than others.

With a policy gradient method, we can model these differences.

For example, it may be optimal to perform an action A 90% of the time, B 6% of the time, and C 4%.

Another reason we want to use policy gradient methods is that states may be stochastic.

One of the sources of this randomness could be that the state does not give you full information about the environment. For example, in trading you don't know exactly what other traders are going to do next.

To account for this, the optimal policy must be probabilistic so that it covers the different possibilities.

**To summarize policy gradient methods:**

- We can parameterize the policy with a `softmax` output which gives us a probabilistic policy
- The objective of the policy is to optimize expected return from the start state $V(s_0)$
- We call this objective the "performance"

#### Policy Gradient with TensorFlow for Cartpole

Here are the results after running this code:

**Policy Gradient for Continuous Action Spaces**

Let's extend our knowledge of policy gradient methods by looking at continuous action spaces.

So far we've just looked at the Cartpole and Mountain Car environments, which both have discrete action spaces.

Luckily, the MountainCarContinuous environment lets us test continuous action spaces.

#### Defining a Continuous Policy

Let's think about our current policy model:

- It allows us to choose from a discrete set of actions
- The main idea is that it's a probability distribution

So how can we go about creating a distribution for a continuous action space?

Technically we can choose any distribution, but let's start with Gaussian.

Gaussian distributions have 2 parameters - mean and variance.

This means that in order to create a parameterized policy, we need to parameterize the mean and variance.
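One way to sketch such a parameterized Gaussian policy is shown here, assuming linear models for both parameters and a softplus function to keep the variance positive. These modelling choices and all names are assumptions, not necessarily what the course code does.

```python
import math
import random


def softplus(z):
    """Smooth function that maps any real number to a positive one."""
    return math.log1p(math.exp(z))


def gaussian_policy(features, mean_weights, var_weights):
    """Return the (mean, variance) of the action distribution for one state."""
    mean = sum(w * x for w, x in zip(mean_weights, features))
    # The variance must be positive, so pass the linear output through softplus
    variance = softplus(sum(w * x for w, x in zip(var_weights, features)))
    return mean, variance


def sample_action(mean, variance, rng=random):
    """Draw one continuous action from the Gaussian."""
    return rng.gauss(mean, math.sqrt(variance))


mean, variance = gaussian_policy([1.0, 0.5], [0.2, -0.4], [0.0, 0.0])
action = sample_action(mean, variance, random.Random(0))
```

Training then adjusts `mean_weights` and `var_weights`, so the policy can learn both which action to prefer and how much to explore around it.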

It's important to remember that the policy gradient method doesn't change because we create a different policy model.

One final thing to mention is that a continuous action space is impractical with standard Q-Learning, since it requires taking a `max` over all actions.

Q-Learning and Deep Q-Learning with function approximation both allow us to deal with infinite state spaces, but not infinite action spaces.

With Policy Gradient methods, however, continuous action spaces are possible.

#### MountainCarContinuous with TensorFlow

The reward structure is different for the MountainCarContinuous environment, in particular:

- If you get to the goal, you get +100 reward
- The sum of squared actions is subtracted from the reward
- If the size of your actions is larger, you get a larger penalty
- Your agent is thus incentivized to take smaller actions

This environment is considered solved if you can get a total reward of 90 averaged over 100 episodes.

To solve this we can either use a type of search called hill climbing, or we can use gradient descent.

Here are the results after running this code, which uses gradient descent:

#### Summary: Policy Gradient Methods

To summarize:

- Instead of just modeling the value function $V(s)$ or $Q(s, a)$, we also create a model for the policy $π(a | s)$
- This allows us to create a probabilistic policy

To do this, we need to:

- Parameterize the policy
- Optimize the policy by creating a new objective function

**One of the main benefits of policy gradient methods is that it is suitable for continuous action spaces.**

Instead of outputting a discrete action with `softmax`, we output a Gaussian (or other) distribution of actions.

## 11. Deep Q-Learning

Let's now build on all of the components we've learnt and discuss Deep Q-Learning.

Deep Q-Learning can be used for much more complex games than the algorithms we've looked at so far, such as Atari games.

Before we get into Deep Q-Learning, here are a few practical issues:

- It takes a long time to train an agent
- The reason for this is that the state space in Atari games like Breakout is much bigger than in MountainCar and Cartpole
- Basically, the patterns our agent has to learn are much more complex

Some agents will take over a week to train, even on a GPU. The computing cost of this can add up quickly, especially if we want to test different hyperparameter settings.

We are still going to discuss the concepts and write the code, but testing it yourself can be impractical.

#### Deep Q-Learning Techniques

A Deep Q-Network is really just a deep neural network used for Q-Learning, meaning it is just a function approximator that is used for $Q(s, a)$.

One important subtlety of DQN is that rather than transforming the state-action pair into a feature, we instead only transform the state into an input feature and have multiple output nodes - each representing a different action `a`.

An important part of DQN is that we use what's called **Experience Replay**.

Experience Replay is a step forward from Monte Carlo, which returns a set of state-action-return $(s, a, G)$ that we use to train the function approximator.

With Experience Replay on the other hand, we use a replay buffer (the size of which is chosen by the programmer and is a hyperparameter).

Inside the replay buffer we store 4-tuples of state, action, reward, next state $(s, a, r, s')$. To train the DQN we sample a random mini-batch from the replay buffer and use that as training data. The buffer acts as a queue - or FIFO (first-in first-out) - so it always contains the most recent tuples, up to its size.

When we take these random samples we get a better representation of the true data distribution.

Since with Q-Learning training happens on every step, we initially build the buffer with random experience. For example, it will just keep taking random actions until the buffer is filled.
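A minimal sketch of such a buffer can be built on `collections.deque`. The class name, the extra `done` flag stored with each transition, and the tiny capacity are assumptions for illustration.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size FIFO store of (s, a, r, s_next, done) transitions."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest item once capacity is reached
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size, rng=random):
        """Uniformly sample a mini-batch without replacement."""
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.add(s=t, a=0, r=1.0, s_next=t + 1, done=False)
batch = buffer.sample(2)
```

After five additions to a buffer of capacity 3, only the three most recent transitions remain, and each training step draws its mini-batch from those.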

As discussed earlier, Q-Learning is a Temporal Difference method, which uses the value function as the target return. Because of this, when we use gradient descent it is not a true gradient - we call it a semi-gradient.

For Deep Q-Learning this leads to instability.

The solution to this is to introduce another Deep Q-Network, which we call the "target network". This network is responsible for creating the target for the TD error. It is essentially a copy of the original DQN, but isn't updated as often.
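The two pieces of this idea can be sketched without any deep learning library: computing the targets from the target network's outputs, and periodically copying parameters. The function names, the `period` of 1000 steps, and the parameter dictionaries are hypothetical.

```python
import copy


def td_targets(rewards, next_q_values, dones, gamma=0.99):
    """Targets r + gamma * max_a' Q_target(s', a'), with no bootstrap at terminal states.

    next_q_values holds one list of action values per sample, produced by the
    target network rather than the network being trained.
    """
    return [
        r if done else r + gamma * max(q_next)
        for r, q_next, done in zip(rewards, next_q_values, dones)
    ]


def maybe_sync(step, main_params, target_params, period=1000):
    """Return a fresh copy of the main parameters every `period` steps."""
    if step % period == 0:
        return copy.deepcopy(main_params)
    return target_params


targets = td_targets([1.0, 0.0], [[0.5, 2.0], [3.0, 1.0]], [False, True], gamma=0.5)
```

Because the target parameters only change every `period` steps, the targets stay fixed between copies, which is what reduces the instability.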

#### Convolutional Neural Network

Since we're dealing with images in the form of raw pixels from the screen, we're going to use a convolutional neural network.

One problem with still images is that you can't tell which way objects are moving - for example, is the ball in Breakout moving up or down?

Because of this, we can't just use one image.

So with Deep Q-Learning we use a convolutional neural network, but instead of just using the previous frame, we use the previous 4 frames.

We also convert the image to grayscale since color isn't always useful for games (but this depends on the game).

This leaves us with a 3-dimensional tensor for each input - (frame #, height, width), so we're essentially just replacing color with time.

Since we're already used to working with 3-D tensors in CNNs (which use color channels), we don't need to build a special convolution for this.

#### To summarize, DQN:

- Uses a deep neural network to approximate the value function $Q(s, a)$, where each output node corresponds to a value for a specific action
- With experience replay, we store 4-tuples of $(s, a, r, s')$ in a buffer that we use for training
- We use a target network, which is a periodically updated version of the main Q-Network - this generates the target for the TD error
- We use a convolutional neural network with the previous 4 frames in grayscale to represent each input state

Here are the results of implementing deep Q-Learning with TensorFlow:

#### Playing Atari Games with Deep Q-Learning

Let's start with the Atari game Breakout.

The difference between Breakout and Cartpole or Mountain Car is that the input is much bigger.

- We're stacking 4 frames per state
- Each frame is 210x160x3
- We can make the image grayscale, which makes it 210x160
- In total we have 210x160x4 for the size of each state
- This gives us 134,400 inputs

Luckily the Atari games are pretty simple so we can downsample and crop the images without losing much useful information.

**We are also going to use built-in TensorFlow layers for Atari games.**

With convolutional neural networks we know that we need to flatten the output of the final convolutional layer in order to connect it to the first fully connected layer.

This requires us to calculate the convolution output size manually, which is not a trivial task.
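For a sense of what the manual calculation involves, here is a small sketch of the standard output-size formula for a convolution without dilation. The 84x84 frame size, the three `(kernel, stride)` pairs, and the 64 output channels are hypothetical values chosen for the example.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Output length along one dimension for a convolution without dilation."""
    return (size + 2 * padding - kernel) // stride + 1


def flattened_size(height, width, layers, out_channels):
    """Apply each (kernel, stride) pair in turn, then multiply out the final volume."""
    for kernel, stride in layers:
        height = conv_output_size(height, kernel, stride)
        width = conv_output_size(width, kernel, stride)
    return height * width * out_channels


# Hypothetical setup: 84x84 downsampled frames and three convolutional layers
layers = [(8, 4), (4, 2), (3, 1)]
n_flat = flattened_size(84, 84, layers, out_channels=64)
```

The result, `n_flat`, is the number of inputs the first fully connected layer would need for this particular configuration.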

Using built-in layers from TensorFlow makes this more manageable, including:

`tf.contrib.layers.conv2d`

`tf.contrib.layers.fully_connected`

Note that the `tf.contrib` module is only available in TensorFlow 1.x.

The code for this can be found here.

#### Partially Observable Markov Decision Processes (POMDPs)

Partially observable MDPs mean that we don't have full information about the environment.

In this case, we can combine observations to get a more accurate depiction of the state we're in.

The fact that we can combine observations across time is a key concept. In deep learning, this would be considered a sequence of observations.

As we know, the way that we deal with sequences is with recurrent neural networks (RNNs).

We can use an RNN to approximate the policy or action-value to make a function approximator that depends on previous states as well as the current state.

One thing to note is that deep Q-Learning with RNNs is computationally complex.

One complication is that we need to randomly sample from the replay buffer, but if we take random samples they won't be in a sequence anymore.

Possible solutions to this include:

- We do the same thing we did for CNN Deep Q Networks - each item in the replay buffer is a sequence of 4 observations. Instead of treating them as 4 colors we treat them as 4 subsequent steps.
- The downside is this limits the length of our sequences, and if we choose a longer length it requires more memory
- Another solution is to sample randomly from the experience replay buffer, but instead of treating each sampled index as a single sample, treat it as the start or end of a sequence (for example, the starting point for a sequence of length 10)
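The last option can be sketched as follows, treating the sampled index as the start of a sequence. The function name and the list standing in for the replay buffer are assumptions, and this simplified version does not check whether a sequence crosses an episode boundary.

```python
import random


def sample_sequence(buffer, seq_len, rng=random):
    """Pick a random start index and return seq_len consecutive transitions."""
    if len(buffer) < seq_len:
        raise ValueError("buffer holds fewer transitions than seq_len")
    start = rng.randrange(len(buffer) - seq_len + 1)
    return buffer[start:start + seq_len]


buffer = list(range(100))  # stand-in for 100 stored transitions
sequence = sample_sequence(buffer, seq_len=10, rng=random.Random(0))
```

The start index is still random, so different training steps see different parts of the buffer, but the items within each sample keep their original order.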

#### Summary: Deep Q-Learning

With Deep Q-Learning we didn't really do anything new - and all of the changes were simply hyperparameter changes.

To recap Q-Learning:

- In the basic form of Q-Learning we take the TD(0) error and do gradient descent on the Q-table (or Q approximator)
- Deep learning is also an approximation method, but we can't just plug in any neural network to RL

So the question is: what can we do to make neural networks work with reinforcement learning?

- We use experience replay
- We always sample from the most recent set of experiences because when we use TD error we're not doing "true" gradient descent - we're doing an approximation of gradient descent
- That is because our target is not a true target, it is itself a prediction from the model
- We can also use a separate target network that is copied periodically from the main network
- We also discussed the idea that we don't have to limit the input to information about the state at the current time - we can also incorporate information from previous frames
- This lets us infer velocity and motion from still frames and allows us to predict what will happen next

With these techniques we saw that we can combine deep learning with reinforcement learning. This allows us to tackle more complex video games like Breakout, and other Atari games.

## 12. A3C: Asynchronous Advantage Actor-Critic

There's actually no new theory that we need to implement the popular A3C reinforcement learning algorithm.

We covered everything in the Policy Gradient section - recall that the final form was called the "actor-critic" method.

To recap:

- The actor is a neural network that parameterizes the policy
- The critic is another neural network that parameterizes the value function
- The advantage is the return $G$ from state $s$ minus the value $V(s)$ of that state: $G - V(s)$

One difference between policy gradients and A3C is that instead of using TD(0) return, we use the N-step return.

Another minor difference is that we're going to regularize the policy loss by adding the entropy as a regularization term.

In practice, adding entropy encourages exploration.

Entropy is somewhat like variance in that it measures the spread of a distribution.

- We get maximum entropy when each event has equal probability
- We get 0 entropy when all the probability is in a single event (i.e. deterministic)
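These two extremes are easy to check numerically. The following sketch computes Shannon entropy in nats for a policy over four actions, with example probabilities that are made up for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy in nats; terms with zero probability contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


uniform = entropy([0.25, 0.25, 0.25, 0.25])     # equal probability for every action
deterministic = entropy([1.0, 0.0, 0.0, 0.0])   # all probability on one action
peaked = entropy([0.7, 0.1, 0.1, 0.1])          # somewhere in between
```

Adding this quantity to the objective rewards policies that keep their probabilities spread out, which is how it encourages exploration.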

Another improvement with the A3C algorithm is we can now incorporate neural networks as our function approximator.

Another interesting part about A3C is that it is asynchronous - it takes the algorithm we had before and makes it asynchronous.

What does that mean?

- In computing we now like to run things in parallel
- A3C has a similar concept where we achieve stability through having multiple parallel copies of our agent playing the game
- This is what allows us to make use of neural networks as our function approximator

Both A3C and Deep Q-Learning try to solve the same problem - how can we make use of neural networks for function approximation? They just do it in different ways.

**Summary: A3C**

- We learned the theory of A3C from Policy Gradients
- The difference with A3C is that we replace the TD(0) with the N-step method
- With A3C we have a bunch of agents working in parallel
- We take the average result from each agent, which leads to more stability
- This is key since stability is often lacking in RL, and this is the reason we can't simply plug in a neural network into any RL algorithm
- With Deep Q-Learning we solve this problem by creating an experience replay buffer, and having a target network
- With A3C we solve this by having agents working in parallel

## Summary: What is Deep Reinforcement Learning?

The focus of this guide was combining both deep learning and reinforcement learning.

As we know, reinforcement learning is quite different from supervised or unsupervised learning.

To recap, the concepts we discussed include:

- We reviewed Markov Decision Processes and 3 solutions: Dynamic Programming, Monte Carlo methods, and Temporal Difference Learning
- We saw that we can use OpenAI Gym environments to train our agents
- We looked at N-step methods and TD Lambda and saw that these are methods in between Monte Carlo and TD learning
- We can use Policy Gradient Methods to parameterize the policy, which allows us to handle continuous action spaces
- With Deep Q-Learning we can use experience replay, a target network, and a stack of previous frames combined into the current state in order to model motion and velocity

That's it for this guide to deep reinforcement learning. You can find additional resources below.