The problem of reinforcement learning can be formulated as sequential decision making combined with uncertainty. Specifically, an agent operates in an uncertain environment and chooses an action at each discrete time step.

Each action that an agent chooses has an immediate reward and also influences future potential reward. In this article, we'll look at how an agent selects these actions under uncertainty, which is referred to as **the policy.**

A policy is defined as a distribution over actions for each possible state.

This article is based on notes from Week 3 of this course on the Fundamentals of Reinforcement Learning and is organized as follows:

- Specifying Policies
- Value Functions
- Deriving the Bellman Equation
- Optimal Policies
- Optimal Value Functions
- Using Optimal Value Functions to Find Optimal Policies

## Specifying Policies

In the simplest terms, a policy maps each state to a single action.

The notation for a **deterministic policy** is:

$$\pi (s) = a$$

In other words, $\pi (s)$ represents the action selected in state $s$ by the policy $\pi$.

For example, $\pi$ may select the action $A_1$ in state $S_0$, and action $A_1$ in states $S_1$ and $S_2$.

From this example we can see that an agent can select the same action in multiple states and some actions may not be selected in any state.

The notation for a **stochastic policy** is:

$$\pi (a|s)$$

This policy represents the probability of selecting action $a$ in state $s$.

A stochastic policy means there are multiple actions that may be selected with non-zero probability.

There are several basic rules for stochastic probabilities, these include:

- The sum over all probability must be one for each state

$$\sum\limits_{a \in \mathscr{A}(s)} \pi (a|s) = 1$$

- Each action probability must be non-negative

$$\pi (a|s) \geq 0$$

Another important point about policies is that they depend only on the current state, as opposed to other information like time or previous states.

In other words, in a Markov decision process the current state defines all the information used to select the current action.

In summary, an agents behavior in an environment is specified by a **policy **that maps the **current state **to a set of** probabilities for taking each action. **The policy also only depends on the current state.

## Value Functions

Many real-world problems involve some aspect of delayed rewards.

In reinforcement learning, rewards capture the notion of short-term gains.

The objective of an agent, however, is to learn a policy that maximizes the cumulative* long-term* reward.

In this section, we'll look at the roles of state-values, actions values, and the relationship between value function and policies.

### State-Value Functions

A state-value function is the expected return from a given state.

Mathematically, we can describe a state-value function as follows:

$$v(s) \doteq \mathbb{E} [G_t|S_t = s]$$

Since the agent's behavior also determines how much expected reward it will receive, the value function is defined with respect to a given policy:

$$v_\pi (s) \doteq \mathbb{E}_\pi [G_t|S_t = s]$$

Here we can see both the state-value function $v_\pi (s)$ and the expected reward $\mathbb{E}_\pi$ are computed with respect to the policy $\pi$.

### Action-Value Functions

An action-value function of a state describes the expected return if the agent selects action $A$ with respect to policy $\pi$:

$$q_\pi(s,a) \doteq \mathbb{E}_\pi [G_t | S_t = s, A_t = a]$$

In summary, value functions $v_\pi (s)$ allow an agent to predict future rewards, instead of long-term outcome.

The benefits of a value function are that an agent is able to estimate rewards that may not immediately be available, and may also be random due to the stochastic nature of the environment.

The value function summarizes all possible future states by averaging over returns, allowing us to judge the quality of different polices.

In the next section, we'll look at how value functions can be computed with the Bellman equation.

## Deriving the Bellman Equation

In reinforcement learning, we want the agent to be able to relate the value of the current state to the value of future states, without waiting to observe all future rewards.

The Bellman equation is one way to formalize this connection between the value of a state and future possible states.

In this section, we'll look at how to derive the Bellman equation for state-value functions, action-value functions, and understand how it relates current and future values.

### State-Value Bellman Equation

The Bellman equation for the state-value equation defines the relationship between the value of a state and the value of future possible states.

$$v_\pi (s) \doteq \mathbb{E}_\pi [G_t|S_t = s]$$

Recall in a previous article we derived the return as the discounted sum of future rewards:

$$G_t = \sum\limits_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

We also saw that the return at time $t$ can be written recursively as the immediate reward $R_{t+1}$ plus the discounted return at time $t+1$:

$$v_\pi (s) \doteq \mathbb{E}_\pi [R_{t+1} + \gamma G_{t+1} |S_t = s]$$

We can expand this expected return as the sum of possible action choices made by the agent. We then expand over possible rewards and next states condition on state $S$ and action $a$

$$ = \sum\limits_a \pi(a|s) \sum\limits_{s'} \sum_r p(s', r | s, a)\big[r + \gamma \mathbb{E} [G_{t+1} | S_{t+1} = s']\big]$$

We break it down in the order above as the action choice only depends on the current state, while the next state and reward depend only on the current state and action.

The result is a weighted sum of terms that consists of immediate rewards plus expected future rewards from the next state $s'$.

The equation above just shows that the expectation is the sum of possible outcomes weighted by the probability that they will occur.

Note that we could expand this equation indefinitely, although instead of making it overcomplicated we use the expected return as the value function for state $s'$. The only difference is that it is at time $t+1$ instead of time $t$.

From this replacement, we get the Bellman equation for the state-value function:

$$ = \sum\limits_a \pi(a|s) \sum\limits_{s'} \sum_r p(s', r | s, a)[r + \gamma v_\pi s']$$

The value function is so useful in reinforcement learning as it's essentially a stand-in for the average of an infinite number of possible values.

### Action-Value Bellman Equation

The Bellman equation for the action-value function is similar in that it is a recursive equation for the value of a state-action pair of future possible pairs.

$$q_pi \doteq \mathbb{E}[G_t | S_t = s, A_t = a]$$

$$ = \sum\limits_{s'} \sum_r p(s', r | s, a) \big[r + \gamma\mathbb{E}_\pi [G_{t+1} | S_{t+1} = s']\big]$$

Unlike the Bellman equation for the state-value function, however, we also want a recursive equation for the value of one state-action pair in terms of the next state-action pair. Currently, we only have the expected return given the next state.

To account for this, we can express the expected return of the next state as the sum of the agents possible action choices:

$$ = \sum\limits_{s'} \sum_r p(s', r | s, a) \big[r + \gamma \sum\limits_{a'} \pi (a'|s') \mathbb{E}_\pi [G_{t+1} | S_{t+1} = s', A_{t+1} = a']\big]$$

Above we have changed the expectation to be based on both the next state and the next action over the sum of possible actions.

This expected return we have now is the same as the definition of the action-value function for $s'$ and action $a'$.

$$ = \sum\limits_{s'} \sum_r p(s', r | s, a) \big[r + \gamma \sum\limits_{a'} \pi (a'|s') q_{\pi}(s', a')\big]$$

The replacement above gives the Bellman equation for the action-value function.

Next, we'll look at why these Bellman equations are so fundamental in reinforcement learning.

### Bellman Equations and Value Functions

In this section, we'll discuss how we can use Bellman equations to compute value functions.

If we consider an example of an agent trying to solve a simple grid world task—for example with four states and four action spaces—we can use the Bellman equations to directly write down a system of linear equations for the state values and solve the systems to find the exact values.

It is possible to directly solve MDPs of moderate size, although in more complex systems this won't always be practical or possible—for example if we're playing Chess, which has around 10^45 possible states.

Later, we'll discuss several reinforcement learning algorithms that can scale up and solve these larger and more complex tasks.

## Optimal Policies

As mentioned, a policy specifies how an agent behaves in the environment. Given this policy, we then aim to find the value function.

In reinforcement learning, the goal is to find the policy that obtains the most reward possible over the long run.

In this section, we'll define an optimal policy and how to identify one for a given MDP.

### Defining an Optimal Policy

Defining an optimal policy means understanding what makes one better than another.

If we're considering discrete states, however, one policy may be better at a certain time and another at a different time.

We can denote one policy as better than another as $\pi_i \geq \pi_2$.

An optimal policy is defined as the policy with the highest possible value function in all states.

We can denote the optimal policy as $\pi_*$

There may be more than one optimal policy, although there is always at least one optimal deterministic policy.

If one is better than another at a certain time and a second policy is optimal at another time, we can combine these into a third policy that only chooses the policy that has the highest value in the current state.

Since there may be an exponential number of possible policies in more complex environments, below we'll discuss what are referred to as Bellman Optimality Equations.

## Optimal Value Functions

Similar to the concept of optimal policies, optimal value functions for state-value and action-values are key to achieving the goal of reinforcement learning.

In this section we'll derive the Bellman optimality equation for state-value and action-value functions.

Recall that a policy $\pi_1$ is only considered as good or better than $\pi_2$ if the value under $\pi_1 \geq \pi_2$ for all states:

$$\pi_1 \geq \pi_2 \:\text{if and only if} \:v_{\pi_1} (s) \geq v_{\pi_2} (s) \text{for all}\: s \in \mathcal{S}$$

This means the value function for the optimal policy has the greatest possible in every state:

$$v_{\pi _*}\doteq \mathbb{E}[G_t | S_t = s]$$

We can express this mathematically as follows:

$$\max\limits_{\pi} v_{\pi }(s) \:\text{for all} \:S \in \mathcal{S}$$

Optimal polices also share the same optimal action-value function, which is the maximum for evert state action pair:

$$q_{\pi_*}(s, a) = \max\limits_\pi q_\pi (s,a) \:\text{for all}\: s \in \mathcal{S} \:\text{and} \:a \in \mathscr{A}$$

By substituting the optimal value function into the Bellman equation below, we get the Bellman equation for $v*$

$$ v*(s) = \sum\limits_a \pi_*(a|s) \sum\limits_{s'} \sum_r p(s', r | s, a)[r + \gamma v_* s']$$

We can rewrite this in a special form that doesn't reference itself as follows:

$$ v*(s) = \max\limits_a \sum\limits_{s'} \sum\limits_r p(s', r|s,a [r + \gamma v*(s')])$$

Notice that $\pi*$ no longer exists in the equation, meaning we've derived an relationship that applies directly to the value function $v*$.

This is referred to as the Bellman optimality equation for $v*$.

We can make the same substitution in the Bellman equation for the action value:

$$ q*(s,a)= \sum\limits_{s'} \sum_r p(s', r | s, a) \big[r + \gamma \max\limits_{a'} q* (a'|s')\big]$$

This gives us the Bellman optimality equation for $q*$.

Next, we'll look at how we can use the optimal value function $v*$ to obtain the optimal policy $\pi*$.

## Using Optimal Value Functions to Find Optimal Policies

As mentioned, the ultimate goal of finding the optimal value function is to find the optimal policy.

In this section, we'll look at the connection between optimal value functions and optimal policies.

We just derived the optimal value function $v*$, so we'll assume that we have already determined that for a given environment.

As long as we have the optimal value function $v*$ and the dynamics function $p$, we can work out the optimal policy as follows:

$$\pi_*(s) = \underset{a}{\arg\max}\sum\limits_{s'}\sum\limits_r p(s', r, |s, a) [r + \gamma v_*(s')]$$

For any state, we can evaluate which action over this term $\sum\limits_{s'}\sum\limits_r p(s', r, |s, a) [r + \gamma v_*(s')]$ will obtain the maximum possible action.

This is similar to $v*$, which is the maximum of $\sum\limits_{s'}\sum\limits_r p(s', r, |s, a) [r + \gamma v_*(s')]$ over all actions.

The optimal policy $\pi_*$ is the $\underset{a}{\arg\max}$, which means it is the specific action that achieves this maximum.

In summary, if we have access to $p$, we can determine the optimal action by computing the right hand side of the Bellman optimality equation $\sum\limits_{s'}\sum\limits_r p(s', r, |s, a) [r + \gamma v_*(s')]$ for each action and finding the largest value.

If we have access to $q*$, it's even easier to determine the optimal policy:

$$\pi_* = \underset{a}{\arg\max} q*(s, a)$$

In this case, we can select any action $a$ that maximizes $q*(s, a)$. From this, we can see that the problem of finding an optimal action-value function corresponds with finding an optimal policy.

## Summary: Value Functions & Bellman Equations

In this article, we discussed policies, value functions, and Bellman equations.

In summary, policies tell an agent how to behave in an environment. A policy depends only on the current state, meaning the state should provide the agent with all the necessary information to choose an action.

Deterministic policies map each state to an action, whereas stochastic policies map each state to a distribution of actions.

Value functions estimate future return under a certain policy. Value functions aggregate many possible future rewards into a single number.

The state-value function is the expected return of the current state under a policy. The action-value function provides the expected return from state $s$ if the agent selects action $a$ and follows policy $\pi$ after that.

Bellman equations define a relationship between the value of a state, or state-action pair, and its possible successors.

The Bellman equations can be directly solved to find the value function and help us evaluate polices.

An optimal policy is defined as the policy that achieves the highest value possible in each state. In order to find the optimal policy, we defined the optimal value function and the Bellman's optimality equations.