One of the most exciting areas of applied AI research is in the field of deep reinforcement learning for trading.

As we'll see in this article, because trading and investing are iterative processes, deep reinforcement learning likely has huge potential in finance.

Trading is a constant process of testing new ideas, receiving feedback from the market in the form of profit/loss, and trying to optimize your strategy over time.

This trial-and-error approach to decision making is exactly what reinforcement learning attempts to solve, and it's been referred to as *"the computational science of decision making"*.

In this article we'll take a look at the available research, papers, and open-source repositories to get a better understanding of deep reinforcement learning and trading.

In particular, the topics we will cover include:

- What is Reinforcement Learning?
- Introduction to Reinforcement Learning for Trading
- Introduction to Q-Learning
- J.P. Morgan's Guide to Reinforcement Learning

**1. What is Reinforcement Learning?**

If you want to read a complete guide to reinforcement learning, I recommend our article: What is Reinforcement Learning? A Complete Guide for Beginners.

In this article, we'll just summarize the RL framework.

**Reinforcement Learning is a framework for an agent learning to operate in an uncertain environment through interaction.**

Let's break reinforcement learning down step-by-step:

- We have an **agent**, who is our decision-maker/learner
- The agent operates in an **environment**
- As the agent takes **actions**, the environment provides feedback in the form of **rewards**
- Along with each reward, the agent gets a new observation and must then select another action at the next time step
- The observation is called a **state**
- Since the problem needs to be solved now, but the rewards come in the future, we need to define a decision policy, which is essentially our strategy for maximizing long-term expected reward

To summarize, reinforcement learning is essentially a framework for a feedback loop of state -> action -> reward provided by the environment.
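The loop can be sketched in a few lines of Python. This is a toy illustration only: the `CoinFlipEnv` environment, the `repeat_last` policy, and the `run_episode` helper are hypothetical names invented here, not part of any library or of the repositories discussed later.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: the agent guesses the next coin flip."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.state = self.rng.randint(0, 1)

    def step(self, action):
        # Reward is 1 when the action matches the next flip, else 0.
        next_flip = self.rng.randint(0, 1)
        reward = 1 if action == next_flip else 0
        self.state = next_flip
        return self.state, reward


def run_episode(env, policy, n_steps):
    """Run the state -> action -> reward loop and return the total reward."""
    state = env.state
    total_reward = 0
    for _ in range(n_steps):
        action = policy(state)            # agent chooses an action from the state
        state, reward = env.step(action)  # environment returns new state and reward
        total_reward += reward
    return total_reward


def repeat_last(state):
    # A deliberately simple policy: bet that the last flip repeats.
    return state


total = run_episode(CoinFlipEnv(seed=0), repeat_last, n_steps=100)
print("Total reward over 100 steps:", total)
```

A learning agent would replace `repeat_last` with a policy that is adjusted using the rewards it collects.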

The presence of a feedback loop from an environment is unique to the RL framework, as these loops are not found in supervised or unsupervised learning.

The goal of the agent is thus to **maximize expected cumulative reward.**

So the agent is looking for a sequence of actions whose expected cumulative reward is high.

Specifically, we want our agent to learn a **policy**, which the agent can use to perform actions and maximize its rewards given certain circumstances.

Since rewards arrive over time, we also have a discount factor - *ɣ* - which determines the importance of future rewards. A discount factor of 0 tells the agent to consider only immediate rewards, while a discount factor of 1 tells it to weigh long-term rewards as heavily as immediate ones.
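To make the effect of the discount factor concrete, here is a minimal sketch that computes the discounted sum of a short reward sequence. The function name `discounted_return` and the sample rewards are assumptions chosen for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward at step t is weighted by gamma**t."""
    total = 0.0
    # Walk backwards so each step applies G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


rewards = [1.0, 1.0, 1.0]
myopic = discounted_return(rewards, gamma=0.0)       # only the first reward counts
in_between = discounted_return(rewards, gamma=0.5)   # later rewards count less
far_sighted = discounted_return(rewards, gamma=1.0)  # all rewards count equally
print(myopic, in_between, far_sighted)
```

With three rewards of 1.0, the three settings give 1.0, 1.75, and 3.0 respectively.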

From our Guide to Reinforcement Learning:

It is the powerful combination of pattern-recognition networks and real-time environment based learning frameworks called deep reinforcement learning that makes this such an exciting area of research.

The *deep* part of Deep Reinforcement Learning is a more advanced implementation in which we use a deep neural network to approximate the value of states and actions, rather than storing a value for every possible combination.

**2. Introduction to Reinforcement Learning for Trading**

There are two types of tasks that an agent can attempt to solve in reinforcement learning:

- Episodic Tasks - which are tasks that end at some time step *T*
- Continuing Tasks - which are tasks where the interaction continues without an end-point

Since the markets never really have an end-point, **trading is a continuing task.**

Also, we are dealing with other agents (traders) in the market whose details we can't observe (things like account size, open orders, etc.).

**This makes trading a partially observable Markov Decision Process.**

A partially observable MDP is where we don't know what the *true state* looks like, but we can observe part of it (our P&L).

Because it is partially observable and we don't know the full state, we also don't know what the **reward function** and **transition probability** look like.

If we knew these 2 variables we would use Dynamic Programming to compute the optimal policy.

Since we don't in the case of trading, we can instead use a **model-free reinforcement learning algorithm** like Q-Learning.

**3. Q-Learning**

Q-Learning allows us to compute a policy without needing to build a full model of our environment.

In Q-Learning, the possible states and actions are represented by a Q-table, and the equation for how these values are updated is shown below from this article:
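In its standard textbook form, the update can be written as follows. The symbols $\alpha$ (learning rate) and $r_{t+1}$ (reward received after taking action $a_t$) are assumptions here, since the text does not name them; $\gamma$ is the discount factor introduced earlier.

$$
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]
$$

The bracketed term is the gap between the current estimate and a new target built from the observed reward plus the best estimated value of the next state.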

A Q-table is where the states are rows, and actions are columns, and it helps us find the best action to take for each state.

$Q(s_t, a_t)$ represents the maximum discounted future reward when we perform action $a_t$ in state $s_t$ and continue optimally from then on.

We can think of this function as the maximum possible account balance we can achieve at the end of a training episode after we perform action $a$ in state $s$.

In the case of trading the possible actions are:

- Buy
- Sell
- Hold

The Q function rates each of the possible actions, and the agent picks the one that has the highest Q value.

**Q-Learning is the process of learning what the Q-table is, without needing to learn the reward function or the transition probability.**
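As a rough sketch of the idea, the snippet below keeps a Q-table as a list of rows (states) by columns (actions), applies one standard Q-learning update, and then picks the greedy action. The number of states, the learning rate `alpha`, and the sample transition are all hypothetical values chosen for illustration.

```python
ACTIONS = ["buy", "sell", "hold"]


def q_update(q_table, state, action, reward, next_state, alpha=0.1, gamma=0.95):
    """Apply one Q-learning update in place and return the new Q value."""
    best_next = max(q_table[next_state])  # value of acting optimally afterwards
    td_target = reward + gamma * best_next
    q_table[state][action] += alpha * (td_target - q_table[state][action])
    return q_table[state][action]


def greedy_action(q_table, state):
    """Return the index of the action with the highest Q value in this state."""
    row = q_table[state]
    return max(range(len(row)), key=lambda a: row[a])


# Hypothetical setup: 4 discrete market states, 3 actions, all Q values start at 0.
q_table = [[0.0] * len(ACTIONS) for _ in range(4)]

# One observed transition: in state 0 we bought, earned a reward of 1.0,
# and ended up in state 1.
new_value = q_update(q_table, state=0, action=0, reward=1.0, next_state=1)
best = ACTIONS[greedy_action(q_table, 0)]
print(new_value, best)
```

Note that the update only needs observed transitions; it never consults a reward function or transition probabilities.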

Let's now look at 2 Github repos on this topic:

**Q-Trader**

Let's look at an example of using deep reinforcement learning for trading from this Q-Trader Github repository.

The model is...

An implementation of Q-learning applied to (short-term) stock trading. The model uses n-day windows of closing prices to determine if the best action to take at a given time is to buy, sell or sit.

As a result of the short-term state representation, the model is not very good at making decisions over long-term trends, but is quite good at predicting peaks and troughs.
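To give a feel for what a short-term, n-day window state can look like, here is a minimal sketch. The helper name `window_state`, the sigmoid squashing of day-over-day price changes, and the padding with the first price are assumptions made for illustration, and may differ from what the repository actually does.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def window_state(prices, t, n):
    """Return n - 1 squashed day-over-day changes for the window ending at index t."""
    start = t - n + 1
    if start >= 0:
        block = prices[start:t + 1]
    else:
        # Not enough history yet: pad the front with the first price.
        block = [prices[0]] * (-start) + prices[:t + 1]
    return [sigmoid(block[i + 1] - block[i]) for i in range(n - 1)]


closes = [100.0, 101.0, 100.5, 102.0, 103.0]
state = window_state(closes, t=4, n=4)        # uses the last four closes
early_state = window_state(closes, t=0, n=4)  # padded, so all changes are zero
print(state, early_state)
```

Because the state only contains recent price changes, the agent has no view of anything older than the window, which is consistent with the weakness on long-term trends described above.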

Let's take a look at the agent.py file:

```
import keras
from keras.models import Sequential
from keras.models import load_model
from keras.layers import Dense
from keras.optimizers import Adam

import numpy as np
import random
from collections import deque

class Agent:
    def __init__(self, state_size, is_eval=False, model_name=""):
        self.state_size = state_size # normalized previous days
        self.action_size = 3 # sit, buy, sell
        self.memory = deque(maxlen=1000) # replay buffer of past transitions
        self.inventory = []
        self.model_name = model_name
        self.is_eval = is_eval

        self.gamma = 0.95 # discount factor
        self.epsilon = 1.0 # initial exploration rate
        self.epsilon_min = 0.01
        self.epsilon_decay = 0.995

        self.model = load_model("models/" + model_name) if is_eval else self._model()

    def _model(self):
        # Small feed-forward network that maps a state to one Q value per action
        model = Sequential()
        model.add(Dense(units=64, input_dim=self.state_size, activation="relu"))
        model.add(Dense(units=32, activation="relu"))
        model.add(Dense(units=8, activation="relu"))
        model.add(Dense(self.action_size, activation="linear"))
        model.compile(loss="mse", optimizer=Adam(lr=0.001))

        return model

    def act(self, state):
        # Epsilon-greedy: explore randomly during training, otherwise act greedily
        if not self.is_eval and np.random.rand() <= self.epsilon:
            return random.randrange(self.action_size)

        options = self.model.predict(state)
        return np.argmax(options[0])

    def expReplay(self, batch_size):
        # Take the most recent transitions (batch_size - 1 of them as written)
        mini_batch = []
        l = len(self.memory)
        for i in xrange(l - batch_size + 1, l):
            mini_batch.append(self.memory[i])

        for state, action, reward, next_state, done in mini_batch:
            # Q-learning target: reward plus discounted best next-state value
            target = reward
            if not done:
                target = reward + self.gamma * np.amax(self.model.predict(next_state)[0])

            target_f = self.model.predict(state)
            target_f[0][action] = target
            self.model.fit(state, target_f, epochs=1, verbose=0)

        # Reduce exploration after each replay call
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay
```
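The exploration settings in the listing above (`epsilon = 1.0`, `epsilon_min = 0.01`, `epsilon_decay = 0.995`) determine how quickly the agent shifts from random actions to greedy ones. The short sketch below, with the hypothetical helper name `steps_until_min`, counts how many decay steps it takes for epsilon to stop being above its floor under the same multiplicative rule.

```python
def steps_until_min(epsilon=1.0, epsilon_min=0.01, epsilon_decay=0.995):
    """Count decay steps until epsilon is no longer above epsilon_min."""
    steps = 0
    while epsilon > epsilon_min:
        epsilon *= epsilon_decay
        steps += 1
    return steps, epsilon


steps, final_epsilon = steps_until_min()
print("Decay steps:", steps, "final epsilon:", final_epsilon)
```

With these values the loop runs 919 times, so exploration is close to its minimum after roughly that many replay calls.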

We can then train our agent using this script:

```
from agent.agent import Agent
from functions import *
import sys

if len(sys.argv) != 4:
    print "Usage: python train.py [stock] [window] [episodes]"
    exit()

stock_name, window_size, episode_count = sys.argv[1], int(sys.argv[2]), int(sys.argv[3])

agent = Agent(window_size)
data = getStockDataVec(stock_name)
l = len(data) - 1
batch_size = 32

for e in xrange(episode_count + 1):
    print "Episode " + str(e) + "/" + str(episode_count)
    state = getState(data, 0, window_size + 1)

    total_profit = 0
    agent.inventory = []

    for t in xrange(l):
        action = agent.act(state)

        # sit
        next_state = getState(data, t + 1, window_size + 1)
        reward = 0

        if action == 1: # buy
            agent.inventory.append(data[t])
            print "Buy: " + formatPrice(data[t])

        elif action == 2 and len(agent.inventory) > 0: # sell
            bought_price = agent.inventory.pop(0) # sell the oldest position first
            reward = max(data[t] - bought_price, 0) # losses are clipped to zero reward
            total_profit += data[t] - bought_price
            print "Sell: " + formatPrice(data[t]) + " | Profit: " + formatPrice(data[t] - bought_price)

        done = True if t == l - 1 else False
        agent.memory.append((state, action, reward, next_state, done))
        state = next_state

        if done:
            print "--------------------------------"
            print "Total Profit: " + formatPrice(total_profit)
            print "--------------------------------"

        # Train on recent experience once enough transitions are stored
        if len(agent.memory) > batch_size:
            agent.expReplay(batch_size)

    # Save a checkpoint every 10 episodes
    if e % 10 == 0:
        agent.model.save("models/model_ep" + str(e))
```

In order to test this agent, we download training and test CSV files from Yahoo! Finance into `data/`.

We then train the agent on Facebook (FB). The training period spans 4 years, from Mar. 27, 2014 to Mar. 27, 2018, and the testing period is 1 year, from Mar. 27, 2018 to Mar. 27, 2019.

Since the Github repo uses Python 2, we will need to update the `print` statements to the Python 3 function format. We will also need to change `xrange()` to `range()`, since it was renamed in Python 3.
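As a small illustration of those two changes, here is the episode loop header from the training script rewritten for Python 3. The helper `episode_banner` is a hypothetical name introduced only to keep the example self-contained.

```python
def episode_banner(e, episode_count):
    """Build the progress string that the training script prints per episode."""
    return "Episode " + str(e) + "/" + str(episode_count)


episode_count = 2
banners = []
for e in range(episode_count + 1):  # Python 2: xrange(episode_count + 1)
    banner = episode_banner(e, episode_count)
    print(banner)                   # Python 2: print banner
    banners.append(banner)
```

The same pattern applies to every `print` statement and `xrange()` call in both files.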

To train the model we will use the following commands for FB, training it on a window size of 10, and 200 episodes:

```
mkdir models
python train.py FB_train 10 200
```

After the training phase we evaluate the model with:

`python evaluate.py FB_test model_ep200`

This test ends with a small loss, but it is still a good starting point for understanding deep reinforcement learning in trading.

**Q-Learning for Trading**

Let's now look at Siraj Raval's video on Q-Learning for Trading, which uses code from ShuaiW's repo - the author also wrote a post to accompany the repo.

To summarize this repo, here is how the author formulated the problem:

**State**

- At any given point, the state is represented as an array of [# of stock owned, current stock prices, cash in hand].
- For example, if we have 50 shares of FB at $165, 40 shares of Amazon at $1700, and $10,000 cash on hand, the state array would be [50, 40, 165, 1700, 10000].

**Action**

- We have three possible actions: BUY, SELL, or HOLD

**Reward**

- There are several ways this is formulated, although the one that is chosen is: +/- $ amount of current value compared with previous step
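The state and reward described above can be sketched as follows. The function names are hypothetical, and the reward here assumes holdings and cash stay fixed between the two steps (a HOLD), which is a simplification of whatever the author's environment does after a BUY or SELL.

```python
def build_state(shares, prices, cash):
    """Concatenate [shares owned..., current prices..., cash in hand]."""
    return list(shares) + list(prices) + [cash]


def portfolio_value(shares, prices, cash):
    """Market value of all holdings plus cash."""
    return sum(n * p for n, p in zip(shares, prices)) + cash


def step_reward(shares, prev_prices, new_prices, cash):
    """Change in portfolio value between two consecutive steps."""
    return portfolio_value(shares, new_prices, cash) - portfolio_value(shares, prev_prices, cash)


shares = [50, 40]      # FB, Amazon
prices = [165, 1700]
cash = 10000

state = build_state(shares, prices, cash)
value = portfolio_value(shares, prices, cash)
reward = step_reward(shares, prices, [166, 1690], cash)
print(state, value, reward)
```

For the example in the text, the portfolio is worth $86,250, and a move to prices of 166 and 1690 produces a reward of -350.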

To test this agent the author uses three stocks: MSFT, IBM, and QCOM.

The period is from January 3rd, 2000 to December 27, 2017, using daily close prices.

4629 days of data are used for training while the last 1000 days are used for testing, and the Deep Q Network is trained for 2000 epochs.

From the author's results below, we can see the portfolio values are incredibly volatile:

Of course this variance is far too high and cannot be ignored, but this provides another solid base to build off in order to continue researching this topic.

In order to improve our own system we could also combine the RL algorithm with other features that we engineer, such as company news, performance, etc. - in both examples we used stock price as our only feature.

**4. J.P. Morgan's Guide to Reinforcement Learning**

If you want to read more about practical applications of reinforcement learning in finance check out J.P. Morgan's new paper: Idiosyncrasies and challenges of data driven learning in electronic trading.

The report was presented at the NIPS conference in December 2018, but has only recently been made public.

Here's the outline of the paper:

We outline the idiosyncrasies of neural information processing and machine learning in quantitative finance. We also present some of the approaches we take towards solving the fundamental challenges we face.

In addition to discussing supervised and unsupervised learning in finance, this paper:

shows the interplay between the agent’s constraints and rewards in one practical application of reinforcement learning.

The paper also discusses inverse reinforcement learning (IRL), which is the field of study that focuses on learning an agent’s objectives, values, or rewards by observing its behavior.

We also believe that inverse reinforcement learning is very promising: leveraging the massive history of rollouts of human and algo policies on financial markets in order to build local rewards is an active field of research.

The paper also mentions several open source reinforcement learning frameworks that you can make use of, including: OpenAI baselines, dopamine, deepmind/trfl and Ray RLlib.

**5. Summary: Deep Reinforcement Learning for Trading**

In this guide we looked at how we can apply the Q-learning algorithm to a continuing reinforcement learning task: trading.

To the best of my knowledge, mature commercial reinforcement learning trading applications aren't yet available - and this makes sense since the stable convergence of an RL system is still a hot topic in academic research.

Of course quantifying the financial markets is no easy pursuit, but if you would like to learn more about the topic I recommend the following resources.

**Further Resources**

**Github:**

**YouTube:**

**Articles:**

**Papers:**

- Reinforcement Learning for Trading
- Financial Trading as a Game: A Deep Reinforcement Learning Approach
- Reinforcement Learning For Automated Trading

**Kaggle:**