One of the most exciting areas of applied AI research is deep reinforcement learning for trading.
As we'll see in this article, because trading and investing are iterative processes, deep reinforcement learning likely has huge potential in finance.
Trading is a constant process of testing new ideas, receiving feedback from the market in the form of profit/loss, and trying to optimize your strategy over time.
This trial-and-error approach to decision making is exactly what reinforcement learning attempts to solve, and it's been referred to as "the computational science of decision making".
In this article we'll take a look at the available research, papers, and open-source repositories to get a better understanding of deep reinforcement learning and trading.
In particular, the topics we will cover include:
- What is Reinforcement Learning?
- Introduction to Reinforcement Learning for Trading
- Introduction to Q-Learning
- J.P. Morgan's Guide to Reinforcement Learning
1. What is Reinforcement Learning?
If you want to read a complete guide to reinforcement learning, I recommend our article: What is Reinforcement Learning? A Complete Guide for Beginners.
In this article, we'll just summarize the RL framework.
Reinforcement Learning is a framework for an agent learning to operate in an uncertain environment through interaction.
Let's break reinforcement learning down step-by-step:
- We have an agent, who is our decision-maker/learner
- The agent operates in an environment
- As the agent takes actions, the environment provides feedback in the form of a reward
- Along with each reward, the agent gets a new observation and must then select another action at the next time step
- The observation is called a state
- Since decisions need to be made now, but the rewards come in the future, we need to define a decision policy - which is essentially our strategy for maximizing long-term expected reward.
To summarize, reinforcement learning is essentially a framework for a feedback loop of state -> action -> reward provided by the environment.
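The loop described above can be sketched in a few lines of Python. This is only a toy illustration: the `CoinFlipMarket` environment, its random price moves, and the `random_policy` function are hypothetical stand-ins, not part of any library or of the repos discussed later.

```python
import random


class CoinFlipMarket:
    """Toy environment (hypothetical): the price moves up or down by 1 each step."""

    def __init__(self, start_price=100, seed=0):
        self.rng = random.Random(seed)
        self.price = start_price

    def step(self, action):
        """Apply an action (+1 long, 0 flat, -1 short) and return (state, reward)."""
        move = self.rng.choice([-1, 1])
        self.price += move
        reward = action * move  # profit/loss of the position over this step
        return self.price, reward


def random_policy(state, rng):
    """Placeholder policy: ignores the state and picks an action at random."""
    return rng.choice([-1, 0, 1])


env = CoinFlipMarket()
policy_rng = random.Random(1)
state = env.price
total_reward = 0
for t in range(10):
    action = random_policy(state, policy_rng)  # state -> action
    state, reward = env.step(action)           # action -> reward and next state
    total_reward += reward

print("cumulative reward:", total_reward)
```

A learning agent would replace `random_policy` with a policy that is updated from the observed rewards.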
The presence of a feedback loop from an environment is unique to the RL framework, as these loops are not found in supervised or unsupervised learning.
The goal of the agent is thus to maximize expected cumulative reward.
So the agent is looking for a sequence of actions with a high expected cumulative reward.
Specifically, we want our agent to learn a policy, which it can use to choose actions that maximize its rewards in a given state.
Since rewards are spread out over time, we also have a discount factor, γ, which determines the importance of future rewards. A discount factor of 0 tells the agent to consider only immediate rewards, while a discount factor close to 1 tells the agent to weight long-term rewards almost as heavily as immediate ones.
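To make the effect of the discount factor concrete, here is a small sketch that computes the discounted sum of a fixed reward sequence for a few values of γ. The function name `discounted_return` and the reward values are chosen purely for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward k steps ahead is weighted by gamma**k."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))


rewards = [1.0, 1.0, 1.0, 1.0]

print(discounted_return(rewards, 0.0))  # 1.0: only the immediate reward counts
print(discounted_return(rewards, 0.5))  # 1.875 = 1 + 0.5 + 0.25 + 0.125
print(discounted_return(rewards, 1.0))  # 4.0: all rewards count equally
```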
From our Guide to Reinforcement Learning:
It is the powerful combination of pattern-recognition networks and real-time environment based learning frameworks called deep reinforcement learning that makes this such an exciting area of research.
The deep part of Deep Reinforcement Learning is a more advanced implementation in which we use a deep neural network to approximate the value of the possible actions in each state.
2. Introduction to Reinforcement Learning for Trading
There are two types of tasks that an agent can attempt to solve in reinforcement learning:
- Episodic Tasks - which are tasks that end at some time step T
- Continuing Tasks - which are tasks where the interaction continues without an end-point
Since the markets never really have an end-point, trading is a continuing task.
Also, since we are dealing with other agents (traders) in the market, there are things we can't observe, such as their account sizes and open orders.
This makes trading a partially observable Markov Decision Process (MDP).
A partially observable MDP is one where we don't know what the true state looks like, but we can observe part of it (our P&L).
Because it is partially observable and we don't know the full state, we also don't know what the reward function and transition probabilities look like.
If we knew these two quantities, we could use Dynamic Programming to compute the optimal policy.
Since we don't in the case of trading, we can instead use a model-free reinforcement learning algorithm like Q-Learning.
Q-Learning allows us to compute a policy without needing to build a full model of our environment.
In Q-Learning, the possible states and actions are represented by a Q-table, and the equation for how these values are updated is shown below from this article:
A Q-table is where the states are rows, and actions are columns, and it helps us find the best action to take for each state.
$Q(s_t, a_t)$ represents the maximum discounted future reward when we perform action $a_t$ in state $s_t$ and continue optimally from then on.
We can think of this function as the maximum possible account balance we can achieve at the end of a training episode after we perform action $a$ in state $s$.
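For reference, the Q-learning update in its usual textbook form can be written as follows. The learning rate $\alpha$ is a symbol not otherwise used in this article, $\gamma$ is the discount factor introduced earlier, and $r_{t+1}$ denotes the reward received after taking action $a_t$ in state $s_t$.

$$
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]
$$

The term in brackets is the gap between the current estimate $Q(s_t, a_t)$ and a new estimate built from the observed reward plus the discounted value of acting optimally from the next state onward.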
In the case of trading, the possible actions are to buy, sell, or hold (sit).
The Q function rates each of the possible actions, and the agent picks the one with the highest Q value.
Q-Learning is the process of learning what the Q-table is, without needing to learn the reward function or the transition probability.
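As a minimal sketch of what learning a Q-table looks like in code, the example below stores the table as a list of rows (states) and columns (actions) and applies one textbook Q-learning update. The table size, the learning rate `alpha`, and the sample transition are assumptions made for illustration.

```python
n_states, n_actions = 4, 3  # hypothetical sizes; actions: 0 = sit, 1 = buy, 2 = sell
q_table = [[0.0] * n_actions for _ in range(n_states)]  # rows = states, columns = actions


def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.95):
    """Apply one Q-learning update in place from a single observed transition."""
    best_next = max(q[next_state])          # value of acting optimally afterwards
    td_target = reward + gamma * best_next  # new estimate of Q(state, action)
    q[state][action] += alpha * (td_target - q[state][action])


def greedy_action(q, state):
    """Return the index of the action with the highest Q-value in this state."""
    row = q[state]
    return max(range(len(row)), key=row.__getitem__)


# One observed transition: in state 0, action 1 (buy) earned a reward of 2.0
q_update(q_table, state=0, action=1, reward=2.0, next_state=1)
print(q_table[0][1])              # 0.2
print(greedy_action(q_table, 0))  # 1
```

Note that the update only needs observed transitions; it never references the reward function or the transition probabilities directly.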
Let's now look at two GitHub repos on this topic.
The first example of using deep reinforcement learning for trading comes from this Q-Trader GitHub repository.
The model is...
An implementation of Q-learning applied to (short-term) stock trading. The model uses n-day windows of closing prices to determine if the best action to take at a given time is to buy, sell or sit.
As a result of the short-term state representation, the model is not very good at making decisions over long-term trends, but is quite good at predicting peaks and troughs.
Let's take a look at the agent.py file:
```python
import random
from collections import deque

import numpy as np
from keras.models import Sequential, load_model
from keras.layers import Dense
from keras.optimizers import Adam


class Agent:
    def __init__(self, state_size, is_eval=False, model_name=""):
        self.state_size = state_size  # normalized previous days
        self.action_size = 3  # sit, buy, sell
        self.memory = deque(maxlen=1000)  # replay buffer of past transitions
        self.inventory = []  # purchase prices of the shares currently held
        self.model_name = model_name
        self.is_eval = is_eval

        self.gamma = 0.95  # discount factor
        self.epsilon = 1.0  # exploration rate, starts fully random
        self.epsilon_min = 0.01
        self.epsilon_decay = 0.995

        # Load a saved model for evaluation, otherwise build a fresh network
        self.model = load_model("models/" + model_name) if is_eval else self._model()

    def _model(self):
        # Small fully connected network: state in, one Q-value per action out
        model = Sequential()
        model.add(Dense(units=64, input_dim=self.state_size, activation="relu"))
        model.add(Dense(units=32, activation="relu"))
        model.add(Dense(units=8, activation="relu"))
        model.add(Dense(self.action_size, activation="linear"))
        # The repo passes lr=0.001; newer Keras versions name it learning_rate
        model.compile(loss="mse", optimizer=Adam(learning_rate=0.001))
        return model

    def act(self, state):
        # Epsilon-greedy: explore at random during training, otherwise act greedily
        if not self.is_eval and np.random.rand() <= self.epsilon:
            return random.randrange(self.action_size)

        options = self.model.predict(state)
        return np.argmax(options[0])

    def expReplay(self, batch_size):
        # Replay the most recent transitions stored in memory
        mini_batch = []
        l = len(self.memory)
        for i in range(l - batch_size + 1, l):  # range() replaces Python 2's xrange()
            mini_batch.append(self.memory[i])

        for state, action, reward, next_state, done in mini_batch:
            # Q-learning target: reward plus discounted best next Q-value
            target = reward
            if not done:
                target = reward + self.gamma * np.amax(self.model.predict(next_state)[0])

            target_f = self.model.predict(state)
            target_f[0][action] = target
            self.model.fit(state, target_f, epochs=1, verbose=0)

        # Gradually shift from exploration to exploitation
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay
```
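To get a feel for the exploration schedule implied by these hyperparameters, the short calculation below counts how many times `expReplay` has to run before epsilon decays from 1.0 to its floor of 0.01. It only mirrors the decay rule at the end of `expReplay`; the variable `replay_calls` is introduced here for illustration.

```python
epsilon, epsilon_min, epsilon_decay = 1.0, 0.01, 0.995  # values from the Agent class

replay_calls = 0
while epsilon > epsilon_min:  # same condition as at the end of expReplay
    epsilon *= epsilon_decay
    replay_calls += 1

print(replay_calls)       # 919
print(round(epsilon, 4))  # 0.01
```

So under these settings epsilon stops decaying after roughly nine hundred replay calls, after which the agent picks a random action only about 1% of the time during training.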
We can then train our agent using this script:
```python
import sys

from agent.agent import Agent
from functions import *

if len(sys.argv) != 4:
    print("Usage: python train.py [stock] [window] [episodes]")
    sys.exit()

stock_name, window_size, episode_count = sys.argv[1], int(sys.argv[2]), int(sys.argv[3])

agent = Agent(window_size)
data = getStockDataVec(stock_name)
l = len(data) - 1
batch_size = 32

for e in range(episode_count + 1):  # range() replaces Python 2's xrange()
    print("Episode " + str(e) + "/" + str(episode_count))
    state = getState(data, 0, window_size + 1)

    total_profit = 0
    agent.inventory = []

    for t in range(l):
        action = agent.act(state)

        # sit (default): no trade and no reward
        next_state = getState(data, t + 1, window_size + 1)
        reward = 0

        if action == 1:  # buy
            agent.inventory.append(data[t])
            print("Buy: " + formatPrice(data[t]))

        elif action == 2 and len(agent.inventory) > 0:  # sell
            bought_price = agent.inventory.pop(0)
            reward = max(data[t] - bought_price, 0)  # losses are clipped to zero
            total_profit += data[t] - bought_price
            print("Sell: " + formatPrice(data[t]) + " | Profit: " + formatPrice(data[t] - bought_price))

        done = t == l - 1
        agent.memory.append((state, action, reward, next_state, done))
        state = next_state

        if done:
            print("--------------------------------")
            print("Total Profit: " + formatPrice(total_profit))
            print("--------------------------------")

        # Train on recent experience once enough transitions are stored
        if len(agent.memory) > batch_size:
            agent.expReplay(batch_size)

    # Save a checkpoint every 10 episodes
    if e % 10 == 0:
        agent.model.save("models/model_ep" + str(e))
```
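The training script imports `getState` from the repo's functions module, which is not reproduced in this article. As a rough sketch of how an n-day window state could be built, assume the state is the sigmoid of consecutive closing-price differences. The name `window_state` and the left-padding rule are illustrative assumptions, not necessarily the repo's exact code.

```python
import math


def sigmoid(x):
    """Squash a price change into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def window_state(prices, t, n):
    """Return n - 1 sigmoid-squashed price changes for the n prices ending at index t."""
    start = t - n + 1
    if start >= 0:
        block = prices[start:t + 1]
    else:
        # Not enough history yet: pad on the left with the first price
        block = [prices[0]] * (-start) + prices[:t + 1]
    return [sigmoid(block[i + 1] - block[i]) for i in range(n - 1)]


prices = [100.0, 101.0, 100.5, 102.0, 103.0]
state = window_state(prices, t=4, n=4)  # built from price changes -0.5, +1.5, +1.0
print(len(state))                       # 3
print(window_state(prices, t=0, n=4))   # [0.5, 0.5, 0.5]: no history, so no change
```

Because the training script calls `getState(data, t, window_size + 1)`, a window size of 10 would give a state with 10 entries under this sketch.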
To test this agent, we download training and test CSV files from Yahoo! Finance.
We then train the agent on Facebook (FB). The training period spans 4 years, from Mar. 27, 2014 to Mar. 27, 2018, and the testing period is the following year, from Mar. 27, 2018 to Mar. 27, 2019.
Since the GitHub repo was written for Python 2, it has to be updated to run on Python 3: `xrange()` becomes `range()`, since it was renamed in Python 3, and `print` statements become `print()` calls.
To train the model we will use the following commands for FB, training it on a window size of 10, and 200 episodes:
```shell
mkdir models
python train.py FB_train 10 200
```
After the training phase we evaluate the model with:
```shell
python evaluate.py FB_test model_ep200
```
The test ended with a small loss, but this is still a good starting point for understanding deep reinforcement learning in trading.
Q-Learning for Trading
To summarize this repo, here is how the author formulated the problem:
- At any given point, the state is represented as an array of [# of stock owned, current stock prices, cash in hand].
- For example, if we have 50 shares of FB at $165 and 40 shares of Amazon at $1700, and $10,000 cash on hand, the state array would be [50, 40, 165, 1700, 10000].
- We have three possible actions: BUY, SELL, or HOLD
- There are several ways the reward could be formulated, although the one that is chosen is: +/- $ amount of current portfolio value compared with the previous step
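The formulation above can be sketched as follows, using the example numbers from the list. The helper names `build_state` and `portfolio_value`, as well as the next-step prices, are hypothetical and only meant to illustrate the state layout and a value-change reward.

```python
def build_state(shares, prices, cash):
    """State layout: [shares owned per stock..., price per stock..., cash in hand]."""
    return list(shares) + list(prices) + [cash]


def portfolio_value(shares, prices, cash):
    """Total value of all holdings plus cash."""
    return sum(n * p for n, p in zip(shares, prices)) + cash


shares = [50, 40]      # FB, Amazon
prices = [165, 1700]
cash = 10000
print(build_state(shares, prices, cash))  # [50, 40, 165, 1700, 10000]

# Reward as the change in portfolio value between two steps (holding, no trades)
next_prices = [170, 1690]  # assumed prices at the next step
reward = portfolio_value(shares, next_prices, cash) - portfolio_value(shares, prices, cash)
print(reward)  # -150
```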
To test this agent the author uses three stocks: MSFT, IBM, and QCOM.
The period is from January 3rd, 2000 to December 27, 2017, using daily close prices.
4629 days of data are used for training while the last 1000 days are used for testing, and the Deep Q Network is trained for 2000 epochs.
From the author's results below, we can see that the portfolio values are incredibly volatile:
Of course this variance is far too high and cannot be ignored, but this provides another solid base to build off in order to continue researching this topic.
In order to improve our own system we could also combine the RL algorithm with other features that we engineer, such as company news, performance, etc. - in both examples we used stock price as our only feature.
4. J.P. Morgan's Guide to Reinforcement Learning
If you want to read more about practical applications of reinforcement learning in finance check out J.P. Morgan's new paper: Idiosyncrasies and challenges of data driven learning in electronic trading.
The report was presented at the NIPS conference in December 2018, but has only recently been made public.
Here's the outline of the paper:
We outline the idiosyncrasies of neural information processing and machine learning in quantitative finance. We also present some of the approaches we take towards solving the fundamental challenges we face.
In addition to discussing supervised and unsupervised learning in finance, this paper:
shows the interplay between the agent’s constraints and rewards in one practical application of reinforcement learning.
The paper also discusses inverse reinforcement learning (IRL), which is the field of study that focuses on learning an agent’s objectives, values, or rewards by observing its behavior.
We also believe that inverse reinforcement learning is very promising: leveraging the massive history of rollouts of human and algo policies on financial markets in order to build local rewards is an active field of research.
5. Summary: Deep Reinforcement Learning for Trading
In this guide we looked at how we can apply the Q-learning algorithm to a continuing reinforcement learning task: trading.
To the best of my knowledge, mature commercial reinforcement learning trading applications aren't yet available - and this makes sense since the stable convergence of an RL system is still a hot topic in academic research.
Of course quantifying the financial markets is no easy pursuit, but if you would like to learn more about the topic I recommend the following resources.
- Reinforcement Learning for Trading
- Financial Trading as a Game: A Deep Reinforcement Learning Approach
- Reinforcement Learning For Automated Trading