In this guide, we'll discuss how to apply deep reinforcement learning to trading with TensorFlow 2.0.

In this article, we'll assume that you're familiar with deep reinforcement learning, although if you need a refresher you can find our full list of RL guides here.

This guide is based on notes from this TensorFlow 2.0 course and is organized as follows:

- Building a Deep Q-Learning Trading Network
- Stock Market Data Preprocessing
- Training our Deep Q-Learning Trading Agent
- Summary: Deep Reinforcement Learning for Trading with TensorFlow 2.0

If you're interested in learning more about machine learning for trading and investing, check out our AI investment research platform: the MLQ app.

The platform combines fundamentals, alternative data, and ML-based insights.

You can learn more about the MLQ app here or sign up for a free account here.

## 1. Building a Deep Q-Learning Trading Network

To start, we'll review how to implement deep Q-learning for trading with TensorFlow 2.0.

### Project Setup & Dependencies

The first step for this project is to change the runtime in Google Colab to GPU. We then need to install the following dependencies:

```
pip install tensorflow-gpu==2.0.0.alpha0
pip install pandas-datareader
```

Next we need to import the following libraries for the project:

```
import math
import random
import numpy as np
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt
import pandas_datareader as data_reader
from tqdm import tqdm_notebook, tqdm
from collections import deque
```

### Defining our Deep Q-Learning Trader

Now we need to define the algorithm itself with the `AI_Trader` class. Below are a few important points:

- In trading we have an action space of 3: Buy, Sell, and Sit
- We set the experience replay memory to a `deque` that holds at most 2000 elements
- We create an empty list called `inventory`, which will contain the stocks we've already bought
- We set the `gamma` parameter to `0.95`. This is the discount factor, which controls how much future rewards count relative to the current reward
- The `epsilon` parameter is used to determine whether we should take a random action or use the model to choose the action. We start by setting it to `1.0` so that the agent takes random actions in the beginning, when the model is not yet trained
- Over time we want to decrease the random actions and mostly use the trained model instead, so we set `epsilon_final` to `0.01`
- We then set the speed at which epsilon decreases with the `epsilon_decay` parameter

```
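class AI_Trader():
    def __init__(self, state_size, action_space=3, model_name="AITrader"):
        self.state_size = state_size
        self.action_space = action_space      # 0 = sit, 1 = buy, 2 = sell
        self.memory = deque(maxlen=2000)      # experience replay buffer
        self.inventory = []                   # prices of stocks currently held
        self.model_name = model_name
        self.gamma = 0.95                     # discount factor
        self.epsilon = 1.0                    # initial exploration rate
        self.epsilon_final = 0.01             # minimum exploration rate
        self.epsilon_decay = 0.995            # multiplicative decay per training call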
```
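To get a feel for these exploration settings, the following minimal sketch counts how many decay steps it takes for `epsilon` to fall from `1.0` to `epsilon_final`. It assumes the decay is applied once per training call, as in the training function later in this guide. `steps_until_floor` is a hypothetical helper for illustration and is not part of the class.

```python
def steps_until_floor(epsilon=1.0, epsilon_final=0.01, epsilon_decay=0.995):
    """Count multiplicative decay steps until epsilon is no longer above the floor."""
    steps = 0
    while epsilon > epsilon_final:
        epsilon *= epsilon_decay
        steps += 1
    return steps, epsilon


steps, final_epsilon = steps_until_floor()
print(steps)                     # 919
print(final_epsilon < 0.01)      # True: the last step lands just below the floor
```

With these values the agent stops exploring almost entirely after about 900 training calls, so a slower decay (a value closer to 1) may be worth trying if the agent settles on a policy too early.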

### Defining the Neural Network

Next we need to start defining our neural network.

The first step is to define a method called `model_builder`, which doesn't take any arguments other than `self`.

We then define the model with `tf.keras.models.Sequential()`.

The model's inputs are states, which are built from the stock prices of the previous `n` days.

A state is just a vector of numbers, so we can use a fully connected (dense) network.

Next, we add the first dense layer with `tf.keras.layers.Dense()`, set the number of neurons in the layer to 32, and set the activation to `relu`. We also need to define the input shape in the first layer with `input_dim=self.state_size`.

We're going to use 3 hidden layers in this network, so we add 2 more, with 64 neurons in the second layer and 128 in the third.

We then need to define the output layer and compile the network.

To define the output layer we set the number of neurons to the number of actions we can take, 3 in this case. We also change the activation function to `linear`, because the outputs are unbounded Q-value estimates that we fit with a mean-squared error loss:

```
    def model_builder(self):
        model = tf.keras.models.Sequential()
        # Three hidden layers with ReLU activations
        model.add(tf.keras.layers.Dense(units=32, activation='relu', input_dim=self.state_size))
        model.add(tf.keras.layers.Dense(units=64, activation='relu'))
        model.add(tf.keras.layers.Dense(units=128, activation='relu'))
        # One linear output per action (the estimated Q-values)
        model.add(tf.keras.layers.Dense(units=self.action_space, activation='linear'))
```

Finally, we need to compile the model. Since this is a regression task we use mean-squared error (`mse`) as the loss, and accuracy is not a meaningful metric here. We then use the `Adam` optimizer, set the learning rate to 0.001, and return the model:

```
        model.compile(loss='mse', optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
        return model
```

To build the model when the agent is created, we just need to add `self.model = self.model_builder()` to our `__init__` function. This call creates the network, initializes it, and stores it in the `self.model` attribute.
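To make the shapes concrete, here is a minimal NumPy sketch of a forward pass through the same 32-64-128-3 layout. It is not the Keras model: the weights are random stand-ins, `state_size = 10` is an assumed value, and `relu` and `forward` are hypothetical helpers written only for this illustration.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def forward(state, weights):
    """Run a (1, state_size) state through dense layers; the last layer is linear."""
    activation = state
    for index, (w, b) in enumerate(weights):
        z = activation @ w + b
        is_output_layer = index == len(weights) - 1
        activation = z if is_output_layer else relu(z)
    return activation


rng = np.random.default_rng(0)
state_size, action_space = 10, 3            # assumed values for illustration
layer_sizes = [state_size, 32, 64, 128, action_space]
weights = [
    (rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out))
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]

state = rng.normal(size=(1, state_size))    # one state, as a batch of size 1
q_values = forward(state, weights)
n_params = sum(w.size + b.size for w, b in weights)

print(q_values.shape)   # (1, 3): one value per action
print(n_params)         # 11171 weights and biases for state_size = 10
```

The parameter count is a handy sanity check: for a state size of 10, a dense network with this layout has 11,171 trainable parameters.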

### Building a Trading Function

Now that we've defined the neural network we need to build a function to trade that takes the state as input and returns an action to perform in that state.

To do this we're going to create a function called `trade` that takes in one argument: `state`.

For each state, we need to determine if we should use a randomly generated action or the neural network.

To do this, we draw a random number with `random.random()`, and if it is less than or equal to our `epsilon` we return a random action with `random.randrange()`, passing in `self.action_space`.

If the number is greater than `epsilon` we use our model to choose the action. To do this, we set `actions` equal to the result of `self.model.predict`, passing in the `state` as the argument.

We then use `np.argmax` to return only the `action` with the highest predicted Q-value.

To summarize:

- The function takes the state as input and generates a random number
- If the number is less than or equal to `epsilon` it will return a random action (this will always be the case in the beginning)
- If it is greater than `epsilon` it will use the model to perform a prediction on the input state and return the action that has the highest predicted Q-value

```
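    def trade(self, state):
        # Explore: take a random action with probability epsilon
        if random.random() <= self.epsilon:
            return random.randrange(self.action_space)
        # Exploit: pick the action with the highest predicted Q-value
        actions = self.model.predict(state)
        return np.argmax(actions[0])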
```
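The same epsilon-greedy rule can be tried without a trained model. The sketch below is a simplified stand-in: `choose_action` is a hypothetical helper, the Q-values are made-up numbers, and a seeded `random.Random` is used only so that the run is repeatable.

```python
import random


def choose_action(q_values, epsilon, rng):
    """Epsilon-greedy choice over a list of Q-values (one entry per action)."""
    if rng.random() <= epsilon:
        return rng.randrange(len(q_values))                      # explore
    return max(range(len(q_values)), key=q_values.__getitem__)   # exploit


rng = random.Random(0)            # seeded so the run is repeatable
q_values = [0.1, 0.7, 0.2]        # made-up Q-values for sit, buy, sell

greedy_action = choose_action(q_values, epsilon=0.0, rng=rng)
random_actions = [choose_action(q_values, epsilon=1.0, rng=rng) for _ in range(1000)]

print(greedy_action)              # 1, the index of the largest Q-value
print(sorted(set(random_actions)))
```

With `epsilon=1.0` every call explores, so all three actions show up across many draws. With `epsilon=0.0` the call returns the action with the largest Q-value.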

### Training the Model

Now that we've implemented the `trade` function let's build a custom training function.

This function will take a batch of saved data and train the model on it. Below is a step-by-step process to do this:

- We define this function as `batch_train`, and it takes `batch_size` as an argument
- We select data from the experience replay memory by first setting `batch` to an empty list
- We then iterate through the memory with a `for` loop
- Since we're dealing with time series data we sample from the end of the memory instead of randomly sampling from it
- Now that we have a batch of data we iterate through each experience in the batch (`state`, `action`, `reward`, `next_state`, and `done`) and train the model with it
- If the agent is in a terminal state, the target is just the current `reward`; otherwise we add the discounted maximum Q-value predicted for the `next_state`
- Next we define the `target` variable, which is the model's prediction for the current `state`, and overwrite the entry for the action taken with the value computed above
- Next we fit the model with `self.model.fit()`
- At the end of this function we decrease the `epsilon` parameter so that we slowly stop performing random actions

```
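    def batch_train(self, batch_size):
        # Take the most recent `batch_size` experiences (time-ordered, not random)
        batch = []
        for i in range(len(self.memory) - batch_size, len(self.memory)):
            batch.append(self.memory[i])

        for state, action, reward, next_state, done in batch:
            # Terminal states use the raw reward; otherwise add the discounted
            # best Q-value predicted for the next state
            if not done:
                reward = reward + self.gamma * np.amax(self.model.predict(next_state)[0])

            # Overwrite only the Q-value of the action that was actually taken
            target = self.model.predict(state)
            target[0][action] = reward

            self.model.fit(state, target, epochs=1, verbose=0)

        # Decay epsilon so the agent gradually explores less
        if self.epsilon > self.epsilon_final:
            self.epsilon *= self.epsilon_decay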
```
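The bookkeeping inside `batch_train` can be tried without a neural network. The sketch below shows two pieces under simplifying assumptions: taking the most recent items from a bounded `deque`, and computing the Q-learning target from made-up Q-values. `latest` and `q_target` are hypothetical helpers, not part of the `AI_Trader` class.

```python
from collections import deque
from itertools import islice

GAMMA = 0.95


def q_target(reward, next_q_values, done, gamma=GAMMA):
    """Target for the action taken: reward, plus discounted best next Q-value."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


def latest(memory, batch_size):
    """Return the most recent `batch_size` items of a deque, oldest first."""
    start = max(len(memory) - batch_size, 0)
    return list(islice(memory, start, len(memory)))


memory = deque(maxlen=5)          # a tiny replay buffer for illustration
for step in range(8):
    memory.append(step)           # the oldest entries are dropped automatically

recent = latest(memory, 3)
mid_episode = q_target(2.0, [0.5, 1.0, -0.3], done=False)
last_step = q_target(2.0, [0.5, 1.0, -0.3], done=True)

print(list(memory))   # [3, 4, 5, 6, 7]
print(recent)         # [5, 6, 7]
print(last_step)      # 2.0
```

Here `mid_episode` works out to `2.0 + 0.95 * 1.0`, which is the reward plus the discounted best Q-value of the next state. On the final step the target is just the reward.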

## 2. Stock Market Data Preprocessing

Now that we've built our `AI_Trader` class we need to create a few helper functions that will be used in the learning process.

In particular, we need to define the following 3 functions:

1. `sigmoid`: sigmoid is an activation function, generally used at the end of a network for binary classification, as it scales a number to a range from 0 to 1. Here it will be used to normalize the day-to-day price differences.

```
def sigmoid(x):
    return 1 / (1 + math.exp(-x))
```

2. `stocks_price_format`: this is a formatting function to print out the prices of the stocks we bought or sold.

```
def stocks_price_format(n):
    if n < 0:
        return "- $ {0:.2f}".format(abs(n))
    else:
        return "$ {0:.2f}".format(abs(n))
```

3. `dataset_loader`: this function connects to a data source and pulls the stock data from it. In this case we're loading data from Yahoo Finance:

```
def dataset_loader(stock_name):
    dataset = data_reader.DataReader(stock_name, data_source="yahoo")
    # First and last dates in the downloaded data (for reference only)
    start_date = str(dataset.index[0]).split()[0]
    end_date = str(dataset.index[-1]).split()[0]
    close = dataset['Close']
    return close
```
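If the remote data source is unreachable, or you prefer a reproducible local file, closing prices can also be read from a CSV export. The following is only a sketch: `load_close_from_csv` is a hypothetical helper, the `Date` and `Close` column names are assumptions about the file layout, and the sample rows are made-up numbers rather than real quotes.

```python
import csv
import io


def load_close_from_csv(handle):
    """Read closing prices from CSV text with 'Date' and 'Close' columns."""
    reader = csv.DictReader(handle)
    # ISO-formatted dates (YYYY-MM-DD) sort chronologically as plain strings
    rows = sorted(reader, key=lambda row: row["Date"])
    return [float(row["Close"]) for row in rows]


# Made-up sample data, deliberately out of order
sample = io.StringIO(
    "Date,Close\n"
    "2020-01-03,101.5\n"
    "2020-01-02,100.0\n"
    "2020-01-06,99.75\n"
)

close = load_close_from_csv(sample)
print(close)   # [100.0, 101.5, 99.75]
```

The result is a plain list of floats in chronological order, which can be indexed and sliced by position in the same way as the closing price series used in the rest of this guide.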

Below we can take a look at the AAPL dataset. With this information we are going to build states for our network.
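One caveat about the `sigmoid` helper defined earlier in this section: `math.exp` raises an `OverflowError` when its argument is larger than roughly 709, which would happen for a single-step price drop of more than about 709 units. That is unlikely for most stocks but possible for high-priced assets or index levels. The following sketch shows one common way to avoid it. `stable_sigmoid` is a hypothetical replacement, not part of the original code.

```python
import math


def stable_sigmoid(x):
    """Sigmoid that never exponentiates a large positive number."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)          # x < 0, so z lies in [0, 1)
    return z / (1.0 + z)


print(stable_sigmoid(0.0))        # 0.5
print(stable_sigmoid(-1000.0))    # 0.0
print(stable_sigmoid(1000.0))     # 1.0
```

For ordinary price changes both versions return the same values, so this can be used as a drop-in substitute if overflow is a concern.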

### State Creator

Now that we have our `dataset_loader` function we need to create a function that takes this data and generates states from it.

Let's first look at how we can translate the problem of stock market trading to a reinforcement learning environment.

- Each point on a stock graph is just a floating number that represents a stock price at a given time.
- Our task is to predict what is going to happen in the next period, and as mentioned there are 3 possible actions: buy, sell, or sit.

Framed as a regression problem, we would take a window of past prices, say `window_size = 5`, and use those 5 values to predict a target that is a continuous number.

Instead of predicting real numbers for our target, we want to predict one of our 3 actions.

Next we're going to change our input states to be differences in stock prices, which will represent price changes over time.

To implement this in Python we're going to create a function `state_creator` which takes 3 arguments: `data`, `timestep`, and `window_size`:

- We first need to calculate the `starting_id`
- When the `starting_id` is non-negative we take the window directly from the data, and if it is negative we pad the start of the window with the first price until it reaches the `window_size`
- Next we define an empty list called `state` and iterate through the `windowed_data` list
- As we append to the state we normalize the price differences with the `sigmoid` function
- To complete the function we return a NumPy array of the state

```
def state_creator(data, timestep, window_size):
    starting_id = timestep - window_size + 1
    if starting_id >= 0:
        windowed_data = list(data[starting_id:timestep+1])
    else:
        # Not enough history yet: pad the front with the first price
        windowed_data = -starting_id * [data[0]] + list(data[0:timestep+1])
    state = []
    for i in range(window_size - 1):
        # Squash each day-to-day price change into the range (0, 1)
        state.append(sigmoid(windowed_data[i+1] - windowed_data[i]))
    return np.array([state])
```
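A small worked example helps to check the windowing and padding logic. The sketch below uses a self-contained variant of `state_creator` in which the padding count is written as `-starting_id`, so that early timesteps still get a full window. The price list is made up for illustration.

```python
import math

import numpy as np


def sigmoid(x):
    return 1 / (1 + math.exp(-x))


def state_creator(data, timestep, window_size):
    starting_id = timestep - window_size + 1
    if starting_id >= 0:
        windowed_data = list(data[starting_id:timestep + 1])
    else:
        # Pad the front with the first price so the window is always full
        windowed_data = -starting_id * [data[0]] + list(data[0:timestep + 1])
    state = [sigmoid(windowed_data[i + 1] - windowed_data[i])
             for i in range(window_size - 1)]
    return np.array([state])


prices = [10.0, 11.0, 13.0, 12.0]    # made-up prices

# timestep 1 has only two prices available, so the window becomes [10, 10, 11]
early = state_creator(prices, timestep=1, window_size=3)
# timestep 3 has enough history: the window is [11, 13, 12]
late = state_creator(prices, timestep=3, window_size=3)

print(early.shape)   # (1, 2)
print(early)         # differences 0 and +1
print(late)          # differences +2 and -1
```

Note that a window of `window_size` prices yields `window_size - 1` differences. This is why the training loop later calls `state_creator` with `window_size + 1`: the resulting state then has exactly `window_size` entries, matching the input size of the network.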

### Loading a Dataset

Now that we have our `state_creator` function we can load our dataset.

First we need to define a new variable called `stock_name`, and for this example we'll use `AAPL`.

Then we define a variable called `data` by calling our `dataset_loader` function with `stock_name`.

## 3. Training the Q-Learning Trading Agent

Before we proceed to training our model, let's define a few hyperparameters, including:

```
window_size = 10
episodes = 1000
batch_size = 32
data_samples = len(data) - 1
```
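These settings largely determine how long training takes. As a rough sketch, the code below counts how many times a loop that trains whenever the replay memory holds more than `batch_size` experiences would call the training function. `count_batch_train_calls` is a hypothetical helper, and `data_samples = 1000` is an assumed value, since the real number depends on the downloaded data.

```python
def count_batch_train_calls(episodes, data_samples, batch_size, memory_limit=2000):
    """Count how often the `len(memory) > batch_size` check passes."""
    memory_len = 0
    calls = 0
    for _ in range(episodes):
        for _ in range(data_samples):
            # One experience is stored per timestep, up to the memory limit
            memory_len = min(memory_len + 1, memory_limit)
            if memory_len > batch_size:
                calls += 1
    return calls


calls = count_batch_train_calls(episodes=1000, data_samples=1000, batch_size=32)
print(calls)   # 999968
```

Under these assumptions nearly every timestep triggers a training call, and each call fits the model once per sampled experience. It can therefore be worth starting with far fewer episodes to confirm that everything runs before committing to a long training session.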

Now it's time to define our trading agent, and let's take a look at a summary of the model:

```
trader = AI_Trader(window_size)
trader.model.summary()
```

### Defining a Training Loop

Now we need to train our model, which we're going to do with a `for` loop that will iterate through all of the `episodes`.

- First we print out the current episode
- We then define our initial state with `state_creator`
- Then we define a variable to keep track of `total_profit`, and we empty our inventory at the beginning of each episode with `trader.inventory = []`
- Next we step through time (1 timestep is 1 day) with a `for` loop over the number of samples we have. At each timestep we define our `action`, `next_state`, and `reward`
- Then we update our `inventory` based on the given `action`
- Based on the action we calculate our `reward` and update the `total_profit`
- We then check if this is the last sample in our dataset
- Next we append all of the data to our trader's experience replay buffer with `trader.memory.append()`
- We then set the `state` to the `next_state` so we can iterate through the whole `episode`
- Finally we print out the `total_profit` if `done = True`, and add print statements for when we buy or sell and what the profit is

There are two more things to do before starting the training process:

- We need to check if we have more experiences in our `memory` than our `batch_size`. If that is true we call `trader.batch_train` and pass in the `batch_size` argument
- We then check if the episode number is divisible by 10, and if that is the case we save the model with `trader.model.save()` in an H5 file

```
for episode in range(1, episodes + 1):
    print("Episode: {}/{}".format(episode, episodes))
    state = state_creator(data, 0, window_size + 1)
    total_profit = 0
    trader.inventory = []

    for t in tqdm(range(data_samples)):
        action = trader.trade(state)
        next_state = state_creator(data, t+1, window_size + 1)
        reward = 0

        if action == 1:  # Buying
            trader.inventory.append(data[t])
            print("AI Trader bought: ", stocks_price_format(data[t]))
        elif action == 2 and len(trader.inventory) > 0:  # Selling
            buy_price = trader.inventory.pop(0)
            reward = max(data[t] - buy_price, 0)
            total_profit += data[t] - buy_price
            print("AI Trader sold: ", stocks_price_format(data[t]), " Profit: " + stocks_price_format(data[t] - buy_price))

        if t == data_samples - 1:
            done = True
        else:
            done = False

        trader.memory.append((state, action, reward, next_state, done))
        state = next_state

        if done:
            print("########################")
            print("TOTAL PROFIT: {}".format(total_profit))
            print("########################")

        if len(trader.memory) > batch_size:
            trader.batch_train(batch_size)

    if episode % 10 == 0:
        trader.model.save("ai_trader_{}.h5".format(episode))
```
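The reward and profit bookkeeping in this loop is easy to check in isolation. The sketch below replays a scripted sequence of actions against made-up prices, using the same first-in, first-out inventory and the same clipped reward. `simulate` is a hypothetical helper written only for this illustration.

```python
def simulate(prices, actions):
    """Replay scripted actions (0 = sit, 1 = buy, 2 = sell) with a FIFO inventory."""
    inventory, rewards, total_profit = [], [], 0.0
    for price, action in zip(prices, actions):
        reward = 0.0
        if action == 1:
            inventory.append(price)
        elif action == 2 and inventory:
            buy_price = inventory.pop(0)           # sell the oldest purchase first
            reward = max(price - buy_price, 0.0)   # losses are clipped to zero
            total_profit += price - buy_price      # profit keeps its sign
        rewards.append(reward)
    return rewards, total_profit, inventory


prices = [10.0, 12.0, 11.0, 9.0, 13.0]   # made-up prices
actions = [1, 1, 2, 2, 0]                # buy, buy, sell, sell, sit

rewards, total_profit, inventory = simulate(prices, actions)
print(rewards)        # [0.0, 0.0, 1.0, 0.0, 0.0]
print(total_profit)   # -2.0
print(inventory)      # []
```

The example highlights a design choice in the loop: the reward is never negative, so a losing sale (bought at 12, sold at 9) is recorded as a reward of 0 even though it lowers `total_profit`. The agent is therefore not directly penalized for selling at a loss.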

## 4. Summary: Deep Reinforcement Learning for Trading with TensorFlow 2.0

In this article, we looked at how to build a trading agent with deep Q-learning using TensorFlow 2.0.

We started by defining an `AI_Trader` class, then we loaded and preprocessed our data from Yahoo Finance, and finally we defined our training loop to train the agent.

Although this surely won't be the best AI trading agent of all time (and, of course, is not recommended to trade with), it does provide a good starting point to build off of.

To finish off, here are a few ways that we could improve this model:

- Adding trend following indicators to our input data
- We could use an LSTM network instead of simple dense layers
- We could use sentiment analysis with natural language processing to provide the model with more input data
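As a starting point for the first idea, here is a minimal sketch of one simple trend-following indicator, a trailing simple moving average. `simple_moving_average` is a hypothetical helper, the prices are made up, and how the indicator is fed into the state (for example, as the difference between the current price and the average) is left as an open design choice.

```python
def simple_moving_average(prices, period):
    """Trailing mean of the last `period` prices; None until enough data exists."""
    averages = []
    running_sum = 0.0
    for i, price in enumerate(prices):
        running_sum += price
        if i >= period:
            running_sum -= prices[i - period]   # drop the price leaving the window
        averages.append(running_sum / period if i >= period - 1 else None)
    return averages


prices = [10.0, 11.0, 13.0, 12.0, 14.0]   # made-up prices
sma = simple_moving_average(prices, period=3)
print(sma[3:])   # [12.0, 13.0]
```

The first `period - 1` entries are `None` because there is not yet enough history, so any state built from this indicator would need a padding rule similar to the one used in `state_creator`.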