Python is one of the fastest growing programming languages for applied finance and machine learning.
In this article, we'll look at how you can build models for time series analysis using Python.
As we'll discover, time series problems are different from traditional prediction problems.
The topics we'll cover in this guide include Pandas for time series data and time series analysis techniques.
The following is based off of notes from this course on Python for Financial Analysis and Algorithmic Trading.
To get started, let's review a few key points about Pandas for time series data.
1. Pandas for Time Series Data
The majority of financial datasets will be in the form of a time series, with a DateTime index and a corresponding value.
Pandas has special features for working with time series data. In particular, we'll look at:
- DateTime index
- Time Resampling
- Time Shifts
- Rolling and Expanding
Often in financial datasets the time and date won't be a separate column, but instead will be the index.
Python has built-in libraries for dates and times, so without installing any additional libraries we can use:
from datetime import datetime
This allows us to create timestamps or specific date objects.
Let's create a few variables:
my_year = 2019
my_month = 5
my_day = 1
To use Python's built-in datetime functionality we call datetime(), which takes in year, month, day, and optional time components - let's pass these arguments in:
my_date = datetime(my_year, my_month, my_day)
Let's take a look at this
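As a quick sketch of what this object holds, we can print it and read a couple of its fields. The variable names mirror the ones above, and the attributes used are standard fields of Python's datetime objects:

```python
from datetime import datetime

my_year = 2019
my_month = 5
my_day = 1

# Time components default to zero when they are not passed in
my_date = datetime(my_year, my_month, my_day)

print(my_date)       # the full timestamp, midnight by default
print(my_date.day)   # individual fields are available as attributes
print(my_date.hour)
```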
Let's look at how we can convert a list of two
datetime objects to an index:
my_list = [datetime(2019,1,1), datetime(2019,1,2)]
We can convert a NumPy array or list to an index with the following (using pandas, imported as pd):
dt_idx = pd.DatetimeIndex(my_list)
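To see why such an index is useful, here is a minimal sketch that attaches it to a Series and looks a value up by date. The prices are made-up illustrative numbers, not real market data:

```python
from datetime import datetime
import pandas as pd

my_list = [datetime(2019, 1, 1), datetime(2019, 1, 2)]
dt_idx = pd.DatetimeIndex(my_list)

# Illustrative values only; any list of the same length would work
prices = pd.Series([100.0, 101.5], index=dt_idx)

# Look up a value by its timestamp label
first_price = prices.loc[pd.Timestamp('2019-01-01')]

# The index itself supports date-aware operations such as max()
latest_date = dt_idx.max()

print(first_price, latest_date)
```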
When dealing with financial datasets we usually get data that has a DateTime index on a smaller scale (day, hour, minute, etc.).
For the purpose of analysis, however, it is often a good idea to aggregate data based on some frequency (monthly, quarterly, etc.).
You might think that
GroupBy can solve this, but it isn't made to understand things like business quarters, the start of a year, or the start of a week.
Pandas has frequency sampling tools built in to solve this.
To understand this, let's take a look at stock market data for Tesla from May 1st, 2018 - May 1st, 2019, which can be downloaded from Yahoo Finance.
Here are our imports:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
Now let's upload the data to Google Colab:
from google.colab import files
uploaded = files.upload()
Then we'll read in our CSV and take a look at the data:
df = pd.read_csv('TSLA.csv')
The Date column is what we want to be the index, so we first convert it to datetime values by passing the Series to pd.to_datetime():
df['Date'] = pd.to_datetime(df['Date'])
If we call df.info() we can see it is now a datetime64[ns] column.
Let's now set the Date column as the index, which we can do with df.set_index().
To simplify this, we could have also just set index_col='Date' and enabled date parsing when reading in the CSV.
We can then check the index with df.index.
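Both routes to a datetime index can be sketched as follows. The two-row CSV text is a hypothetical stand-in for TSLA.csv with made-up values, and parse_dates=True is one way (assumed here) of asking read_csv to parse the index column as dates:

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for TSLA.csv (values are made up)
csv_text = "Date,Open,Close\n2018-05-01,10.0,11.0\n2018-05-02,11.0,12.0\n"

# Option 1: read, convert the column, then set it as the index
df = pd.read_csv(io.StringIO(csv_text))
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')

# Option 2: let read_csv parse the dates and set the index in one call
df_short = pd.read_csv(io.StringIO(csv_text), index_col='Date', parse_dates=True)

# Either way we end up with a DatetimeIndex
print(df.index)
print(df_short.index)
```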
To do any sort of time resampling we need a datetime index, and then we can resample the DataFrame with df.resample(), passing in a rule.
The rule is just how we want to resample the data, and there are offset-string keywords for every type of time series frequency, which you can read more about in the documentation.
The rule is essentially acting as a GroupBy method specifically for time series data.
Let's look at an example of the A rule, which stands for "year-end frequency", and get the mean value based off the resampling:
# mean value based off the end of the year resampling
# (note: pandas 2.2+ spells this alias 'YE')
df.resample(rule='A').mean()
In this example everything before 2018-12-31 had a mean
Open value of $316.12.
Everything in between 2018-12-31 and 2019-12-31 had a mean of $291.44.
We can then get the mean value after quarterly resampling with the Q rule:
# quarterly resampling
# (note: pandas 2.2+ spells this alias 'QE')
df.resample(rule='Q').mean()
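To make the grouping behavior concrete, here is a small sketch on synthetic data where each day's value is just its month number, so the expected means are easy to check by hand. It uses the month-start (MS) and quarter-start (QS) rules rather than the year-end and quarter-end rules above, only because the spelling of the end-of-period aliases differs between pandas versions:

```python
import pandas as pd

# Synthetic daily data: each day's value is simply its month number
idx = pd.date_range('2019-01-01', '2019-06-30', freq='D')
df = pd.DataFrame({'value': idx.month.to_numpy()}, index=idx)

# One row per calendar month, labelled by the month's first day
monthly = df.resample(rule='MS').mean()

# One row per calendar quarter, labelled by the quarter's first day
quarterly = df.resample(rule='QS').mean()

print(monthly)
print(quarterly)
```

Each quarterly value is the day-weighted mean of its three months, which is the same result a GroupBy on the quarter would give.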
Often time series forecasting models require us to shift our data forward and backward with a certain amount of time steps.
Pandas makes this easy to do with the .shift() method.
To demonstrate this let's look at the Tesla CSV again, which has daily data.
If we ever want to shift our time period up by one step we can use df.shift(1).
After this we can see that we no longer have any values for our first time period.
We can also shift the time period backwards by using -1 for the number of periods.
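Here is a minimal sketch of shifting in both directions on a three-row frame with made-up prices. The Change column at the end is an illustrative use of a shift, not something taken from the Tesla data:

```python
import pandas as pd

idx = pd.date_range('2019-05-01', periods=3, freq='D')
df = pd.DataFrame({'Close': [10.0, 20.0, 30.0]}, index=idx)

# Each value moves one row later, so the first row becomes NaN
shifted_forward = df.shift(1)

# Each value moves one row earlier, so the last row becomes NaN
shifted_backward = df.shift(-1)

# A common use: day-over-day change relative to the previous row
df['Change'] = df['Close'] - df['Close'].shift(1)

print(shifted_forward)
print(shifted_backward)
print(df)
```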
Pandas Rolling & Expanding
We can use pandas' built-in
rolling method, for example if we want to create a rolling mean based off a given time period.
When dealing with financial data often the daily data can be quite noisy.
To account for this, we can use the rolling mean (otherwise known as the Moving Average) to generate a signal about the general trend of the data.
Let's plot our daily data with:
Let's now average this out by the week - we can either get the Moving Average on a particular column or Series, or on the entire DataFrame with the .rolling() method.
To do this we pass in 7 as the window and then add the aggregate function .mean().
We can see the first 6 values are null, and the 7th value is the mean of the first 7 rows.
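A small sketch on synthetic data shows where the nulls come from: a window of 7 needs 7 rows before it can produce its first value. The series below is simply the numbers 1 through 10, chosen so the means are easy to verify:

```python
import pandas as pd

idx = pd.date_range('2019-05-01', periods=10, freq='D')
close = pd.Series(range(1, 11), index=idx, dtype='float64')

# 7-row rolling window; rows without a full window are NaN
rolling_mean = close.rolling(window=7).mean()

print(rolling_mean)
```

The 7th entry is the mean of rows 1 through 7, and each later entry drops the oldest row and adds the newest one.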
Let's now plot the
Open column vs. the 7-day moving average of the Close column:
When we look at this plot we see that the blue line is the Open price column, and the orange line is the rolling 7-day average of the Close column.
Now, what do we do when we want to take into account everything from the start of the time series to the rolling point of the value?
For example, instead of just taking into account a 7-day rolling window, we take into account everything since the beginning of the time series to where we are at that point.
To do this we use the .expanding() method.
So what does this plot represent?
At each time step on the x-axis, what is shown on the y-axis is the value of everything that came before it averaged out.
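As a minimal sketch with made-up values, the expanding mean at each row is the running total divided by the number of rows seen so far:

```python
import pandas as pd

idx = pd.date_range('2019-05-01', periods=4, freq='D')
close = pd.Series([2.0, 4.0, 6.0, 8.0], index=idx)

# Mean of everything from the first row up to the current row
expanding_mean = close.expanding().mean()

# Equivalent formulation: running total divided by running count
manual = close.cumsum() / pd.Series(range(1, 5), index=idx)

print(expanding_mean)
```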
We'll look at more fundamental & technical analysis later, but one topic closely related to .rolling() is Bollinger Bands, so let's briefly discuss them.
Bollinger Bands are volatility bands placed above and below a moving average, where the volatility is based off the standard deviation which changes as volatility increases or decreases.
The bands widen when volatility increases and narrow when it decreases.
Let's look at how we can code Bollinger Bands with Pandas. Here are the steps we need to take.
We need to create 3 columns and then we plot them out:
- The first column is the Closing 20-day Moving Average
- Then create the upper band equal to 20-day MA + 2x the standard deviation over 20 days
- The lower band is equal to 20-day MA - 2x STD over 20 days
# Close 20 MA
df['Close: 20 Day Mean'] = df['Close'].rolling(20).mean()
# Upper = 20MA + 2*std(20)
df['Upper'] = df['Close: 20 Day Mean'] + 2*(df['Close'].rolling(20).std())
# Lower = 20MA - 2*std(20)
df['Lower'] = df['Close: 20 Day Mean'] - 2*(df['Close'].rolling(20).std())
# Plot Close
df[['Close','Close: 20 Day Mean','Upper','Lower']].plot(figsize=(16,6))
2. Time Series Analysis
Now that we've learnt about Pandas for time series data, let's shift focus to analysis techniques.
Time series data has special properties and a different set of predictive algorithms than other types of data.
A lot of financial data comes in the form of some value plotted against a time series.
We'll discuss the following topics:
- Introduction to Statsmodels
- ETS Models & Decomposition
- EWMA Models
- ARIMA Models
Introduction to Statsmodels
The most popular Python library for dealing with time series data is StatsModels:
statsmodels is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration.
StatsModels is heavily inspired by the statistical programming language R.
It allows users to explore data, estimate statistical models, and perform statistical tests.
Let's look at time series analysis with statsmodels. We first import statsmodels.api as sm and then load the macrodata dataset that comes with the library:
import statsmodels.api as sm
# import dataset with load_pandas method and .data attribute
df = sm.datasets.macrodata.load_pandas().data
df.head()
We can check out what is in the dataset with the
.NOTE attribute - this one is about economic data for the US.
Let's now set the year to be the time series index:
index = pd.Index(sm.tsa.datetools.dates_from_range('1959Q1','2009Q3'))
df.index = index
Now that the year is a time series index, let's plot the realgdp column:
Let's do some analysis using statsmodels to get the trend of the data, and in this case we're going to use the Hodrick-Prescott filter, sm.tsa.filters.hpfilter().
This returns a tuple of the estimated cycle in the data and the estimated trend in the data.
We're then going to use tuple unpacking to get the trend and plot it on top of the real GDP:
# let's use tuple unpacking to get this trend and plot it on top of this
gdp_cycle, gdp_trend = sm.tsa.filters.hpfilter(df['realgdp'])
# add a column for the trend
df['trend'] = gdp_trend
# plot the real gdp & the trend
df[['realgdp','trend']].plot()
ETS Models with StatsModels
ETS model stands for Error-Trend-Seasonality.
Let's take a look at the ETS components of a time series dataset.
ETS models take each of the terms (Error-Trend-Seasonality) for smoothing purposes - and may add them, multiply them, or leave some of them out of the model.
Based off these key factors we can create a model to fit our data.
So how can we break down a time series into each of these terms?
Time Series Decomposition with ETS is a method of breaking down a time series into these components.
Here's how we would do ETS decomposition for the TSLA CSV:
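from statsmodels.tsa.seasonal import seasonal_decompose
# df here is the TSLA DataFrame with its Date index
# (recent statsmodels versions use period= in place of the older freq= argument)
result = seasonal_decompose(df['Adj Close'], model='additive', period=12)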
And then we can plot out the components, for example the trend with result.trend.plot().
And then we can plot all the results with result.plot().
EWMA stands for Exponentially Weighted Moving Average.
We saw that with .rolling() we can create a simple model that describes a trend of a time series - these are referred to as Simple Moving Averages (SMA).
A few of the weaknesses of SMAs include:
- Smaller windows will lead to more noise, rather than signal
- It will always lag by the size of the window
- It will never reach the peak or valley of the data due to averaging
- It doesn't inform us about future behavior, it really just describes trends in the data
- Extreme historical values can skew the SMA
To recap, here's how we can calculate the 30-day SMA for TSLA:
# create 1 month SMA off of Adj Close
df['30 Day SMA'] = df['Adj Close'].rolling(window=30).mean()
# plot SMA & Adj Close
df[['Adj Close', '30 Day SMA']].plot(figsize=(10,8))
Exponentially Weighted Moving Averages solve some of these issues, in particular:
- EWMA allows you to reduce the lag time from SMA and puts more weight on values that occur more recently
- The amount of weight applied to the recent values depends on the parameters used in the EWMA and the number of periods in the window size
Here's how we can create an EWMA model:
# create EWMA
df['EWMA-30'] = df['Adj Close'].ewm(span=30).mean()
# plot EWMA
df[['Adj Close', 'EWMA-30']].plot(figsize=(10,8))
We can see the behavior at the beginning is different from at the end - this is because we've weighted the most recent points more heavily.
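To see how the span turns into weights, here is a sketch that reproduces a pandas EWMA by hand on four made-up values. It assumes adjust=False, which gives the simple recursive form. The code above keeps the pandas default adjust=True, which weights the earliest points differently:

```python
import pandas as pd

prices = pd.Series([1.0, 2.0, 3.0, 4.0])
span = 3
alpha = 2 / (span + 1)  # smoothing factor implied by the span

# Manual recursion: new average = alpha * new value + (1 - alpha) * old average
manual = [prices.iloc[0]]
for value in prices.iloc[1:]:
    manual.append(alpha * value + (1 - alpha) * manual[-1])

# adjust=False makes pandas use the same recursion
ewma = prices.ewm(span=span, adjust=False).mean()

print(manual)
print(ewma.tolist())
```

A larger span gives a smaller alpha, so each new value moves the average less and the line becomes smoother.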
Although ARIMA models are one of the most common time series models, they often don't work well with historical market data so we won't cover them here.
If you want to learn more about ARIMA models check out this article from Machine Learning Mastery.
Summary: Time Series Analysis with Python
In this guide we reviewed time series analysis for financial data with Python.
We saw that time series problems are different from traditional prediction problems and looked at Pandas for time series data, as well as several time series analysis techniques.
- The statsmodels library is the most popular Python library for dealing with time series data
- We saw how we can use statsmodels for ETS (Error-Trend-Seasonality) models
- We also looked at Simple and Exponentially Weighted Moving Averages (SMA & EWMA) for time series analysis
Have any questions about Python for time series analysis?
Let me know in the comments below.