In this article, we're going to train a k-means clustering algorithm to group companies based on their stock market movements over a 2-year period.

The goal of the project is to find similarities amongst companies that we might otherwise not be able to detect. To do this, the k-means clustering algorithm will produce labels that assign each company to a cluster.

The k-means clustering algorithm is part of the unsupervised learning family, and is defined as follows:

k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.
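To make the definition concrete, here is a minimal sketch of the usual iteration (often called Lloyd's algorithm), assuming Euclidean distance and a random choice of starting centers. The function name `kmeans_sketch` and the toy data are illustrative assumptions and not part of the course code; the rest of the article uses scikit-learn's KMeans instead.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Toy k-means: returns (labels, centers) for the rows of X."""
    rng = np.random.default_rng(seed)
    # start from k distinct observations picked at random
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # squared distance from every observation to every center
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        # each observation joins the cluster with the nearest mean
        labels = dists.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break  # the means stopped moving, so the labels are stable
        centers = new_centers
    return labels, centers

# two made-up, well-separated groups of points
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels, centers = kmeans_sketch(X, k=2)
print(labels)
```

The label numbers themselves are arbitrary; what matters is which observations share a label.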

The following article is based on this course: Learn Machine Learning by Building Projects, and is organized as follows:

1. Imports & Data
2. Exploratory Data Analysis (EDA)
3. K-Means Clustering
4. Principal Component Analysis (PCA)
5. Summary of Stock Market Clustering with K-Means

## 1. Imports & Data

Our data source for the companies is Yahoo Finance, and we'll read in the data with pandas-datareader.

Before we pull the data from Yahoo Finance, let's import the initial packages we're going to need. We'll import the machine learning libraries later on.

import pandas_datareader.data as web
from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
import datetime

Next, we're going to define a dictionary with the companies that we're going to be clustering. We're going to use 28 companies across several industries.

Since this is a dictionary, we're then going to build a sorted list of (company name, ticker) pairs with the sorted() function, passing in companies_dict.items() as well as an inline lambda function that sorts on index 1, which is the stock ticker.

# define instruments to download
companies_dict = {
    'Amazon': 'AMZN',
    'Apple': 'AAPL',
    'Walgreen': 'WBA',
    'Northrop Grumman': 'NOC',
    'Boeing': 'BA',
    'Lockheed Martin': 'LMT',
    'McDonalds': 'MCD',
    'Intel': 'INTC',
    'Navistar': 'NAV',
    'IBM': 'IBM',
    'Texas Instruments': 'TXN',
    'MasterCard': 'MA',
    'Microsoft': 'MSFT',
    'General Electric': 'GE',
    'Symantec': 'SYMC',
    'American Express': 'AXP',
    'Pepsi': 'PEP',
    'Coca Cola': 'KO',
    'Johnson & Johnson': 'JNJ',
    'Toyota': 'TM',
    'Honda': 'HMC',
    'Mitsubishi': 'MSBHY',
    'Sony': 'SNE',
    'Exxon': 'XOM',
    'Chevron': 'CVX',
    'Valero Energy': 'VLO',
    'Ford': 'F',
    'Bank of America': 'BAC'
}

# list of (company name, ticker) pairs, sorted by ticker (index 1)
companies = sorted(companies_dict.items(), key=lambda x: x[1])
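As a quick illustration of what sorting on index 1 does, here is a small sketch with a hypothetical three-company subset. The names `sample_dict` and `sample_sorted` are made up for this example.

```python
# hypothetical three-company subset, for illustration only
sample_dict = {'Amazon': 'AMZN', 'Apple': 'AAPL', 'Boeing': 'BA'}

# items() yields (name, ticker) pairs; x[1] is the ticker, so we sort by ticker
sample_sorted = sorted(sample_dict.items(), key=lambda x: x[1])

print(sample_sorted)
# [('Apple', 'AAPL'), ('Amazon', 'AMZN'), ('Boeing', 'BA')]
```

The result is a list of tuples rather than a list of names, so the company name is at index 0 of each entry. Sorting by ticker only lines this list up with the downloaded data if the price columns are also in ticker order, which is worth checking against the columns of the downloaded frame.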

Next we're going to define the data source we're going to use, which in this case will be Yahoo Finance.

We're also going to define the start and end dates - we're going to use 2 years of data from 2017-01-01 to 2019-01-01.

Then we're going to use web.DataReader() to load the companies we're interested in. In this case, we use companies_dict.values().

# Define which online source to use
data_source = 'yahoo'

# define start and end dates
start_date = '2017-01-01'
end_date = '2019-01-01'

panel_data = web.DataReader(list(companies_dict.values()), data_source, start_date, end_date)

print(panel_data.axes)

Now that we have our data let's define the stock open and close values, and take a look at the close for each company on 2017-01-03.

# Find Stock Open and Close Values
stock_close = panel_data['Close']
stock_open = panel_data['Open']

# first row = first trading day in the range
print(stock_close.iloc[0])

Now let's move on to calculating the daily stock movements, since it's off of this movement that we want to cluster our data.

To do this we're first going to convert our stock_open and stock_close values to numpy arrays. Since each company is currently a column, with one row per trading day, we're going to use .T to transpose the arrays so that each company becomes a row instead.

We're then going to create a movements dataset, and we'll start with a blank numpy array filled with 0's for now.

We're then going to write a for loop to assign the daily movement (stock_close - stock_open) for all the dates.

# Calculate daily stock movement
stock_close = np.array(stock_close).T
stock_open = np.array(stock_open).T

row, col = stock_close.shape

# create movements dataset filled with 0's
movements = np.zeros([row, col])

for i in range(0, row):
    movements[i,:] = np.subtract(stock_close[i,:], stock_open[i,:])
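The row-by-row loop can also be written as a single array subtraction, since NumPy subtracts element by element. Here is a small sketch on made-up prices; the `toy_open` and `toy_close` arrays are assumptions for illustration, not downloaded data.

```python
import numpy as np

# made-up prices: 2 companies (rows) x 3 trading days (columns)
toy_open = np.array([[10.0, 11.0, 12.0],
                     [100.0, 98.0, 99.0]])
toy_close = np.array([[10.5, 10.8, 12.4],
                      [101.0, 97.0, 99.5]])

# loop version, one company (row) at a time
loop_movements = np.zeros(toy_close.shape)
for i in range(toy_close.shape[0]):
    loop_movements[i, :] = np.subtract(toy_close[i, :], toy_open[i, :])

# vectorized version: close minus open for every company and day at once
vec_movements = toy_close - toy_open

print(np.array_equal(loop_movements, vec_movements))
```

Both versions produce the same array, so the choice is mainly about readability and speed on larger datasets.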

To make sure we did this correctly let's write another for loop to print out each company's total movement, summed over all the dates.

for i in range(0, len(companies)):
    print('Company: {}, Change: {}'.format(companies[i][0], sum(movements[i][:])))

Now that we've imported and set up our dataset, let's do some exploratory data analysis before we apply the k-means clustering algorithm.

## 2. Exploratory Data Analysis (EDA)

Exploratory data analysis is an important step in any machine learning project because the better we understand our data the more effective our methods can be.

We're going to use matplotlib to plot the stock movements of the first 2 companies: AAPL and AMZN.

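plt.figure(figsize=(18,16))
# first company (row 0 of movements)
ax1 = plt.subplot(221)
plt.plot(movements[0][:])
plt.title(companies[0][0])

# second company (row 1), sharing the y-axis so the scales are comparable
plt.subplot(222, sharey=ax1)
plt.plot(movements[1][:])
plt.title(companies[1][0])
plt.show()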

What we can see from these two stocks is that we have different scales between the price movements.

This means we need to do a normalization step before we apply k-means clustering. If we don't do this the algorithm would just cluster based on the price of the stock.

To do this we're going to use Normalizer() from sklearn.preprocessing, which rescales each company's row of movements to unit length, and then we'll print out the new maximum movement value, the minimum, and the mean.

# import Normalizer
from sklearn.preprocessing import Normalizer
# create the Normalizer
normalizer = Normalizer()

new = normalizer.fit_transform(movements)

print(new.max())
print(new.min())
print(new.mean())
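With its default settings, Normalizer divides each row by that row's Euclidean (L2) norm. The following sketch reproduces the idea in plain NumPy on made-up numbers, assuming no row is all zeros; `toy_movements` is a hypothetical array, not the real dataset.

```python
import numpy as np

# two made-up companies whose movements differ only in scale (10x)
toy_movements = np.array([[3.0, -4.0, 0.0],
                          [30.0, -40.0, 0.0]])

# Euclidean norm of each row, kept as a column so it broadcasts per row
row_norms = np.linalg.norm(toy_movements, axis=1, keepdims=True)

# divide every row by its own norm; each row now has length 1
toy_normalized = toy_movements / row_norms

print(toy_normalized)
```

After this step both rows are identical, which is the point: two stocks that move in the same pattern end up close together regardless of their price level.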

Let's now plot out the movements of AAPL and AMZN again and see how they've changed:

As we can see we have much more even movements now that we've normalized the data.

## 3. K-Means Clustering

Even though we've just normalized the data, we're going to use the Normalizer again in a pipeline just to see how pipelines work in scikit-learn.

We're then going to create a k-means model with 10 clusters. Finally we'll make a pipeline that chains together the normalizer and the k-means clustering algorithm.

# import machine learning libraries
from sklearn.pipeline import make_pipeline
from sklearn.cluster import KMeans

# define normalizer
normalizer = Normalizer()

# create a K-means model with 10 clusters
kmeans = KMeans(n_clusters=10, max_iter=1000)

# make a pipeline chaining normalizer and kmeans
pipeline = make_pipeline(normalizer,kmeans)

With the pipeline defined, let's fit it to the daily stock movements.

# fit pipeline to daily stock movements
pipeline.fit(movements)

To check how well the algorithm did let's use print(kmeans.inertia_).

Inertia is the sum of squared distances from each sample to its nearest cluster center, so for a given number of clusters a lower inertia score means tighter clusters. In this case, we get a score of 7.71.
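To show what this score measures, here is a minimal sketch that computes the sum of squared distances by hand on four made-up points. The `points` and `centers` arrays are assumptions for illustration; the centers are fixed by hand rather than fitted.

```python
import numpy as np

# four made-up samples and two hand-picked cluster centers
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 4.0]])
centers = np.array([[0.0, 1.0], [10.0, 2.0]])

# squared distance from every sample to every center, shape (4, 2)
sq_dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)

# keep each sample's distance to its nearest center, then add them up
toy_inertia = sq_dists.min(axis=1).sum()

print(toy_inertia)  # 10.0
```

In the pipeline, the k-means step sees the normalized movements, so its inertia is measured on that normalized scale rather than in dollars.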

Now we're going to actually predict the cluster labels.

So the question is: based off the movements, which cluster should we assign the company to?

To visualize this we'll create a DataFrame that aligns the labels to the companies and then print them out.

# predict cluster labels
labels = pipeline.predict(movements)

# create a DataFrame aligning labels & companies
df = pd.DataFrame({'labels': labels, 'companies': companies})

# display df sorted by cluster labels
print(df.sort_values('labels'))

At first glance this looks pretty good: we can see banks clustered together, tech stocks clustered together, and Coca Cola and Pepsi clustered together, amongst others.

There are a few that don't necessarily make sense, but that is to be expected since we're just clustering based on movement.

Let's now move on to PCA.

## 4. Principal Component Analysis (PCA)

We are now going to do a linear dimensionality reduction using singular value decomposition of the data.

We're going to do this to project it to a lower dimensional space so that we can graphically represent the different clusters.

We're first going to use PCA from sklearn.decomposition, and then we're going to run a k-means clustering algorithm on the reduced data and compare it to our previous results.

The number of components we're going to use is 2 because we want to plot it on a 2-dimensional graph.

We're not going to use a pipeline this time, so we're just going to pass new, our normalized data from earlier, to fit_transform().
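Before the scikit-learn version, here is a rough NumPy sketch of the idea behind the reduction: center the data, take its singular value decomposition, and project onto the first two right singular vectors. The helper `pca_sketch` and the toy array are assumptions for illustration; scikit-learn's PCA may flip the sign of a component and offers other solvers, so treat this as a sketch rather than its exact implementation.

```python
import numpy as np

def pca_sketch(X, n_components=2):
    """Project the rows of X onto their top principal directions."""
    # center each feature (column) at zero
    X_centered = X - X.mean(axis=0)
    # rows of Vt are the principal directions, ordered by singular value
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    # coordinates of each row along the first n_components directions
    return X_centered @ Vt[:n_components].T

# made-up data: 4 samples with 3 features that all lie on one line
toy = np.array([[1.0, 2.0, 0.0],
                [2.0, 4.0, 0.0],
                [3.0, 6.0, 0.0],
                [4.0, 8.0, 0.0]])
toy_reduced = pca_sketch(toy)

print(toy_reduced.shape)
```

Because the toy samples lie on a single line, the first component captures all of their spread and the second is numerically zero. Real stock movements are not this tidy, so two components keep only part of the variation.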

# PCA
from sklearn.decomposition import PCA

# reduce the normalized data to 2 principal components
reduced_data = PCA(n_components = 2).fit_transform(new)

# run kmeans on reduced data
kmeans = KMeans(n_clusters=10)
kmeans.fit(reduced_data)
labels = kmeans.predict(reduced_data)

# create DataFrame aligning labels & companies
df = pd.DataFrame({'labels': labels, 'companies': companies})

# Display df sorted by cluster labels
print(df.sort_values('labels'))

We can see we still have some of the tech stocks clustered together, the defense companies are clustered, and energy companies are also clustered.

The previous clustering does look a bit better than these results, but this still does a decent job.

The reason we're doing PCA though is that we can graphically represent it, so let's plot this out with np.meshgrid:

# Define step size of mesh
h = 0.01

# plot the decision boundary
x_min, x_max = reduced_data[:, 0].min() - 1, reduced_data[:,0].max() + 1
y_min, y_max = reduced_data[:, 1].min() - 1, reduced_data[:,1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Obtain labels for each point in the mesh using our trained model
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)

# define colormap
cmap = plt.cm.Paired

# plot figure
plt.clf()
plt.figure(figsize=(10,10))
plt.imshow(Z, interpolation='nearest',
           extent=(xx.min(), xx.max(), yy.min(), yy.max()),
           cmap=cmap,
           aspect='auto', origin='lower')
plt.plot(reduced_data[:, 0], reduced_data[:, 1], 'k.', markersize=5)

# plot the centroid of each cluster as a white X
centroids = kmeans.cluster_centers_
plt.scatter(centroids[:, 0], centroids[:, 1],
            marker='x', s=169, linewidth=3,
            color='w', zorder=10)

plt.title('K-Means Clustering on Stock Market Movements (PCA-Reduced Data)')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.show()

Here we can see the mesh colored by its 10 clusters, with the center of each cluster plotted as a white X.
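The mesh step can be hard to picture, so here is a tiny sketch of the same pattern on a 3-by-2 grid: build the grid with np.meshgrid, flatten it into a list of (x, y) points, label each point, and reshape the labels back to the grid. The centers are made up, and the nearest-center lookup stands in for kmeans.predict, which assigns each point to its closest cluster center.

```python
import numpy as np

# 3 x-values and 2 y-values give a grid with 2 rows and 3 columns
gx, gy = np.meshgrid(np.arange(0, 3), np.arange(0, 2))

# flatten both grids and pair them up: one (x, y) row per grid point
grid_points = np.c_[gx.ravel(), gy.ravel()]

# two made-up cluster centers, for illustration only
toy_centers = np.array([[0.0, 0.0], [2.0, 1.0]])

# label every grid point with the index of its nearest center
sq = ((grid_points[:, None, :] - toy_centers[None, :, :]) ** 2).sum(axis=2)
grid_labels = sq.argmin(axis=1).reshape(gx.shape)

print(grid_labels)
# [[0 0 1]
#  [0 1 1]]
```

With a step size of 0.01 the real mesh has far more points, which is what makes the colored regions look continuous.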

## 5. Summary of Stock Market Clustering with K-Means

To summarize, in this article we looked at applying k-means clustering, which is a popular unsupervised learning technique, to a group of companies.

We first imported the data using pandas-datareader and Yahoo Finance for 28 stocks for a 2-year period from 2017 to 2019.

We then calculated each stock's daily movement from the open and close values.

Following this, we visualized the stock market movements and saw that we needed to normalize our data.

We then used the k-means clustering algorithm on our normalized data to predict the label of each company and assigned them to 10 different clusters.

We then saw how we can reduce the dimensionality of our data to two dimensions with principal component analysis (PCA) and plot it based on the clusters.