The term machine learning is thrown around a lot these days. In this guide we'll discuss exactly what it means, a brief history of AI, and a 7-step framework for machine learning projects.
Before we get into machine learning, the first thing to grasp is just how much data is generated each and every day. As the data science company DOMO reports:
Over 2.5 quintillion bytes of data are created every single day, and it’s only going to grow from there. By 2020, it’s estimated that 1.7MB of data will be created every second for every person on earth.
Machine learning, and more generally data science, is one of the ways that we can make sense of all that data.
This guide to machine learning is organized as follows:
- What is Machine Learning?
- A Brief History of AI
- Why Machine Learning Matters
- The 7 Steps of Machine Learning
- Introduction to Machine Learning Algorithms
- The Main Algorithms in Machine Learning
- Training and Testing Machine Learning Models
- Machine Learning Model Evaluation Metrics
- Machine Learning Model Selection
- Summary of Machine Learning
- Machine Learning Tools & Resources
1. What is Machine Learning?
Machine learning gives us the ability to understand, interact with, and make decisions with data. Machine learning is all about teaching computers how to learn from past experiences, or data.
In the early days of computing, we had to explicitly tell machines what to do and how to do it, which inherently limited what kind of operations they could perform.
Recent advancements in the field have changed this and we can now create systems that learn what to do by sorting through huge amounts of data and finding fresh patterns and new insights.
In short, machine learning refers to teaching computers to learn from data and dynamically update their own parameters without being explicitly told what to do.
2. A Brief History of AI
Before we dive deeper, let's review a brief history of artificial intelligence to understand why machine learning is so important right now.
1950 - 1956: The Early Days
The foundations of artificial intelligence are often traced back to the computer pioneer and AI theorist Alan Turing.
During the Second World War, Turing was tasked with cracking the 'Enigma' code that Germany used to send encrypted messages. The Bombe machine that he and his colleagues ended up creating is often described as an early step on the road to machine learning.
Even in these early days, Turing was contemplating the question: can machines think?
The term 'artificial intelligence' hadn't yet been coined, but in 1950 Alan Turing developed the famous Turing Test, which...
...is a test of a machine's ability to exhibit intelligent behavior, equivalent to, or indistinguishable from, that of a human. Turing proposed that a human evaluator would judge natural language conversations between a human and a machine designed to generate human-like responses.
1956: AI is Coined
The term 'artificial intelligence' was coined by computer scientist John McCarthy at the Dartmouth Conference.
Although there was a sentiment that AI was achievable, there was a failure to agree on standard methods for the field.
The significance of this event, however, cannot be overstated, as it catalyzed the next 20 years of research.
1957 - 1974: the Golden Years
Research in AI flourished after the Dartmouth conference; it was truly an age of discovery.
Money poured into the field and people were astonished with the programs being developed, as few believed that "intelligent" behavior was at all possible for a machine.
A few notable programs include:
- Reasoning as search - achieving some goal (like beating a game) by proceeding step by step as if searching through a maze and backtracking when necessary
- Natural language - the famous ELIZA natural language processing program was created at MIT in 1964-66
- Micro-worlds - this was the idea that AI research should focus on artificially simple situations known as 'micro-worlds'
1974 - 1980: the First AI Winter
This was the first period in which AI was subject to a number of critiques and setbacks.
Essentially, researchers had underestimated just how difficult the problems they were trying to solve were.
As a few of the overly-optimistic promises failed to materialize, funding for AI research dried up.
1980 - 1987: the Second Boom
As the markets were roaring, so was AI research again.
This was primarily due to the rise of a type of AI program called "expert systems": programs that answer questions and solve problems by using logic rules that are defined by experts in the field.
Corporations around the world adopted these expert systems, and the focus of AI research was on extracting knowledge into these programs.
1987 - 1993: the Second AI Winter
With the stock market crash of '87, the business community's interest in AI fell as well, and funding dried up once again.
Just as in the first AI winter, advances were still made in this period. At the time several researchers advocated the importance of a machine having a body to interact with the world.
This new approach to AI was based on robotics, as researchers advocated building artificial intelligence "from the ground up".
1993 - 2011: Artificial Intelligence Arrives
In this period the field of AI finally achieved some of its original goals.
The technology began to be successfully used in industry, although largely behind the scenes.
Successes came from increased computing power and from focusing on specific, isolated problems.
The AI reputation in the business world wasn't what it used to be, but the field was quietly making significant advances.
A notable event of this time came in 1997, when Deep Blue became the first chess-playing computer system to win a match against Garry Kasparov, the world chess champion at the time.
2011 - Present: Deep Learning, Big Data, and AGI
Once the internet had really caught on and social media had taken off, access to 'Big Data' arrived, which...
...refers to data sets that are too large or complex for traditional data-processing application software to adequately deal with.
Along with big data, faster computers and advanced machine learning techniques were successfully applied to many different problems. Specifically, advances in Deep Learning allowed us to generate much more complex models compared to previous algorithms. State-of-the-art deep neural networks have been shown to rival human accuracy in certain fields.
A notable event in this period was the victory of DeepMind's AlphaGo against world champion Lee Sedol in 2016.
Artificial General Intelligence (AGI)
The next major milestone, which we haven't achieved yet, is AGI, also referred to as "strong AI", which refers to...
the intelligence of a machine that could successfully perform any intellectual task that a human being can.
That discussion is a topic for another article though.
3. Why Machine Learning Matters
Now that we have an overview of the history of AI, let's discuss why machine learning is so important.
It's important to discuss this in order to better understand the intrinsic value the field can bring to businesses. In a nutshell, machine learning matters because it gives us a process for creating solutions to extremely complex problems.
With the amount of data being created every second of every day, it's just not feasible to answer these questions by manually analyzing the data, or to manually specify exactly how a program should solve a certain problem.
The field of machine learning provides the necessary tools to answer questions and make decisions with our data that are:
- Automatic - we're creating automated processes for learning and creating useful algorithms with our data
- Efficient - after we've trained our model, machine learning can save us a huge amount of time that would have been spent manually classifying our data (for example, directing emails to the appropriate department)
- Accurate - machine learning has been shown to outperform us in many specific, repetitive tasks. The models can be trained on much more data, and can always be running.
- Scalable - if we're talking about manually classifying our data, sometimes this just is not feasible because there is so much. Machine learning can provide us with a solution to these cases.
Now that we know why machine learning matters, let's look at the 7 steps of machine learning.
4. The 7 Steps of Machine Learning
As we have seen, machine learning has granted computers entirely new capabilities.
But what are the actual steps to perform machine learning tasks?
Let's take it step-by-step with an example.
Let's say we've been tasked to classify images of flowers into the appropriate species.
This image classification system we build is called a model.
This model is created through a process called training.
The goal of machine learning is then to create an accurate model that correctly classifies the flowers (most of the time).
In order to train our model we need data that we can use for training, which leads us to step 1.
Step 1: Gather the Data
This step is very important because the quality and quantity of data will directly determine how good our predictive model will be.
Luckily, this example has one of the most common datasets available: the Iris Data Set from the UCI Machine Learning Repository. In this case, our data is a labelled dataset. Strictly speaking, the Iris data set contains measurements rather than images: each flower is described by four numeric features (sepal length, sepal width, petal length, and petal width) together with its species label.
Step 2: Data Preparation
Now that we have a dataset, we need to load it into a suitable environment to prepare it for training our machine learning model. To do this we bring all the data into one place and randomize its order, since the order of the flowers is not relevant to determining their species.
This is also a good time to use data visualization techniques in order to get familiar with our data and see if there are any relevant relationships between our variables.
We then need to split our dataset into two parts. This first part is used for training our model, and will make up the largest portion of the dataset. The second part is used for evaluating our trained model's performance.
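As a rough sketch of this preparation step, here is one way to load, shuffle, and split the data with scikit-learn. It assumes the copy of the Iris data that ships with scikit-learn (`load_iris`), which holds numeric measurements rather than images; the 80/20 split and the `random_state` value are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the Iris data bundled with scikit-learn:
# 150 flowers, 4 numeric features each, 3 species.
iris = load_iris()
X, y = iris.data, iris.target

# train_test_split shuffles the rows by default.
# random_state makes the shuffle repeatable (the value 0 is arbitrary);
# stratify=y keeps the species proportions equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)
```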
Step 3: Choose a Model
There are many different models that have been created over the years. Each model is generally well suited for a particular data type - such as image data, sequenced data, text data, or numerical data.
The Iris data set itself is tabular: each flower is described by four numeric measurements rather than by pixels, so a simple classifier is enough for it. If we were working with actual photographs of flowers, we'd look at convolutional neural network models, since they are typically well suited for image data.
Step 4: Train the Model
In this step we use the training data to incrementally improve the model's ability to correctly predict the flower species. If you want to learn more about the specifics of how convolutional neural networks train models, check out our Guide to Convolutional Neural Networks.
As you can imagine, the model does not do very well at first. As we compare the model's output with the output it should have produced and adjust the model's parameters, we can improve accuracy over time.
With each iteration, the model updates its weights and biases. A single update is referred to as one training step, and one full pass through the training data is referred to as one epoch.
Step 5: Model Evaluation
Once we complete training the model, we need to evaluate it.
We use the portion of our dataset that we set aside earlier, which allows us to test how our model might perform on unseen data. The evaluation data is meant to be representative of how the model will perform in the real-world.
Generally a dataset will be split into either 70-80% for training and 20-30% for evaluation, depending on the size of the original dataset.
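To make the train-then-evaluate loop concrete, here is a minimal sketch using scikit-learn. It assumes the tabular Iris data from `load_iris` and swaps in a logistic regression classifier as a simple stand-in for the neural network discussed above; the split ratio and `max_iter` value are illustrative choices, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 20% of the data for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Step 4: train the model on the training portion only.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Step 5: evaluate on data the model has never seen.
train_accuracy = model.score(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(f"train accuracy: {train_accuracy:.2f}, test accuracy: {test_accuracy:.2f}")
```

A large gap between the two numbers would suggest the model is memorizing the training data rather than generalizing.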
Step 6: Hyperparameter Tuning
After evaluating our trained model we then want to see if we can further improve the model by tuning our model's hyperparameters.
One example of tuning the hyperparameters is changing the number of epochs we use to train our model. This means we show the training data to the model a different number of times.
Another hyperparameter we can tune is the learning rate. The learning rate defines how far we shift our model's weights at each step based on the feedback from the previous training step.
Tuning hyperparameters is very much an art rather than a science and should be treated as an experimental process.
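Here is a small sketch of what this experimentation could look like. It assumes scikit-learn's `SGDClassifier` as a stand-in model because it exposes both knobs directly (`max_iter` for the number of epochs, `eta0` for a constant learning rate); the candidate values and the 75/25 split are arbitrary and only meant to illustrate the process.

```python
from itertools import product

from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Try every combination of epochs and learning rate (values are arbitrary).
results = {}
for epochs, learning_rate in product([5, 50], [0.001, 0.01, 0.1]):
    model = make_pipeline(
        StandardScaler(),  # SGD is sensitive to feature scale
        SGDClassifier(
            learning_rate="constant",
            eta0=learning_rate,
            max_iter=epochs,
            tol=None,  # disable early stopping so all epochs are run
            random_state=0,
        ),
    )
    model.fit(X_train, y_train)
    results[(epochs, learning_rate)] = model.score(X_val, y_val)

best = max(results, key=results.get)
print("best (epochs, learning rate):", best, "accuracy:", results[best])
```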
Step 7: Prediction
It's finally time to use our model to do something useful and perform the intended task. Machine learning is using data to answer questions, which in this example is: what type of flower species does a given image contain?
Prediction, or inference, is the step where we answer our questions. This is where the value of machine learning is realized. The power of machine learning is that we were able to train a model to classify image data, without needing an expert in botany to use their judgement.
The same principles we've just covered apply to other questions we want answered.
To summarize, the 7 steps of machine learning are:
- Gathering data
- Preparing the data
- Choosing a model
- Training the model
- Evaluating the model
- Hyperparameter tuning
- Prediction
5. Introduction to Machine Learning Algorithms
Before we get into specific machine learning algorithms, let's take a look at algorithms grouped by their learning style.
The 3 learning algorithm styles include:
- Supervised Learning
- Unsupervised Learning
- Reinforcement Learning
There are many different ways an algorithm can model a problem, but in order to organize them we need to understand how we are going to deal with our input data.
1. Supervised Learning
The key aspect of supervised learning is that our input data contains one important feature: labels. Examples of labels could be whether an email is spam or not, whether a picture is a cat or dog, or a stock price at a particular time.
With supervised learning we create our model using a training process in which we provide training data and testing data. In the training process the learning algorithm makes predictions, refers to the labels, and corrects itself when those predictions are wrong.
This iterative training process continues until the model has achieved a satisfactory level of performance.
2. Unsupervised Learning
In an unsupervised learning problem our input data is not labeled and doesn't have a known result. We use unsupervised learning techniques in order to try and identify structures in our input data.
We then observe and learn from the patterns that the algorithm identifies, which allows us to visualize groups of data points. The goal could be to organize data, to find similarities and differences in our data, or to find new patterns in our unlabelled data.
Examples of unsupervised learning are clustering, dimensionality reduction, and anomaly detection.
3. Reinforcement Learning
Reinforcement learning sits somewhere in between supervised and unsupervised learning.
From our Guide to Reinforcement Learning:
In Reinforcement Learning, we have time-delayed labels that are sparse. From these labels, which we can call rewards, we can learn to operate in this uncertain environment.
Instead of predicting known values (in the case of supervised learning), or looking for patterns (in unsupervised learning):
The goal of reinforcement learning is to choose the optimal action which will maximize the long-term expected reward provided by the environment.
6. The Main Algorithms in Machine Learning
Now that we know what machine learning is at a high level, and understand the 7 steps of machine learning, let's look at a few of the most popular algorithms that are used.
A decision tree uses a flowchart-like model of decisions in which each internal node is a test on an attribute (e.g. whether a coin flip comes up heads or tails), each branch represents the outcome of the test, and each leaf node is a class label.
The branches can also contain the probability of event outcomes.
There won't always be a decision tree that perfectly fits our data, but we can use machine learning to find the best fitting tree for a given table of data.
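As an illustration, here is a minimal sketch that fits a shallow decision tree with scikit-learn and prints its flowchart of tests. It assumes the Iris data from `load_iris`; the depth limit of 2 is an arbitrary choice to keep the printed tree small.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# Limit the depth so the learned flowchart stays easy to read.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Each indented line is a test on one feature; the "class" lines are leaves.
print(export_text(tree, feature_names=iris.feature_names))
```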
Here's the definition from Wikipedia:
Naive Bayes is a simple technique for constructing classifiers: models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from some finite set.
An example of the Naive Bayes classifier algorithm in machine learning is for detecting spam emails. The classifier looks at the content of the email and determines the probability of it being spam - for example if it contains the word "cheap", it has a much higher likelihood of being spam.
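The spam example can be sketched in a few lines. The six emails below are made up for illustration, and the choice of `CountVectorizer` with `MultinomialNB` is one common pairing in scikit-learn, not the only way to build such a filter.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny hand-written training set (hypothetical data).
emails = [
    "cheap pills buy now",
    "cheap watches limited offer",
    "win money now cheap deal",
    "meeting agenda for monday",
    "project report attached",
    "lunch on friday with the team",
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = spam, 0 = not spam

# Turn each email into word counts, then fit Naive Bayes on those counts.
classifier = make_pipeline(CountVectorizer(), MultinomialNB())
classifier.fit(emails, labels)

spam_probability = classifier.predict_proba(["cheap deal now"])[0][1]
ham_prediction = classifier.predict(["project meeting on monday"])[0]
print(f"P(spam | 'cheap deal now') = {spam_probability:.2f}")
```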
Gradient descent is an iterative algorithm that finds a local minimum of a function.
Linear regression is an approach to modeling the relationship between a dependent variable (or scalar response) and one or more independent (explanatory) variables. You can think of linear regression like a painter who tries to draw a best fit line given a set of data points.
So how does a computer find this line?
It starts by drawing a random line and measuring how bad it is. To see how bad the line is we calculate the error: we look at the distance from the line to each data point, and the error of the line is the sum of these distances. By moving the line around we can reduce the error; if the error increases, we know that the new line is worse. We continue this procedure until the error is minimized. When each move is made in the direction that reduces the error fastest, this procedure is known as gradient descent.
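Here is a minimal sketch of that procedure in NumPy. A few assumptions to flag: the data is synthetic (a noisy line with slope 3 and intercept 2), the error is the mean of the squared vertical distances rather than a plain sum of distances, and the line starts flat at zero instead of at a random position. The learning rate and step count are hand-picked for this data.

```python
import numpy as np

# Synthetic data: points scattered around the line y = 3x + 2.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 3.0 * x + 2.0 + rng.normal(0, 0.5, size=100)

slope, intercept = 0.0, 0.0  # initial guess for the line
learning_rate = 0.01

for step in range(5000):
    # Signed vertical distance from the current line to every point.
    error = (slope * x + intercept) - y
    # Gradient of the mean squared error with respect to each parameter.
    grad_slope = 2 * np.mean(error * x)
    grad_intercept = 2 * np.mean(error)
    # Move the line a small step in the direction that reduces the error.
    slope -= learning_rate * grad_slope
    intercept -= learning_rate * grad_intercept

print(f"fitted line: y = {slope:.2f}x + {intercept:.2f}")
```

The result should land close to the closed-form least-squares fit, which `np.polyfit(x, y, 1)` computes directly.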
Logistic regression uses a logistic function to explain the relationship between one dependent binary variable and one or more independent variables.
Support Vector Machines
Support vector machines (SVMs) are supervised learning models that analyze data in order to either perform classification or regression analysis.
Here's an example of why you would use it from this article:
It uses a technique called the kernel trick to transform your data and then based on these transformations it finds an optimal boundary between the possible outputs. Simply put, it does some extremely complex data transformations, then figures out how to separate your data based on the labels or outputs you've defined.
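To see why the kernel trick matters, here is a small sketch comparing a linear SVM with an RBF-kernel SVM. It assumes scikit-learn's `make_circles` toy data (one class forms a ring around the other), which no straight line can separate; the dataset parameters are arbitrary.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two classes arranged as concentric circles: not linearly separable.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# A linear boundary versus a boundary found via the RBF kernel.
linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf").fit(X, y)

linear_accuracy = linear_svm.score(X, y)
rbf_accuracy = rbf_svm.score(X, y)
print(f"linear kernel: {linear_accuracy:.2f}, RBF kernel: {rbf_accuracy:.2f}")
```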
Vaguely inspired by the biological neural networks in our brain, a neural network is not itself an algorithm, but rather a framework for machine learning algorithms to learn how to perform tasks by considering examples.
Kernel methods are a class of algorithms for pattern analysis, whose general task is to find different types of relationships in our data - for example clustering, correlations, classifications, etc.
K-means clustering is a method of vector quantization that partitions the data into k clusters, assigning each point to the cluster with the nearest mean. It is popular for cluster analysis in data mining.
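A quick sketch of K-means in scikit-learn, assuming two made-up blobs of 2D points that are far apart. Note that no labels are passed to the algorithm; it has to discover the two groups on its own.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two synthetic groups of points, centered at (0, 0) and (5, 5).
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[5, 5], scale=0.5, size=(50, 2))
points = np.vstack([blob_a, blob_b])

# Ask for k = 2 clusters; n_init restarts guard against a bad initial guess.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
labels = kmeans.labels_

print("cluster centers:")
print(kmeans.cluster_centers_)
```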
Hierarchical clustering is a method of cluster analysis that looks to build a hierarchy of clusters.
There are many more algorithms used in machine learning, but these are a few of the most common ones.
7. Training and Testing Machine Learning Models
In this section we'll describe the different metrics we can use to answer these questions:
- How good is my model?
- How do we improve the model based on these metrics?
In machine learning we have a problem to solve, which is generally related to evaluating data and making predictions. In order to solve the problem we have a few tools at our disposal. The tools are the algorithms, like the ones mentioned above.
How do we know which model will work best?
To answer that we use measurement tools.
We'll now look at how to train, test, evaluate, and validate our models in order to make the best decisions with our data.
Let's first recall the definitions of regression and classification, since we'll be using them quite a bit.
- A regression model predicts a numeric value, like 3, 0.4, you get the idea
- A classification model predicts a state, such as positive or negative, yes or no, or hotdog not hotdog
In determining the best model, we want one that generalizes well to unseen data and doesn't overfit the data.
To do this we split our dataset into two sets - the training set and the testing set. We split our data into training and testing sets for both regression and classification problems.
Here's what this looks like in Python using the popular machine learning library scikit-learn with a test size of 20%:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
8. Machine Learning Model Evaluation Metrics
After we develop a model we want to find out how well it's performing. This is a difficult question, but there are certain evaluation metrics we can use to help answer this.
First let's start with evaluation metrics that can be used for classification models.
A confusion matrix is a table layout that allows you to visualize the performance of an algorithm. From Wikipedia:
Each row of the matrix represents the instances in a predicted class while each column represents the instances in an actual class (or vice versa).
Important definitions in measuring classification algorithms are:
- True & False Positives
- True & False Negatives
From this Google Machine Learning crash course:
- A true positive is an outcome where the model correctly predicts the positive class.
- A true negative is an outcome where the model correctly predicts the negative class.
- A false positive is an outcome where the model incorrectly predicts the positive class.
- A false negative is an outcome where the model incorrectly predicts the negative class.
Here is a confusion matrix depicting the four possible outcomes, summarizing the Boy Who Cried Wolf:
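The four outcomes can also be counted in code. This sketch uses scikit-learn's `confusion_matrix` on ten made-up labels and predictions; note that scikit-learn puts actual classes in rows and predicted classes in columns.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical ground truth and predictions (1 = positive, 0 = negative).
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

# For binary labels the matrix is [[TN, FP], [FN, TP]].
matrix = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = matrix.ravel()

print(matrix)
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")  # TP=3, TN=5, FP=1, FN=1
```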
In a classification problem, accuracy is the number of correct predictions made divided by the total number of predictions made, multiplied by 100 to turn it into a percentage.
- Accuracy = Correct Predictions / Total Predictions * 100
This is a great article on classification accuracy by Machine Learning Mastery.
Accuracy can be calculated using scikit-learn with:
from sklearn.metrics import accuracy_score
Precision is the number of True Positives divided by the sum of True Positives and False Positives.
- Precision = True Positives / (True Positives + False Positives)
Recall is the number of True Positives divided by the sum of True Positives and False Negatives.
- Recall = True Positives / (True Positives + False Negatives)
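The F1 Score is as follows:
- F1 = 2 * ((Precision * Recall) / (Precision + Recall))
The F1 score is the harmonic mean of precision and recall, so it gives us a balance between the two values.
The F-beta score generalizes this by weighting recall beta times as much as precision:
- F-beta = (1 + beta^2) * ((Precision * Recall) / (beta^2 * Precision + Recall))
If beta = 0, we get precision, and as beta goes to infinity we get recall.
For other values of beta:
- If beta is close to 0 we get closer to precision
- If beta is large we get closer to recall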
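As a worked example, here is a sketch that computes these classification metrics with scikit-learn on ten made-up predictions (3 true positives, 2 false positives, 1 false negative, 4 true negatives). The beta values 0.5 and 2 are arbitrary picks to show the pull toward precision and recall respectively.

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    fbeta_score,
    precision_score,
    recall_score,
)

# Hypothetical labels: TP=3, FN=1, FP=2, TN=4.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

accuracy = accuracy_score(y_true, y_pred)      # (3 + 4) / 10 = 0.7
precision = precision_score(y_true, y_pred)    # 3 / (3 + 2) = 0.6
recall = recall_score(y_true, y_pred)          # 3 / (3 + 1) = 0.75
f1 = f1_score(y_true, y_pred)                  # 2 * 0.45 / 1.35 ≈ 0.667
f_half = fbeta_score(y_true, y_pred, beta=0.5)  # 0.625, nearer to precision
f_two = fbeta_score(y_true, y_pred, beta=2)     # ≈ 0.714, nearer to recall

print(accuracy, precision, recall, f1, f_half, f_two)
```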
The ROC curve - or Receiver Operating Characteristic curve - is a graphical plot that illustrates the true positive rate against the false positive rate at various thresholds.
Here's an example of a ROC curve from Wikipedia:
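The points on a ROC curve can be computed with scikit-learn's `roc_curve`, and the area under the curve with `roc_auc_score`. The four labels and scores below are a made-up toy example; in practice the scores would come from a model's predicted probabilities.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and model scores for the positive class.
y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]

# One (false positive rate, true positive rate) pair per threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
auc = roc_auc_score(y_true, y_scores)

for f, t in zip(fpr, tpr):
    print(f"FPR={f:.2f}, TPR={t:.2f}")
print(f"AUC = {auc:.2f}")  # AUC = 0.75
```

An AUC of 0.5 corresponds to random guessing and 1.0 to a perfect ranking of positives above negatives.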
We won't cover them in detail in this article, but here are several metrics that can be used to evaluate regression models.
Mean absolute error measures the average absolute difference between the predicted values and the actual values.
- We can use the mean_absolute_error function from sklearn to calculate this.
Mean squared error measures the average of the squares of the errors. It is the average squared difference between the estimated values and the actual values.
- We can use the mean_squared_error function from sklearn to calculate this.
The coefficient of determination - denoted R² or r² - is the proportion of the variance in the dependent variable that is predictable from the independent variable(s).
- We can use the r2_score function from sklearn to calculate this.
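Here is a short sketch that applies all three regression metrics to four made-up predictions, so the numbers can be checked by hand.

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual values and model predictions.
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

mae = mean_absolute_error(y_true, y_pred)  # (0.5 + 0.5 + 0 + 1) / 4 = 0.5
mse = mean_squared_error(y_true, y_pred)   # (0.25 + 0.25 + 0 + 1) / 4 = 0.375
r2 = r2_score(y_true, y_pred)              # 1 - 1.5 / 29.1875 ≈ 0.949

print(f"MAE={mae}, MSE={mse}, R2={r2:.3f}")
```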
9. Machine Learning Model Selection
Types of Errors
In machine learning, there are two main types of errors we make:
- Oversimplifying the problem we're trying to solve, which is referred to as underfitting
- Overcomplicating the problem we're trying to solve, referred to as overfitting
Let's look at each of these in a bit more detail.
- One characteristic of underfitting a classification problem is the model not doing well on the training set.
- We call this type of error an error due to bias.
- One characteristic of overfitting is that the model does well on the training set, but it basically memorizes the training data instead of learning its general characteristics.
- We call this type of error an error due to variance.
How do we detect errors?
One way is with a model complexity graph. Here is a great article on the subject, which describes it as a graph that:
Compares the training errors and the cross-validation errors in order to measure if a certain model either overfits or underfits the dataset that it has been exposed to.
From the same article, here's what building a model complexity graph looks like:
On the X axis we start with a linear model and in this case go up to a Polynomial of degree 14.
In this example the polynomial of degree 14 is overfitting, the linear model is underfitting, and a polynomial of intermediate degree seems to be a good fit.
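The numbers behind a model complexity graph can be sketched with NumPy alone. This is not the data from the article referenced above: it assumes a synthetic noisy sine curve, a 40/20 train/validation split, and three hand-picked degrees (1, 3, and 14). Plotting the two error columns against degree would give the graph.

```python
import numpy as np

# Synthetic data: a noisy sine wave on [-1, 1].
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=60)
y = np.sin(3 * x) + rng.normal(0, 0.2, size=60)

x_train, y_train = x[:40], y[:40]
x_val, y_val = x[40:], y[40:]

def mse(coeffs, xs, ys):
    """Mean squared error of a fitted polynomial on the given points."""
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))

# For each degree, fit on the training set and score on both sets.
errors = {}
for degree in [1, 3, 14]:
    coeffs = np.polyfit(x_train, y_train, degree)
    errors[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_val, y_val))

for degree, (train_error, val_error) in errors.items():
    print(f"degree {degree:2d}: train={train_error:.3f}, validation={val_error:.3f}")
```

Training error can only go down as the degree grows; the validation error is what reveals when the extra complexity stops helping.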
Instead of just having a training and testing set for our data, we can also add a cross validation set. Now...
- The training set will be used for training the parameters
- The cross validation set will be used for making decisions about the model, such as the degree of the Polynomial
- The testing set will be used for the final testing of the model
A useful method to recycle our data is called K-Fold Cross Validation.
As we know, we split our data into training and testing sets...but this isn't always ideal because we could be throwing away the data in the test set that could be useful for training our algorithm.
So is there a way to not 'throw away' this test data?
This is where K-Fold Cross Validation comes in, what we do is:
- Split our data into k buckets (folds)
- Then we train our model k times, each time using a different bucket as our testing set and the remaining data as our training set
- We then average the k results to get a final performance estimate for the model.
The class to do this in sklearn is KFold, imported with from sklearn.model_selection import KFold, and this is a great article on how to apply it to an example.
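A minimal sketch of K-fold cross-validation with scikit-learn follows. It assumes the Iris data and a logistic regression model as stand-ins, and k = 5 is simply a common default; `cross_val_score` handles the train-k-times loop described above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)

# Split the data into 5 shuffled buckets.
folds = KFold(n_splits=5, shuffle=True, random_state=0)

# Train 5 times; each bucket serves as the testing set exactly once.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=folds)
mean_score = scores.mean()

print("accuracy per fold:", scores)
print(f"mean accuracy: {mean_score:.2f}")
```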
Another way to discover overfitting, underfitting, and a good fitting model is with a learning curve.
You can find a comprehensive guide on Learning Curves for Machine Learning in this article by DataQuest.
Here's a summary of what we do in machine learning:
- We train a bunch of models with our training data
- We then use our cross-validation data to find the best of these models
- Then we test it with our testing data to make sure the model is good
In terms of tuning/optimizing the hyperparameters of our models there are many algorithms we can use, although one of the simple strategies is called grid search.
As this article on Hyperparameter Optimization from DataCamp describes:
Grid search is an approach to hyperparameter tuning that will methodically build and evaluate a model for each combination of algorithm parameters specified in a grid.
We can use grid search in sklearn with from sklearn.model_selection import GridSearchCV.
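Here is a small sketch of grid search in practice. It assumes the Iris data and a support vector classifier as the model being tuned; the grid of `C` values and kernels is an arbitrary example, and each of the 6 combinations is scored with 5-fold cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Every combination in the grid is built and evaluated: 3 x 2 = 6 candidates.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print("best hyperparameters:", search.best_params_)
print(f"best cross-validated accuracy: {search.best_score_:.2f}")
```

Because the number of candidates multiplies with every hyperparameter added, grid search gets expensive quickly on larger grids.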
10. Summary: What is Machine Learning?
This is how we look at machine learning:
- We have a problem to solve, which is usually data we need to classify
- We have tools to fix it, which are the machine learning algorithms like logistic regression, neural networks, support vector machines, decision trees, etc.
- We have measurement tools which we use to measure each tool's performance for the given problem like Model Complexity Graphs, Accuracy, Precision, Recall, F1 Score, Learning Curves, etc.
After we've found the best one, that's what we use to model our data and make predictions.
11. Machine Learning Tools & Resources
Now that you have a foundational framework of machine learning, here is a curated list of resources that have helped me on the journey to becoming a Machine Learning Engineer:
- Reinforcement Learning: An Introduction, Richard S. Sutton & Andrew G. Barto
- Monetizing Machine Learning, Amunategui & Roopaei
- Machine Learning Engineer Nanodegree: Udacity
- Deep Learning Specialization: deeplearning.ai on Coursera
- Mathematics for Machine Learning Specialization: Imperial College London
Blogs & Tutorials