AI and machine learning are rapidly becoming an essential part of our daily lives, even if many people have no idea they're interacting with the technology.
In this guide, we'll discuss exactly what machine learning means, a brief history of AI, and a general 7-step framework for machine learning projects.
Before we get into machine learning, the first thing to grasp is just how much data is generated each and every day. As the data science company DOMO reports:
Over 2.5 quintillion bytes of data are created every single day, and it’s only going to grow from there.
Machine learning, and more generally data science, is one of the ways that we can make sense of all that data.
This article is based on notes from Udacity's Machine Learning Engineer Nanodegree and is organized as follows:
- What is Machine Learning?
- A Brief History of AI
- The 7 Steps of Machine Learning
- Introduction to Machine Learning Algorithms
- The Main Algorithms in Machine Learning
- Training and Testing Machine Learning Models
- Machine Learning Model Evaluation Metrics
- Machine Learning Model Selection
- Summary of Machine Learning
- Machine Learning Tools & Resources
1. What is Machine Learning?
Machine learning gives us the ability to understand, interact with, and make decisions with data. Machine learning is all about teaching computers how to learn from past experiences, or how to learn from data.
In the early days of computing, programmers had to explicitly tell machines what to do and how to do it, which inherently limited what kind of operations they could perform.
Recent advancements in the field of machine learning have changed this and we can now create systems that learn what to do by sorting through huge amounts of data, finding patterns, and extracting insights.
Machine learning is the process of teaching computers to learn from data and dynamically update its own parameters without being explicitly told what to do.
2. A Brief History of AI
Before we dive deeper, let's review a brief history of artificial intelligence to understand why machine learning is so important right now.
1950 - 1956: The Early Days
The foundations of artificial intelligence are often traced back to the computer pioneer and AI theorist Alan Turing.
During the Second World War, Turing was tasked with cracking the 'Enigma' code that Germany used to send encrypted messages. The Bombe machine that he and his colleagues at Bletchley Park ended up creating essentially laid the foundations for machine learning.
Even in these early days, Turing was contemplating the question: can machines think?
The term "artificial intelligence" hadn't even yet been coined at this time, but in 1950 Alan Turing developed the famous Turing Test, which:
...is a test of a machine's ability to exhibit intelligent behavior, equivalent to, or indistinguishable from, that of a human. Turing proposed that a human evaluator would judge natural language conversations between a human and a machine designed to generate human-like responses.
1956: The Term AI is Coined
The term "artificial intelligence" was coined by computer scientist John McCarthy at the Dartmouth Conference.
Although there was a sentiment that AI was achievable, there was a failure to agree on standard methods for the field.
The significance of this event, however, cannot be overstated, as it catalyzed the next 20 years of research.
1957 - 1974: The Golden Years
Research in AI flourished after the Dartmouth conference and it was truly an age of discovery.
Money poured into the field and people were astonished with the programs being developed, as few believed that "intelligent" behavior was at all possible for a machine.
A few notable programs include:
- Reasoning as search: This refers to achieving some goal (like winning a game) by proceeding step-by-step as if searching through a maze and backtracking when necessary
- Natural language: The famous ELIZA natural language processing program was created at MIT in 1964-66
- Micro-worlds: This was the idea that AI research should focus on artificially simple situations known as "micro-worlds"
1974 - 1980: the First AI Winter
This was the first period in which AI was subject to a number of critiques and setbacks.
Essentially, researchers had underestimated just how difficult the problems they were trying to solve were.
As a few of the overly optimistic promises failed to materialize, funding for AI research dried up.
Despite these setbacks, there were still advances in logic programming, commonsense reasoning, and other areas.
1980 - 1987: The Second AI Boom
As the markets were roaring, so too was AI research once again.
This was primarily due to the rise of a type of AI program called "expert systems", which answer questions and solve problems by using logic rules that are defined by experts in the field.
Corporations around the world adopted these expert systems, and the focus of AI research was on extracting knowledge into these programs.
1987 - 1993: The Second AI Winter
With the crash of '87, the business community's interest in AI also fell and funding dried up once again.
Just as in the first AI winter, advances were still made in this period. At the time several researchers advocated the importance of a machine having a body to interact with the world.
This new approach to AI was based on robotics, as researchers advocated building artificial intelligence "from the ground up".
1993 - 2011: Artificial Intelligence Arrives
In this period the field of AI finally achieved some of its original goals.
The technology began to be successfully applied in industry, although largely behind the scenes.
Successes came from increased computing power and from focusing on specific, isolated problems.
The AI reputation in the business world wasn't what it used to be, but the field was quietly making significant advances.
A notable event of this time was Deep Blue becoming the first chess-playing computer system to beat Garry Kasparov, the world chess champion at the time.
2011 - Present: Deep Learning, Big Data, and AGI
Once the internet had really caught on and social media had risen, we gained access to 'Big Data', which:
...refers to data sets that are too large or complex for traditional data-processing application software to adequately deal with.
Along with big data, faster computers and advanced machine learning techniques were successfully applied to many different problems. Specifically, advances in Deep Learning allowed us to generate much more complex models compared to previous algorithms.
State-of-the-art deep neural networks have been shown to rival human accuracy in certain fields.
A notable event in this period was the victory of DeepMind's AlphaGo against world champion Lee Sedol.
Artificial General Intelligence (AGI)
The next major milestone we haven't achieved yet is AGI, also referred to as "strong AI", which refers to...
the intelligence of a machine that could successfully perform any intellectual task that a human being can.
While there is still a lot of debate about when we may achieve AGI, this achievement will undoubtedly change our lives forever.
3. Why Machine Learning Matters
Now that we have an overview of the history of AI, let's discuss why machine learning is so important.
It's important to discuss this in order to better understand the intrinsic value the field can bring to businesses. In a nutshell, machine learning matters because it gives us a process for creating solutions to extremely complex problems.
With the amount of data being created every second of every day, it's just not feasible to solve these problems by manually analyzing the data, or by manually specifying exactly how a program should solve them.
The field of machine learning provides the necessary tools to answer questions and make decisions with our data that are:
- Automatic: We're creating automated processes for learning and creating useful algorithms with our data
- Efficient: After we've trained our model, machine learning can save us a huge amount of time that would have been spent manually classifying our data (for example, directing emails to the appropriate department)
- Accurate: Machine learning has been shown to outperform us in many specific, repetitive tasks. The models can be trained on much more data, and can be always running.
- Scalable: If we're talking about manually classifying our data, sometimes this just is not feasible because there is far too much. Machine learning can provide us with a solution to these use cases.
Now that we know why machine learning matters, let's look at the 7-step framework for machine learning projects.
4. The 7 Steps of Machine Learning
As we have seen, machine learning has granted computers entirely new capabilities.
But what are the actual steps to complete a machine learning project or task?
Let's take it step-by-step with an example.
Let's say we've been tasked to classify images of flowers into the appropriate species.
This image classification system we build is called a model.
This model is refined through a process called training.
The goal of machine learning is then to create an accurate model that correctly classifies the flowers.
In order to train our model we need data that we can use for training, which leads us to step 1...
Step 1: Gather the Data
This first step is of utmost importance because the quality and quantity of data will directly determine how good our predictive model will be.
Luckily, this example has one of the most common datasets available: the Iris Data Set from the UCI Machine Learning Repository. In this case, our data is a labeled dataset, meaning each flower comes with its known species.
This dataset uses several features to classify the flowers; two examples of these features are petal length and petal width.
Step 2: Data Preparation
Now that we have a dataset, we need to load it into a suitable environment to prepare it for training our machine learning model. To do this, we bring all the data into one place and randomize its order, since the order of the examples is not relevant to determining their species.
This is also a good time to use data visualization techniques in order to get familiar with our data and see if there are any relevant relationships between our variables.
We then need to split our dataset into two parts.
This first part is used for training our model and will make up the largest portion of the dataset.
The second part is used for evaluating our trained model's performance.
Step 3: Choose a Model
There are many different models that have been created over the years. Each model is generally well suited for a particular data type such as image data, sequenced data, text data, or numerical data.
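Note that the Iris dataset itself contains numerical measurements of each flower (sepal and petal lengths and widths) rather than images, so a simple classifier is well suited to it. If we were working with actual flower images, we would look at convolutional neural network models, since they are well suited for image data.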
Step 4: Train the Model
In this step, we use the training data to incrementally improve the model's ability to correctly predict the flower species. If you want to learn more about the specifics of how convolutional neural networks are trained, check out our Guide to Convolutional Neural Networks.
As you can imagine, the model does not do very well at first. As we compare the model's output with the output it should have produced and adjust the model's parameters, we can improve accuracy over time.
Each time the model updates its weights and biases is referred to as one training step, and one full pass over the training data is referred to as one epoch.
Step 5: Model Evaluation
Once we complete training the model, we need to evaluate it.
We use the portion of our dataset that we set aside earlier, which allows us to test how our model might perform on unseen data. The evaluation data is meant to be representative of how the model will perform in the real world.
Generally, a dataset will be split into 70-80% for training and 20-30% for evaluation, depending on the size of the original dataset.
Step 6: Hyperparameter Tuning
After evaluating our trained model we then want to see if we can further improve the model by tuning our model's hyperparameters.
One example of tuning hyperparameters is changing the number of epochs we use to train our model. This means we show the training data to the model a different number of times.
Another hyperparameter we can tune is the learning rate.
The learning rate defines how far we shift our model's weights at each step based on the feedback from the previous training step.
Tuning hyperparameters is very much an art rather than a science and should be treated as an experimental process.
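To get a feel for how these two settings interact, here is a minimal sketch on a toy problem. The loss function `(w - 3)**2`, the `train` helper, and the specific values tried are all assumptions made for illustration; since there is no dataset here, each epoch is just a single weight update.

```python
def train(learning_rate, epochs):
    """Minimise the toy loss (w - 3)**2 with plain gradient descent."""
    w = 0.0  # initial weight
    for _ in range(epochs):
        gradient = 2 * (w - 3)         # derivative of the loss with respect to w
        w -= learning_rate * gradient  # the learning rate scales each update
    return w

# Try a small grid of hyperparameter values and compare the final weight
for learning_rate in (0.01, 0.1, 1.1):
    for epochs in (10, 100):
        final_w = train(learning_rate, epochs)
        print(f"learning_rate={learning_rate:<5} epochs={epochs:<4} final w={final_w:.4f}")
```

In this toy setting the best weight is 3. A learning rate that is too small needs many epochs to get there, while one that is too large overshoots and moves further away with every step.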
Step 7: Prediction
It's finally time to use our model to do something useful and perform the intended task. Machine learning is about using data to answer questions, which in this example is: which species of flower does a given image show?
Prediction, or inference, is the step where we answer our questions. This is where the value of machine learning is realized. The power of machine learning is that we were able to train a model to classify image data, without needing an expert in botany to use their judgment.
The same principles we've just covered apply to other questions we want to answer.
To summarize, the 7 steps of machine learning include:
- Gathering data
- Preparing the data
- Choosing a model
- Training the model
- Evaluating the model
- Hyperparameter tuning
- Prediction
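The steps above can be sketched end to end in a few lines. This is only a rough illustration under stated assumptions: it uses the tabular Iris measurements bundled with scikit-learn rather than flower images, swaps in a decision tree in place of a convolutional neural network, and tunes a single hyperparameter (`max_depth`) on a validation split that we carve out ourselves.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Step 1: gather the data (150 labelled flowers, 4 numeric measurements each)
iris = load_iris()
X, y = iris.data, iris.target

# Step 2: prepare the data - shuffle, then hold out 20% for the final evaluation
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
# A further slice of the remaining data is kept aside for tuning (step 6)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0
)

# Steps 3, 4 and 6: choose a model, train it, and try a few hyperparameter values
best_model, best_val_accuracy = None, -1.0
for max_depth in (1, 2, 3, 4):
    model = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    model.fit(X_train, y_train)
    val_accuracy = accuracy_score(y_val, model.predict(X_val))
    if val_accuracy > best_val_accuracy:
        best_model, best_val_accuracy = model, val_accuracy

# Step 5: evaluate the chosen model on data it has never seen
test_accuracy = accuracy_score(y_test, best_model.predict(X_test))
print(f"test accuracy: {test_accuracy:.2f}")

# Step 7: predict the species of a new, hypothetical flower
new_flower = [[5.1, 3.5, 1.4, 0.2]]  # sepal length, sepal width, petal length, petal width (cm)
predicted_species = iris.target_names[best_model.predict(new_flower)[0]]
print(f"predicted species: {predicted_species}")
```

The test set is only touched once, at the very end, so the reported accuracy is not influenced by the tuning loop.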
5. Introduction to Machine Learning Algorithms
Before we get into specific machine learning algorithms, let's take a look at algorithms grouped by their learning style.
The 3 main families of learning algorithms include:
- Supervised Learning
- Unsupervised Learning
- Reinforcement Learning
There are many different ways an algorithm can model a problem, but in order to organize them, we need to understand how we are going to deal with our input data.
The key aspect of supervised learning is that our input data contains one important feature: labels.
Examples of labels could be if an email is spam or not, whether a picture is a cat or dog, or a stock price at a particular time.
With supervised learning, we create our model using a training process where we provide training data and testing data.
In the training process, the learning algorithm makes predictions, compares them with the labels, and corrects itself when those predictions are wrong.
This iterative training process continues until the model has achieved a satisfactory level of performance.
In an unsupervised learning problem, our input data is not labeled and doesn't have a known result.
We use unsupervised learning techniques in order to try and identify structures in our input data.
We then observe and learn from the patterns that the algorithm identifies, which allows us to visualize groups of data points. This could be to organize data, find similarities and differences in our data, or find new patterns in our unlabelled data.
Examples of unsupervised learning are clustering, dimensionality reduction, and anomaly detection.
Reinforcement learning sits somewhere in between supervised and unsupervised learning. As described in our Guide to Reinforcement Learning:
In reinforcement learning, we have time-delayed labels that are sparse. From these labels, which we can call rewards, we can learn to operate in this uncertain environment.
Instead of predicting known values (in the case of supervised learning), or looking for patterns (in unsupervised learning):
The goal of reinforcement learning is to choose the optimal action which will maximize the long-term expected reward provided by the environment.
6. The Main Algorithms in Machine Learning
Now that we know what machine learning is at a high level, and understand the 7 steps of machine learning, let's look at a few of the most popular algorithms that are used in the field.
A decision tree uses a flowchart-like model of decisions in which each internal node is a test on an attribute (e.g. whether a coin flip comes up heads or tails), each branch represents the outcome of the test, and each leaf node is a class label.
The branches can also contain the probability of event outcomes.
There won't always be a decision tree that perfectly fits our data, but we can use machine learning to find the best fitting tree for a given table of data.
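As a small illustration of finding a tree for a table of data, here is a sketch using scikit-learn's `DecisionTreeClassifier`. The six-row table and its feature names are made up for the example.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy table: [petal_length_cm, petal_width_cm] -> species label
X = [[1.4, 0.2], [1.3, 0.2], [4.7, 1.4], [4.5, 1.5], [6.0, 2.5], [5.9, 2.1]]
y = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Print the learned tests (internal nodes) and class labels (leaves) as text
print(export_text(tree, feature_names=["petal_length_cm", "petal_width_cm"]))

# Classify two new, unseen flowers by walking down the tree
predictions = tree.predict([[1.5, 0.3], [5.8, 2.2]])
print(predictions)
```

The printed tree reads like a flowchart: each line with a threshold is a test on one attribute, and each `class:` line is a leaf.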
Another popular algorithm is Naive Bayes. Here's the definition from Wikipedia:
Naive Bayes is a simple technique for constructing classifiers: models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from some finite set.
An example of the Naive Bayes classifier algorithm in machine learning is for detecting spam emails.
The classifier looks at the content of the email and determines the probability of it being spam—for example, if it contains the word "cheap", it has a much higher likelihood of being spam.
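Here is a minimal sketch of that spam example, assuming a bag-of-words representation (`CountVectorizer`) and scikit-learn's `MultinomialNB`. The six training emails are invented for illustration, and a real filter would need far more data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training emails: label 1 = spam, label 0 = not spam
emails = [
    "cheap watches buy now",
    "cheap pills limited offer",
    "win money now cheap",
    "meeting agenda for monday",
    "lunch tomorrow with the team",
    "project status report attached",
]
labels = [1, 1, 1, 0, 0, 0]

# Turn each email into a vector of word counts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

classifier = MultinomialNB().fit(X, labels)

# Estimate the probability of spam for two new emails
new_emails = ["cheap offer now", "status of the monday meeting"]
spam_probabilities = classifier.predict_proba(vectorizer.transform(new_emails))[:, 1]
for email, probability in zip(new_emails, spam_probabilities):
    print(f"{email!r}: P(spam) = {probability:.2f}")
```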
Gradient descent is an iterative algorithm for finding a local minimum of a function.
Linear regression is an approach to modeling the relationship between a dependent variable (or scalar response) and one or more independent (explanatory) variables. You can think of linear regression like a painter who tries to draw the best fit line given a set of data points.
So how does a computer find this line?
It starts by drawing a random line and measuring how bad it is. In order to see how bad the line is we calculate the error.
We look at the distance from the line to each data point, and the error of the line is the sum of these distances. By moving the line around we can reduce the error. If the error increases, we know that the new line is worse.
We continue this procedure until we minimize the error, and this is known as gradient descent.
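The procedure above can be sketched with NumPy. Note the assumptions: the data is synthetic, the error used here is the mean squared distance rather than a plain sum of distances (a common choice because it is easy to differentiate), and the learning rate and number of steps were picked by hand.

```python
import numpy as np

# Synthetic data: noisy points scattered around the line y = 2x + 1
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.shape)

slope, intercept = 0.0, 0.0  # arbitrary starting line
learning_rate = 0.01

for step in range(5000):
    residuals = (slope * x + intercept) - y
    # Gradients of the mean squared error with respect to slope and intercept
    grad_slope = 2 * np.mean(residuals * x)
    grad_intercept = 2 * np.mean(residuals)
    # Move the line a little in the direction that reduces the error
    slope -= learning_rate * grad_slope
    intercept -= learning_rate * grad_intercept

print(f"gradient descent: slope={slope:.3f}, intercept={intercept:.3f}")
print("closed-form fit :", np.polyfit(x, y, 1))
```

The last line compares the result with NumPy's closed-form least-squares fit, which is a handy sanity check that the loop has converged.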
Logistic regression uses a logistic function to explain the relationship between one dependent binary variable and one or more independent variables.
Support Vector Machines
Support vector machines (SVMs) are supervised learning models that analyze data in order to either perform classification or regression analysis.
Here's an example of why you would use it from this article:
It uses a technique called the kernel trick to transform your data and then based on these transformations it finds an optimal boundary between the possible outputs. Simply put, it does some extremely complex data transformations, then figures out how to separate your data based on the labels or outputs you've defined.
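To see why the kernel trick matters, here is a small sketch comparing a linear SVM with an RBF-kernel SVM on data that no straight line can separate. The concentric-circles dataset from `make_circles` and the chosen settings are assumptions for illustration.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two classes arranged as an inner and an outer ring
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# A linear boundary versus a boundary found after an RBF kernel transformation
linear_svm = SVC(kernel="linear").fit(X_train, y_train)
rbf_svm = SVC(kernel="rbf").fit(X_train, y_train)

linear_accuracy = linear_svm.score(X_test, y_test)
rbf_accuracy = rbf_svm.score(X_test, y_test)
print(f"linear kernel accuracy: {linear_accuracy:.2f}")
print(f"RBF kernel accuracy:    {rbf_accuracy:.2f}")
```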
Vaguely inspired by the biological neural networks in our brain, a neural network is not itself an algorithm, but rather a framework for machine learning algorithms to learn how to perform tasks by considering examples, i.e. training data.
Kernel methods are a class of algorithms for pattern analysis, whose general task is to find different types of relationships in our data - for example, clustering, correlations, classifications, etc.
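K-means clustering is a method of vector quantization, which allows the modeling of probability density, and is popular for cluster analysis in data mining. It partitions the data into k clusters, assigning each data point to the cluster with the nearest mean.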
Hierarchical clustering is a method of cluster analysis that looks to build a hierarchy of clusters.
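Both clustering methods can be tried in a few lines with scikit-learn. This is a sketch on synthetic data: the three well-separated blobs, the choice of three clusters, and the use of the adjusted Rand index to compare against the known groups are all assumptions for illustration.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic, unlabelled-looking data: three well-separated groups of points
X, true_groups = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

# K-means: assign every point to the nearest of k cluster means
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Hierarchical (agglomerative) clustering: repeatedly merge the closest clusters
hierarchical_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# The adjusted Rand index is 1.0 when two groupings agree perfectly
kmeans_agreement = adjusted_rand_score(true_groups, kmeans_labels)
hierarchical_agreement = adjusted_rand_score(true_groups, hierarchical_labels)
print(f"k-means agreement with true groups:      {kmeans_agreement:.2f}")
print(f"hierarchical agreement with true groups: {hierarchical_agreement:.2f}")
```

Neither algorithm ever sees `true_groups`; it is only used afterwards to check the result.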
There are many more algorithms used in machine learning, but these are a few of the most common ones.
7. Training and Testing Machine Learning Models
In this section, we'll describe the different metrics we can use to answer these questions:
- How good is my model?
- How do we improve the model based on these metrics?
In machine learning we have a problem to solve, which is generally related to evaluating data and making predictions. In order to solve the problem we have a few tools at our disposal. The tools are the algorithms, like the ones mentioned above.
How do we know which model will work best?
To answer that we use measurement tools.
We'll now look at how to train, test, evaluate, and validate our models in order to make the best decisions with our data.
Let's first recall the definitions of regression and classification, since we'll be using them quite a bit.
- A regression model predicts a numeric value
- A classification model predicts a state, such as positive or negative, yes or no, or hotdog or not hotdog
In determining the best model, we want one that generalizes well to unseen data and doesn't overfit the data.
To do this we split our dataset into two sets: the training set and the testing set. We split our data into training and testing sets for both regression and classification problems.
Here's what this looks like in Python using the popular machine learning library scikit-learn with a test size of 20%:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
8. Machine Learning Model Evaluation Metrics
After we develop a model we want to find out how well it's performing. This is a difficult question, but there are certain evaluation metrics we can use to help answer this.
First let's start with evaluation metrics that can be used for classification models.
A confusion matrix is a table layout that allows you to visualize the performance of an algorithm. From Wikipedia:
Each row of the matrix represents the instances in a predicted class while each column represents the instances in an actual class (or vice versa).
Important definitions for measuring classification algorithms are:
- True & False Positives
- True & False Negatives
From this Google Machine Learning crash course:
- A true positive is an outcome where the model correctly predicts the positive class.
- A true negative is an outcome where the model correctly predicts the negative class.
- A false positive is an outcome where the model incorrectly predicts the positive class.
- A false negative is an outcome where the model incorrectly predicts the negative class.
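Here is a confusion matrix depicting the four possible outcomes, summarizing the story of the Boy Who Cried Wolf (with "wolf" as the positive class):
- True positive: a wolf threatened, and the shepherd said "wolf"
- False positive: no wolf threatened, but the shepherd said "wolf"
- False negative: a wolf threatened, but the shepherd said "no wolf"
- True negative: no wolf threatened, and the shepherd said "no wolf"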
In a classification problem, accuracy is the number of correct predictions made divided by the total number of predictions made, multiplied by 100 to turn it into a percentage.
- Accuracy = Correct Predictions / Total Predictions * 100
This is a great article on classification accuracy by Machine Learning Mastery.
Accuracy can be calculated using scikit-learn with:
from sklearn.metrics import accuracy_score
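As a quick sketch of how this fits together, the example below builds a confusion matrix and computes accuracy for ten made-up predictions (1 is the positive class). Note that scikit-learn's `confusion_matrix` puts the actual classes in rows and the predicted classes in columns.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels: 1 = positive class, 0 = negative class
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels [0, 1], ravel() yields the counts in this order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"true negatives={tn}, false positives={fp}, false negatives={fn}, true positives={tp}")

# Accuracy as a fraction; multiply by 100 for a percentage
accuracy = accuracy_score(y_true, y_pred)
print(f"accuracy = {accuracy:.2f}")
```

Here 8 of the 10 predictions are correct, so `accuracy_score` returns 0.8, which matches (true positives + true negatives) divided by the total.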
Precision is the number of True Positives divided by the number of True Positives and False Positives.
- Precision = True Positives / (True Positives + False Positives)
Recall is the number of True Positives divided by the number of True Positives and the number of False Negatives.
- Recall = True Positives / (True Positives + False Negatives)
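The F1 Score is as follows:
- F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
The F1 score is the harmonic mean of precision and recall, so it gives us a balance between the two values.
The F-beta score generalizes this by letting us weight one more heavily than the other:
- F-beta Score = (1 + beta^2) * (Precision * Recall) / (beta^2 * Precision + Recall)
If beta = 0, we get precision, and as beta approaches infinity we get recall.
For other values of beta:
- If beta is close to 0, the score is closer to precision
- If beta is large, the score is closer to recall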
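The formulas above can be checked by hand against scikit-learn. In this sketch the ten labels are made up, and `f_beta` is a hypothetical helper that implements the standard F-beta formula, (1 + beta^2) * (Precision * Recall) / (beta^2 * Precision + Recall).

```python
from sklearn.metrics import f1_score, fbeta_score, precision_score, recall_score

# Hypothetical labels: 1 = positive class, 0 = negative class
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# Count the outcomes directly from the definitions
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
false_positives = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
false_negatives = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
f1 = 2 * (precision * recall) / (precision + recall)

def f_beta(beta):
    """Weighted harmonic mean of precision and recall (illustrative helper)."""
    return (1 + beta**2) * (precision * recall) / (beta**2 * precision + recall)

print(f"precision={precision:.3f} (sklearn: {precision_score(y_true, y_pred):.3f})")
print(f"recall={recall:.3f} (sklearn: {recall_score(y_true, y_pred):.3f})")
print(f"F1={f1:.3f} (sklearn: {f1_score(y_true, y_pred):.3f})")
print(f"F0.5={f_beta(0.5):.3f} (sklearn: {fbeta_score(y_true, y_pred, beta=0.5):.3f})")
print(f"F2={f_beta(2):.3f} (sklearn: {fbeta_score(y_true, y_pred, beta=2):.3f})")
```

With these labels precision is higher than recall, so the score with beta = 0.5 lands nearer to precision and the score with beta = 2 lands nearer to recall.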
The ROC curve - or Receiver Operating Characteristic curve - is a graphical plot that illustrates the true positive rate against the false-positive rate at various thresholds.
Here's an example of a ROC curve from Wikipedia:
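As a rough sketch, the points on a ROC curve can also be computed with scikit-learn's `roc_curve`, and the area under the curve with `roc_auc_score`. The four labels and scores below are made up for illustration.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and the model's predicted scores for the positive class
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Each threshold gives one (false positive rate, true positive rate) point
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_true, y_score)
for fpr, tpr in zip(false_positive_rate, true_positive_rate):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")

# The area under the curve summarizes the whole plot in one number
auc = roc_auc_score(y_true, y_score)
print(f"AUC = {auc:.2f}")
```

An AUC of 1.0 would mean every positive example is scored above every negative one, while 0.5 is what random guessing would give.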
We won't cover them in detail in this article, but here are several metrics that can be used to evaluate regression models.
Mean absolute error measures the average absolute difference between the predicted values and the actual values.
- We can use sklearn to calculate this.
Mean squared error measures the average of the squares of the errors. It is the average squared difference between the estimated values and the actual values.
- We can use sklearn to calculate this.
The coefficient of determination, denoted R² or r², is the proportion of the variance in the dependent variable that is predictable from the independent variable(s).
- We can use sklearn to calculate this.
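As a brief sketch, all three regression metrics are available in `sklearn.metrics` as `mean_absolute_error`, `mean_squared_error`, and `r2_score`. The four actual and predicted values below are made up for illustration.

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual values and model predictions
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

mae = mean_absolute_error(y_true, y_pred)  # average of |actual - predicted|
mse = mean_squared_error(y_true, y_pred)   # average of (actual - predicted)^2
r2 = r2_score(y_true, y_pred)              # share of variance explained

print(f"mean absolute error: {mae:.3f}")
print(f"mean squared error:  {mse:.3f}")
print(f"R^2:                 {r2:.3f}")
```

The absolute errors here are 0.5, 0.5, 0 and 1, so the mean absolute error is 0.5 and the mean squared error is 0.375.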
9. Machine Learning Model Selection
With so many models to choose from, we need a framework in order to select the right one for the task at hand. Below are several considerations to make in selecting a machine learning model.
Types of Errors
In machine learning, there are two main types of errors we make:
- Underfitting refers to a model that is unable to capture the relationship between the input and output variables
- Overfitting refers to a model that fits the training data too closely. In essence, it memorizes the training data and won't generalize well to unseen data.
Let's look at each of these in a bit more detail.
- One characteristic of underfitting a classification problem is a model that does not perform well on the training set.
- We call this type of error an error due to bias.
- One characteristic of overfitting is that the model does well on the training set, but it basically memorizes the training set instead of learning its general characteristics.
- We call this type of error an error due to variance.
How do we detect errors?
One way is with a model complexity graph. Here is a great article on the subject, which describes it as a graph that:
Compares the training errors and the cross-validation errors in order to measure if a certain model either overfits or underfits the dataset that it has been exposed to.
From the same article, here's what building a model complexity graph looks like:
On the X-axis we start with a linear model and in this case go up to a Polynomial of degree 14.
In this example the polynomial of degree 14 is overfitting, the linear model is underfitting, and a polynomial of some intermediate degree is the better fit.
Instead of just having a training and testing set for our data, we can also add a cross-validation set, so that the data is divided as follows:
- The training set will be used for training the parameters
- The cross-validation set will be used for making decisions about the model, such as the degree of the Polynomial
- The testing set will be used for the final testing of the model
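The three-way split above can be sketched with NumPy by fitting polynomials of increasing degree. Everything here is an assumption for illustration: the synthetic sine-shaped data, the 60/20/20 split, the range of degrees tried, and the `mean_squared_error` helper.

```python
import numpy as np

# Synthetic data: a noisy sine wave, generated in random order
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=120)
y = np.sin(x) + rng.normal(scale=0.2, size=x.shape)

# 60% training, 20% cross-validation, 20% testing
x_train, y_train = x[:72], y[:72]
x_cv, y_cv = x[72:96], y[72:96]
x_test, y_test = x[96:], y[96:]

def mean_squared_error(coefficients, xs, ys):
    """Average squared difference between the polynomial and the data."""
    return float(np.mean((np.polyval(coefficients, xs) - ys) ** 2))

fits = {}
for degree in range(1, 9):
    # Training set: fit the parameters (the polynomial coefficients)
    coefficients = np.polyfit(x_train, y_train, degree)
    fits[degree] = {
        "coefficients": coefficients,
        "train_error": mean_squared_error(coefficients, x_train, y_train),
        "cv_error": mean_squared_error(coefficients, x_cv, y_cv),
    }
    print(f"degree {degree}: train={fits[degree]['train_error']:.3f} "
          f"cv={fits[degree]['cv_error']:.3f}")

# Cross-validation set: decide on the degree
best_degree = min(fits, key=lambda d: fits[d]["cv_error"])

# Testing set: one final check of the chosen model
test_error = mean_squared_error(fits[best_degree]["coefficients"], x_test, y_test)
print(f"best degree: {best_degree}, test error: {test_error:.3f}")
```

Plotting the training and cross-validation errors against the degree would give a model complexity graph for this data.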
A useful method to recycle our data is called K-Fold Cross-Validation.
As we know, we split our data into training and testing sets...but this isn't always ideal because we could be throwing away the data in the test set that could be useful for training our algorithm.
So is there a way to not 'throw away' this test data?
This is where K-Fold Cross-Validation comes in. What we do is:
- Split our data into k buckets (folds)
- Train our model k times, each time using a different bucket as our testing set and the remaining data as our training set
- Average the results to get a final performance estimate for the model
The class to do this in scikit-learn is KFold (from sklearn.model_selection import KFold), and this is a great article on how to apply it to an example.
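Here is a minimal sketch of K-Fold cross-validation in practice, assuming the Iris data bundled with scikit-learn, a decision tree as the model, and k = 5. The `cross_val_score` helper runs the train-and-score loop for us.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Split the data into 5 shuffled buckets (folds)
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

# Train and score the model 5 times, each time holding out a different fold
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=kfold)

print("accuracy per fold:", scores.round(2))
print(f"average accuracy:  {scores.mean():.2f}")
```

Shuffling matters for this dataset because the rows are ordered by species, so unshuffled folds would each miss most of one class.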
Another way to discover overfitting, underfitting, and a good fitting model is with a learning curve.
You can find a comprehensive guide on Learning Curves for Machine Learning in this article by DataQuest.
Here's a summary of what we can do to find the right machine learning model for the task at hand:
- We train several different models with our training data
- We then use our cross-validation data to find the best of these models
- Then we test it with our testing data to make sure the model is performing well
In terms of tuning and optimizing the hyperparameters of our models there are many algorithms we can use, although one of the simplest strategies is called grid search.
As this article on Hyperparameter Optimization from DataCamp describes:
Grid search is an approach to hyperparameter tuning that will methodically build and evaluate a model for each combination of algorithm parameters specified in a grid.
We can use Grid Search in sklearn with from sklearn.model_selection import GridSearchCV.
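As a short sketch of grid search, the example below tries six combinations of two hyperparameters for a support vector machine. The Iris data, the `SVC` model, and the particular values in `param_grid` are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# The grid: every combination of these values is built and evaluated
param_grid = {
    "C": [0.1, 1, 10],
    "kernel": ["linear", "rbf"],
}

# Each combination is scored with 5-fold cross-validation
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print("best hyperparameters:", search.best_params_)
print(f"best cross-validated accuracy: {search.best_score_:.2f}")
```

The number of models grows quickly with the size of the grid (3 values times 2 values times 5 folds is already 30 fits here), which is the main cost of this simple strategy.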
10. Summary: What is Machine Learning?
To summarize, this is how we look at machine learning projects:
- We have a problem to solve, which may be data we need to classify, a future value to predict, and so on
- We have tools to accomplish this, which include the machine learning algorithms like logistic regression, neural networks, support vector machines, decision trees, etc.
- We also have measurement tools that we use to measure each tool's performance for the given problem like Model Complexity Graphs, Accuracy, Precision, Recall, F1 Score, Learning Curves, etc.
After we've found the best one, that's what we use to model our data and make predictions.
If you want to learn more about machine learning, you can check out our other articles on the subject here.