In this article on natural language processing (NLP), we'll discuss how to perform sentiment analysis with Naive Bayes.

First, we'll discuss what probabilities and conditional probabilities are, how to derive Bayes' rule, and then how to train and test the model.

This article is based on notes from week 2 of the first course in the Natural Language Processing specialization and is organized as follows:

- Probability & Bayes' Rule
- Deriving Bayes' Rule
- Introduction to Naive Bayes
- Laplacian Smoothing
- Log Likelihood
- Training Naive Bayes
- Testing Naive Bayes


## Probability & Bayes' Rule

One way to think about probability is to simply count the frequency that events occur.

In the context of sentiment analysis, if we define an event $A$ as a tweet being labeled positive, then the probability of $A$ is the number of positive tweets in the corpus divided by the total number of tweets:

$$P(A) = P(Positive) = \frac{N_{pos}}{N}$$
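As a minimal sketch of this counting view, the snippet below estimates $P(Positive)$ from a tiny hand-made corpus. The tweets and the `1`/`0` label convention are assumptions made up for illustration, not data from the course.

```python
# Toy labeled corpus (hypothetical): 1 = positive, 0 = negative.
tweets = [
    ("i am happy today", 1),
    ("i am sad today", 0),
    ("happy because i am learning", 1),
    ("not happy at all", 0),
    ("what a great day", 1),
]

n_total = len(tweets)                          # N
n_pos = sum(label for _, label in tweets)      # N_pos

# P(Positive) = N_pos / N
p_positive = n_pos / n_total
print(f"P(Positive) = {p_positive:.2f}")
```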

## Deriving Bayes' Rule

In order to derive Bayes' rule we first need to discuss the concept of conditional probabilities.

If we have a corpus of text and we only consider tweets containing the word "happy", for example, many other positive tweets will be excluded.

In this example, we can write the probability that a tweet is positive given that it contains the word "happy" as follows:

$$P(A | B) = P(Positive | "happy")$$

This is an example of a **conditional probability**.

A conditional probability $P(A | B)$ can be interpreted as the probability of an outcome $A$ given that $B$ has already happened.

In other words, given that an element belongs to the set $B$, what is the probability that it also belongs to $A$?

We can get the conditional probability of the above example with the calculation below:

$$P(Positive | "happy") = \frac{P(Positive \cap "happy")}{P("happy")}$$

We can write a similar equation by changing the position of the two conditions:

$$P("happy" | Positive) = \frac{P("happy" \cap Positive)}{P(Positive)}$$

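Both equations share the same numerator, since the probability of a tweet being positive and containing "happy" is the same quantity whichever way round it is written. Solving the second equation for that numerator and substituting it into the first gives Bayes' rule in the context of sentiment analysis: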

$$P(Positive | "happy") = P("happy" | Positive) * \frac{P(Positive)}{P("happy")}$$

More generally, we can write the Bayes' rule as:

$$P(X | Y) = P(Y | X) * \frac{P(X)}{P(Y)}$$
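To make the rule concrete, here is a small sketch that computes $P(Positive | "happy")$ two ways on a toy corpus: directly from counts, and through Bayes' rule. The corpus is hypothetical, and a tweet is treated as "containing" a word when that word appears after splitting on whitespace, which is a simplifying assumption.

```python
import math

# Toy labeled corpus (hypothetical): 1 = positive, 0 = negative.
tweets = [
    ("i am happy today", 1),
    ("i am sad today", 0),
    ("happy because i am learning", 1),
    ("not happy at all", 0),
    ("what a great day", 1),
]

n = len(tweets)
n_pos = sum(1 for _, y in tweets if y == 1)
n_happy = sum(1 for text, _ in tweets if "happy" in text.split())
n_pos_and_happy = sum(
    1 for text, y in tweets if y == 1 and "happy" in text.split()
)

p_pos = n_pos / n                              # P(Positive)
p_happy = n_happy / n                          # P("happy")
p_happy_given_pos = n_pos_and_happy / n_pos    # P("happy" | Positive)

# Direct definition: P(Positive | "happy") = P(Positive and "happy") / P("happy")
direct = n_pos_and_happy / n_happy

# Bayes' rule: P(Positive | "happy") = P("happy" | Positive) * P(Positive) / P("happy")
via_bayes = p_happy_given_pos * p_pos / p_happy

print(direct, via_bayes)
print(math.isclose(direct, via_bayes))
```

Both routes should give the same value (up to floating-point rounding), which is exactly what the derivation above claims.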

We'll now look at how we can apply Bayes' rule for the purpose of sentiment analysis and natural language processing.

## Introduction to Naive Bayes

In this section, we'll look at how we can classify the sentiment of a tweet using a method called Naive Bayes.

Naive Bayes is an example of supervised machine learning and is quite similar to logistic regression.

Naive Bayes is referred to as "naive" because it assumes that the features we're using (here, the words in a tweet) are all independent of one another given the class.

In reality, however, this is rarely the case, although it can still be useful as a simple method of sentiment analysis.

To start, we have two corpora of text: one of positive tweets and one of negative tweets.

From these, we need to extract the vocabulary, meaning every unique word that appears, along with how often each word occurs in each corpus.

We then need to get a total count of all the words in the positive corpus and all those in the negative corpus.

This is a new step in Naive Bayes and allows you to compute the conditional probability of each word given the class.

Next, we divide the frequency of each word in a class by the total number of words in that class.

We then apply the same procedure for each word in the vocabulary to get a table of conditional probabilities. If we sum over all the probabilities of each class, it will add up to 1.
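The counting steps above can be sketched as follows. The four example tweets are made up for illustration, and tokenization is a plain whitespace split, which is an assumption rather than the course's preprocessing.

```python
from collections import Counter

# Hypothetical toy corpora, already lowercased and stripped of punctuation.
positive_tweets = ["i am happy because i am learning nlp", "i am happy not sad"]
negative_tweets = ["i am sad i am not learning nlp", "i am sad not happy"]

# Word frequencies per class.
freq_pos = Counter(w for tweet in positive_tweets for w in tweet.split())
freq_neg = Counter(w for tweet in negative_tweets for w in tweet.split())

# Vocabulary and total word count per class.
vocab = sorted(set(freq_pos) | set(freq_neg))
n_pos = sum(freq_pos.values())
n_neg = sum(freq_neg.values())

# Table of conditional probabilities: word -> (P(w|pos), P(w|neg)).
cond_prob = {w: (freq_pos[w] / n_pos, freq_neg[w] / n_neg) for w in vocab}

for word, (p_pos, p_neg) in cond_prob.items():
    print(f"{word:10s} {p_pos:.3f} {p_neg:.3f}")
```

Note that "because" never appears in the negative corpus here, so its negative-class probability is exactly zero. That is the problem smoothing addresses later in the article.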

Another key expression is called the Naive Bayes inference condition rule for binary classification.

This expression takes the product, across all $m$ words in a tweet, of the probability of each word given the positive class divided by its probability given the negative class. A product greater than 1 suggests a positive tweet, and a product less than 1 suggests a negative one.

$$\prod^{m}_{i=1}\frac{P(w_i|pos)}{P(w_i|neg)}$$
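A rough sketch of this product for a single tweet is shown below. The probability table is hand-written with hypothetical values, `likelihood_ratio` is an illustrative helper name, and skipping words that are missing from the table is an assumption made for this example.

```python
# Hypothetical table: word -> (P(w|pos), P(w|neg)).
cond_prob = {
    "i": (0.20, 0.20),
    "am": (0.20, 0.20),
    "happy": (0.15, 0.05),
    "sad": (0.05, 0.15),
}


def likelihood_ratio(tweet, cond_prob):
    """Product over the tweet's words of P(w|pos) / P(w|neg)."""
    ratio = 1.0
    for word in tweet.split():
        if word in cond_prob:  # assumption: unknown words are skipped
            p_pos, p_neg = cond_prob[word]
            ratio *= p_pos / p_neg
    return ratio


print(likelihood_ratio("i am happy", cond_prob))  # greater than 1
print(likelihood_ratio("i am sad", cond_prob))    # less than 1
```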

Next, we'll discuss how to overcome several limitations of Naive Bayes.

## Laplacian Smoothing

Laplacian smoothing is a technique that allows you to avoid a common issue with Naive Bayes, which occurs when the probabilities equate to zero.

Given a class, the conditional probability of a word is the frequency of that word in the class divided by the total number of words in the class:

$$P(w_i|class) = \frac{freq(w_i, class)}{N_{class}}$$

With smoothing we use a slightly different formula from the original:

$$P(w_i|class) = \frac{freq(w_i, class) + 1}{N_{class} + V}$$

where:

- $N_{class}$ is the frequency of all words in the class
- $V$ is the number of unique words in the vocabulary

Adding 1 to the numerator prevents the probability from being zero. Adding $V$ to the denominator accounts for the extra 1 added for each word in the vocabulary, so the probabilities in each class still sum to 1.

This process is known as **Laplacian smoothing.**

In summary, Laplacian smoothing is a technique to avoid $P(w_i|class) = 0$.
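The smoothed formula can be sketched in a few lines. The corpus below is a hypothetical toy example and `smoothed_prob` is an illustrative helper name, not something defined in the course.

```python
from collections import Counter

# Hypothetical toy corpora, tokenized by whitespace.
positive_tweets = ["i am happy because i am learning nlp", "i am happy not sad"]
negative_tweets = ["i am sad i am not learning nlp", "i am sad not happy"]

freq_pos = Counter(w for t in positive_tweets for w in t.split())
freq_neg = Counter(w for t in negative_tweets for w in t.split())

vocab = set(freq_pos) | set(freq_neg)
v = len(vocab)                     # V: number of unique words
n_pos = sum(freq_pos.values())     # N_pos: total words in the positive class
n_neg = sum(freq_neg.values())     # N_neg: total words in the negative class


def smoothed_prob(word, freq, n_class, vocab_size):
    """P(w|class) = (freq(w, class) + 1) / (N_class + V)."""
    # Counter returns 0 for words it has never seen in this class.
    return (freq[word] + 1) / (n_class + vocab_size)


# "because" never occurs in the negative corpus, yet its probability is non-zero.
print(smoothed_prob("because", freq_neg, n_neg, v))

# The smoothed probabilities of a class still sum to 1 over the vocabulary.
print(sum(smoothed_prob(w, freq_neg, n_neg, v) for w in vocab))
```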

## Log Likelihood

In this section, we'll introduce the concept of log likelihoods, which are simply logarithms of the probabilities we saw earlier.

Log likelihoods are often much easier to work with in deep learning and natural language processing.

In the context of sentiment classification, we can simplify words into three categories: positive, negative, and neutral. A word's category can be identified from its conditional probabilities.

For each word, dividing its conditional probability in the positive class by its conditional probability in the negative class, $\frac{P(w_i|pos)}{P(w_i|neg)}$, gives a ratio in which:

- Positive words have a ratio larger than 1
- Negative words have a ratio lower than 1
- Neutral words have a ratio of 1

A related quantity, the ratio between the number of positive and negative tweets in the corpus, $\frac{P(pos)}{P(neg)}$, is called the **prior ratio.**

These ratios are key for Naive Bayes and now give us the full formula for binary classification:

$$\frac{P(pos)}{P(neg)} \prod^{m}_{i=1}\frac{P(w_i|pos)}{P(w_i|neg)} > 1$$

This probabilistic model for classification gives us a simple and fast baseline for sentiment analysis.

Since we are multiplying many numbers with values between 0 and 1, we have the risk of underflow, or the numbers returned being so small they can't be stored on our device.

To solve this, we can use a property of logarithms:

$$log(a * b) = log(a) + log(b)$$

Applying this to the Naive Bayes expression, we take the log of the whole product:

$$log\left(\frac{P(pos)}{P(neg)} \prod^{m}_{i=1}\frac{P(w_i|pos)}{P(w_i|neg)}\right)$$

This allows us to write the equation as the sum of the log prior plus the log likelihood:

$$log\frac{P(pos)}{P(neg)} + \sum^m_{i=1}log\frac{P(w_i|pos)}{P(w_i|neg)}$$
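The underflow problem is easy to reproduce. The sketch below multiplies 1,000 small probabilities (the value `1e-4` and the count are arbitrary choices for illustration) and compares the result with the sum of their logs.

```python
import math

# 1,000 hypothetical word probabilities, each equal to 0.0001.
probs = [1e-4] * 1000

# The true product is 1e-4000, far below the smallest positive float,
# so the direct product underflows to exactly 0.0.
product = math.prod(probs)

# log(a * b) = log(a) + log(b), so the same quantity in log space is a sum,
# which stays comfortably within floating-point range.
log_sum = sum(math.log(p) for p in probs)

print(product)
print(log_sum)
```

Working in log space also turns the decision threshold of 1 into a threshold of 0, since $log(1) = 0$.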

In order to use this method to classify tweets, we need to calculate a sentiment score for each word. We'll call this score Lambda.

Lambda is the log of the ratio between the probability of the word given the positive class and its probability given the negative class:

$$\lambda(w) = log\frac{P(w|pos)}{P(w|neg)}$$

If we calculate lambda for every word in our vocabulary we will now have:

- Neutral words have a $\lambda$ of 0
- Positive words have a $\lambda$ > 0
- Negative words have a $\lambda$ < 0

In summary, we can use a sentiment score for each word called Lambda, which reduces the risk of numerical underflow.

By summing each of the Lambdas in a tweet we get the log likelihood, which if larger than 0 indicates a positive tweet and less than 0 indicates a negative tweet.

$$\sum^m_{i=1} log\frac{P(w_i|pos)}{P(w_i|neg)}$$
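As an illustrative sketch, the snippet below builds a Lambda table from hand-written conditional probabilities and sums the Lambdas of a tweet. The probability values are hypothetical, and giving unknown words a Lambda of 0 (treating them as neutral) is an assumption of this example.

```python
import math

# Hypothetical table: word -> (P(w|pos), P(w|neg)).
cond_prob = {
    "i": (0.20, 0.20),
    "am": (0.20, 0.20),
    "happy": (0.15, 0.05),
    "sad": (0.05, 0.15),
}

# lambda(w) = log(P(w|pos) / P(w|neg))
lambdas = {w: math.log(p_pos / p_neg) for w, (p_pos, p_neg) in cond_prob.items()}


def log_likelihood(tweet, lambdas):
    """Sum of the Lambda scores of the words in a tweet."""
    # Assumption: words missing from the table contribute 0 (neutral).
    return sum(lambdas.get(word, 0.0) for word in tweet.split())


print(lambdas)
print(log_likelihood("i am happy", lambdas))  # positive score
print(log_likelihood("i am sad", lambdas))    # negative score
```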

In the next section, we'll look at how to use log-likelihoods to implement Naive Bayes.

## Training Naive Bayes

Now let's look at how to train a Naive Bayes classifier.

As opposed to other classifiers like logistic regression or deep learning techniques, there is no gradient descent used in training—instead we are simply counting frequencies of words in a corpus.

Below are the six steps to train a Naive Bayes classifier for sentiment analysis.

**Step 1: Collect and Annotate the Corpus**

The first step in any supervised machine learning project is to collect and annotate the data for training and testing.

In the context of sentiment analysis, this involves collecting a corpus of tweets and annotating them as either positive or negative.

**Step 2: Data Preprocessing**

The next step is to preprocess the data, which includes the five steps below:

- Making all letters lowercase
- Removing punctuation, URLs, and names
- Removing stop words
- Stemming
- Tokenizing sentences
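The preprocessing steps listed above could look roughly like the sketch below, which uses only the standard library. Several parts are simplifying assumptions: "names" is interpreted as `@handles`, `STOP_WORDS` is a tiny illustrative set, and `crude_stem` is a hypothetical suffix stripper standing in for a real stemmer such as the Porter stemmer. The steps are also applied in a slightly different order than listed, since URLs must be removed before punctuation and tokens are needed before stop-word filtering.

```python
import re
import string

# Tiny illustrative stop-word list (assumption; real lists are much longer).
STOP_WORDS = {"i", "am", "is", "are", "the", "a", "an", "and", "to", "at"}


def crude_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def preprocess(tweet):
    tweet = tweet.lower()                             # lowercase
    tweet = re.sub(r"https?://\S+", "", tweet)        # remove URLs
    tweet = re.sub(r"@\w+", "", tweet)                # remove @handles
    tweet = tweet.translate(str.maketrans("", "", string.punctuation))
    tokens = tweet.split()                            # tokenize on whitespace
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [crude_stem(t) for t in tokens]


print(preprocess("@Alice I am LEARNING NLP at https://example.com !!!"))
```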

**Step 3: Word Count**

The next step is to produce a table containing the frequency of each vocabulary word in each class. This also includes summing the total number of words in each class.

**Step 4: Conditional Probability**

With the table of frequencies, we can then get the conditional probability using the Laplacian smoothing formula.

**Step 5: Lambda Score**

The next step is to get the Lambda score for each word, which is the log of the ratio of the conditional probabilities.

**Step 6: Get the Log Prior**

Finally, we estimate the log prior by counting the number of positive and negative tweets and taking the log of the ratio of positive tweets to negative tweets:

$$logprior = log\frac{D_{pos}}{D_{neg}}$$
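Steps 3 to 6 can be put together in a short training function. This is a sketch under stated assumptions rather than the course's implementation: `train_naive_bayes` is an illustrative name, tweets are assumed to be already preprocessed into token lists, labels are `1` for positive and `0` for negative, and both classes must contain at least one tweet.

```python
import math
from collections import Counter


def train_naive_bayes(tweets, labels):
    """Return (lambdas, logprior) from tokenized tweets and 0/1 labels."""
    # Step 3: word counts per class.
    freq_pos, freq_neg = Counter(), Counter()
    for tokens, label in zip(tweets, labels):
        (freq_pos if label == 1 else freq_neg).update(tokens)

    vocab = set(freq_pos) | set(freq_neg)
    v = len(vocab)
    n_pos, n_neg = sum(freq_pos.values()), sum(freq_neg.values())

    # Steps 4 and 5: smoothed conditional probabilities, then Lambda scores.
    lambdas = {}
    for word in vocab:
        p_w_pos = (freq_pos[word] + 1) / (n_pos + v)
        p_w_neg = (freq_neg[word] + 1) / (n_neg + v)
        lambdas[word] = math.log(p_w_pos / p_w_neg)

    # Step 6: log prior from the number of tweets in each class.
    d_pos = sum(1 for y in labels if y == 1)
    d_neg = len(labels) - d_pos
    logprior = math.log(d_pos / d_neg)
    return lambdas, logprior


# Hypothetical preprocessed training data.
train_x = [["happy", "learn"], ["happy", "great"], ["sad", "learn"], ["sad", "bad"]]
train_y = [1, 1, 0, 0]

lambdas, logprior = train_naive_bayes(train_x, train_y)
print(lambdas)
print(logprior)
```

With a balanced corpus like this one, the log prior is 0, so only the Lambda scores influence the prediction.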

## Testing Naive Bayes

Now that we've trained the model, we need to test it using a validation set to compute the model's accuracy.

In order to test the model, we take the conditional probabilities just derived and use them to predict the sentiment of unseen tweets.

From the previous step, we already have a table of Lambda scores for each unique word in the vocabulary.

Using these Lambda scores and the estimation of the log prior we can predict the sentiment of new tweets.

Below is the process for measuring accuracy on a validation set of unseen tweets $X_{val}$ with labels $Y_{val}$, using the $\lambda$ scores and the log prior from training:

- First, we compute the score of each entry: $score = predict(X_{val}, \lambda, logprior)$
- Next, we evaluate whether the score is greater than or less than 0, which produces a vector with 0's and 1's indicating if the tweet is negative or positive
- With the predictions vector we compute the accuracy of the model over the validation set as follows:

$$\frac{1}{m} \sum^m_{i=1}(pred_i == Y_{val_i})$$

In this step, we compare the predictions to the true value for each observation in the validation set. This will return a set of values of 1 for correct and 0 for incorrect.

We can then use this to compute the accuracy by summing this vector and dividing by the number of examples in the validation set.
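The testing procedure can be sketched as below. The Lambda values and validation tweets are hypothetical, `predict` here scores one tokenized tweet at a time, and two choices are assumptions of this example: words not seen in training contribute 0 to the score, and a score of exactly 0 is classified as negative.

```python
def predict(tokens, lambdas, logprior):
    """Score = log prior + sum of the Lambda scores of the tweet's words."""
    # Assumption: unseen words are treated as neutral (Lambda of 0).
    return logprior + sum(lambdas.get(t, 0.0) for t in tokens)


def accuracy(x_val, y_val, lambdas, logprior):
    """Fraction of validation tweets whose predicted label matches the truth."""
    # Assumption: a score of exactly 0 is classified as negative.
    preds = [1 if predict(tokens, lambdas, logprior) > 0 else 0 for tokens in x_val]
    correct = sum(int(p == y) for p, y in zip(preds, y_val))
    return correct / len(y_val)


# Hypothetical trained parameters and validation data.
lambdas = {"happy": 1.1, "sad": -1.1, "learn": 0.0}
logprior = 0.0

x_val = [["happy", "learn"], ["sad", "day"], ["happy", "sad", "sad"]]
y_val = [1, 0, 1]

acc = accuracy(x_val, y_val, lambdas, logprior)
print(f"accuracy = {acc:.3f}")
```

In this toy run the third tweet is labeled positive but scores below 0, so it counts as an error and the accuracy is two out of three.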

## Summary: Sentiment Analysis with Naive Bayes

In this article, we've looked at using Naive Bayes to classify the sentiment of tweets.

What we're doing with the Naive Bayes formula is estimating the probability of each class using the joint probability of the words in the classes, which can be applied to many more use cases than just sentiment analysis.

A few other applications of Naive Bayes are spam detection, information retrieval, and word disambiguation, which means determining the intended sense of a word from its context.

In summary, Naive Bayes is a probabilistic model that is relatively simple to train, use, and interpret.