In this article, we'll discuss several core concepts of natural language processing (NLP) for sentiment analysis, including classification, logistic regression, and vector spaces.

This article is based on notes from the first course in this Natural Language Processing Specialization and is organized as follows:

- Supervised Machine Learning & Logistic Regression
- Feature Extraction & Vocabulary
- Positive and Negative Frequencies
- Feature Extraction Using Frequencies
- Preprocessing Text Data
- Creating a Matrix of Features
- Logistic Regression Overview
- Logistic Regression Training
- Logistic Regression Testing


## Supervised Machine Learning & Logistic Regression

To start, let's review the concept of supervised machine learning and logistic regression.

In supervised machine learning, you have input features $X$ and a set of labels $Y$.

To make accurate predictions, the goal is to minimize the model's error, which is measured by a cost function. To do this, we use a prediction function with parameters $\theta$ that maps features $X$ to predicted labels $\hat{Y}$.

To map features to labels, we need to minimize the difference between the expected values $Y$ and the predicted values $\hat{Y}$.

The cost function measures how close the predicted values $\hat{Y}$ are to $Y$, and the parameters are then updated accordingly. This process is repeated until the cost function is minimized.

In the context of sentiment analysis, let's say we are tasked with determining if a tweet contains positive or negative sentiment. To do this, we can use a logistic regression classifier in which positive tweets are labeled as 1 and negative as 0.

The first step is to process raw tweets from the training dataset and extract useful features.

The logistic regression classifier will then be trained while minimizing the cost function. After training, it can then be fed new tweets and used for predictions.

Next, we'll look at exactly how we can extract features from the training data.

## Feature Extraction & Vocabulary

In order to extract features from text, we first need to represent the text as a vector. In addition, we'll need to build a vocabulary to encode any text as an array of numbers.

To build the vocabulary, we need to go through all the words from the text and save any new words that appear.

With this vocabulary, we can encode a tweet as a vector with one entry per vocabulary word: 1 if the word appears in the tweet and 0 if not. As you can imagine, the resulting vector will contain a large number of 0's and just a few 1's.

This type of representation, with only a small number of non-zero values, is referred to as a **sparse representation**.
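As a rough illustration of the vocabulary and sparse encoding described above, here is a minimal sketch. The function names `build_vocabulary` and `encode_sparse`, the whitespace tokenization, and the toy tweets are all assumptions made for this example rather than part of the course material.

```python
def build_vocabulary(tweets):
    """Collect the unique words of a corpus in order of first appearance."""
    vocabulary = []
    seen = set()
    for tweet in tweets:
        # Assumption: lowercase and split on whitespace as a stand-in tokenizer.
        for word in tweet.lower().split():
            if word not in seen:
                seen.add(word)
                vocabulary.append(word)
    return vocabulary


def encode_sparse(tweet, vocabulary):
    """Return a 0/1 vector with one entry per vocabulary word."""
    words = set(tweet.lower().split())
    return [1 if word in words else 0 for word in vocabulary]


# Toy corpus, made up for illustration.
tweets = ["I am happy because I am learning NLP", "I am sad"]
vocabulary = build_vocabulary(tweets)
vector = encode_sparse("I am sad", vocabulary)

print(vocabulary)
print(vector)
```

Even with this tiny vocabulary, most entries of the vector are 0. With a realistic vocabulary of many thousands of words, nearly all of them would be.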

The issue with a sparse representation is that a logistic regression model would need to learn $n + 1$ parameters, where $n$ is the size of the vocabulary.

$$[\theta_0, \theta_1, \theta_2, \ldots, \theta_n]$$

where:

- $n = |V|$

With a large enough vocabulary, this would take an excessive amount of time to train the model and even more time to make predictions. In the next section, we'll look at several ways to address this challenge.

## Positive and Negative Frequencies

In this section, we'll discuss how to generate counts that can be used as features in the logistic regression classifier. In particular, we will track how many times a given word shows up in the positive vs. negative sentiment class.

To do this, we start by building the vocabulary, which is the set of unique words in the corpus.

We then take the set of positive tweets and use it to get the positive frequency of each word, which is the number of times that word appears in the positive tweets.

For example, if we have two positive tweets and the word "happy" shows up once in each of them, this word would have a positive frequency of 2.

Next, we apply this same process to the negative tweets in order to get the negative frequency count.

In practice, this table of positive and negative frequencies is a dictionary mapping each (word, class) pair to its frequency.
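One possible version of such a frequency dictionary is sketched below. The name `build_freqs` matches the helper used later in this article, but the body, the whitespace tokenization, and the toy tweets are assumptions made for illustration.

```python
from collections import defaultdict


def build_freqs(tweets, labels):
    """Map each (word, label) pair to the number of times it occurs."""
    freqs = defaultdict(int)
    for tweet, label in zip(tweets, labels):
        # Assumption: lowercase and split on whitespace as a stand-in tokenizer.
        for word in tweet.lower().split():
            freqs[(word, label)] += 1
    return dict(freqs)


# Toy data, made up for illustration: 1 = positive, 0 = negative.
tweets = ["I am happy", "happy to learn NLP", "I am sad", "sad and not happy"]
labels = [1, 1, 0, 0]
freqs = build_freqs(tweets, labels)

print(freqs[("happy", 1)])  # positive frequency of "happy"
print(freqs[("happy", 0)])  # negative frequency of "happy"
```

Here "happy" appears in both positive tweets and in one negative tweet, so its positive frequency is 2 and its negative frequency is 1.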

Next, we will use this frequency dictionary to represent a tweet.

## Feature Extraction Using Frequencies

In this section, we'll look at how to encode a tweet as a vector with 3 dimensions.

This speeds up the logistic regression classifier: instead of needing to learn $|V|$ features, the model only needs to learn 3 features.

We saw that the frequency of a word in a class is simply a dictionary mapping word-class pairs to frequencies, which in turn tells us how many times each word appeared in a given class.

We can now use this dictionary to extract features for sentiment analysis.

If we have an arbitrary tweet $m$, we can extract features with the following equation:

$$X_m = [1, \sum_w freqs(w,1), \sum_w freqs(w,0)]$$

where:

- $X_m$ is the features of tweet $m$
- 1 is the bias
- $\sum_w freqs(w,1)$ is the sum of positive frequencies for each unique word in tweet $m$
- $\sum_w freqs(w,0)$ is the sum of negative frequencies for each unique word in tweet $m$
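The equation above can be sketched in a few lines of Python. The function name `extract_features` matches the helper used later in this article, but this body is only one possible implementation, and the small `freqs` dictionary is made up for the example.

```python
def extract_features(words, freqs):
    """Return [bias, sum of positive freqs, sum of negative freqs] for a tweet."""
    positive = 0
    negative = 0
    # Each unique word in the tweet is counted once, as in the equation.
    for word in set(words):
        positive += freqs.get((word, 1), 0)  # 0 if the pair was never seen
        negative += freqs.get((word, 0), 0)
    return [1, positive, negative]


# Hypothetical frequency dictionary mapping (word, class) pairs to counts.
freqs = {
    ("i", 1): 1, ("i", 0): 1,
    ("am", 1): 1, ("am", 0): 1,
    ("happy", 1): 2, ("happy", 0): 1,
    ("sad", 0): 2,
}

features = extract_features(["i", "am", "sad"], freqs)
print(features)  # [1, 2, 4]
```

For this tweet the positive sum is 1 + 1 + 0 = 2 and the negative sum is 1 + 1 + 2 = 4, so the negative feature dominates, as we would hope for a tweet containing "sad".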

Now that we know how to represent a tweet as a vector of dimension 3, we can preprocess tweets and use the resulting words as the words in the vocabulary.

## Preprocessing Text Data

Two important concepts in preprocessing text data are **stemming** and **stop words**.

The first step in preprocessing text data is to remove the words that don't add meaning to the sentence, known as stop words, along with punctuation. Examples of stop words include "and", "is", "for", and so on.

Every word in the tweet that appears in the list of stop words should then be removed. Punctuation should also be removed, although there are times when it does add sentiment to the text, in which case it doesn't need to be removed.

What is left in the tweet is the important information needed to determine the sentiment.

After the tweets have been preprocessed to only include important information, the next step is to perform stemming.

Stemming in natural language processing is the process of transforming a word to its base stem, which is defined as the set of characters used to construct the word and its derivatives.

Through the process of stemming, we're able to reduce the size of vocabulary significantly.
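To make these two steps concrete, here is a deliberately simplified sketch. The tiny stop word list and the suffix-stripping rules are assumptions made for illustration and are far cruder than what is used in practice, where a library such as NLTK supplies full stop word lists and the Porter stemmer. The name `process_tweet` matches the helper used later in this article, but the body is only one possible version.

```python
import string

# Tiny illustrative stop word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "for", "to", "i", "am"}

# Naive suffix rules standing in for a real stemming algorithm.
SUFFIXES = ("ing", "ed", "s")


def naive_stem(word):
    """Strip one common suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def process_tweet(tweet):
    """Lowercase, drop punctuation and stop words, then stem each word."""
    no_punctuation = str.maketrans("", "", string.punctuation)
    cleaned = tweet.lower().translate(no_punctuation)
    return [naive_stem(word) for word in cleaned.split() if word not in STOP_WORDS]


print(process_tweet("I am learning and tuning the models!"))
# ['learn', 'tun', 'model']
```

Note that a stem such as "tun" does not have to be a real word. What matters is that related forms such as "tuned" and "tuning" map to the same entry in the vocabulary.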

Next, we'll use this preprocessed text data to extract a matrix $X$, which will represent all the tweets in the dataset.

## Creating a Matrix of Features

In this section, we'll create a matrix of features corresponding to the training data.

In particular, we start by preprocessing a tweet to get a list of words that contain information about the sentiment.

We then take this list of words and look up each word's positive and negative counts in the frequency dictionary.

Finally, we get a vector with a bias unit and two additional features to store the sum of the positive and negative frequencies.

Since we would need to perform this for $m$ tweets, we would end up with a matrix $X$ with $m$ rows and 3 columns where each row would contain features for one of the tweets.

Below is a general implementation of how we can do this in Python:

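```
import numpy as np

# build the frequency dictionary from the training tweets and labels
freqs = build_freqs(tweets, labels)
# number of tweets in the training set
m = len(tweets)
# initialize the feature matrix X: one row per tweet, three columns
X = np.zeros((m, 3))
# loop through the tweets
for i in range(m):
    p_tweet = process_tweet(tweets[i])  # preprocess the tweet
    X[i, :] = extract_features(p_tweet, freqs)  # extract its features
```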

Keep in mind the helper functions `build_freqs`, `process_tweet`, and `extract_features` will still need to be built separately.

## Logistic Regression Overview

In this section, we'll use the extracted features to predict the sentiment of a tweet.

Logistic regression is useful for this as it uses a sigmoid function to output a probability between zero and one.

Recall that in supervised machine learning we have input features $X$ and a set of labels $Y$.

In order to make predictions, we need a function with parameters $\theta$ to map features to output labels $\hat{Y}$.

To optimize the mapping of features to labels, we minimize the cost function by comparing how close the output $\hat{Y}$ is to the real labels $Y$ from the data.

After this, the parameters are updated and the process is repeated until the cost function is minimized to a satisfactory level.

In logistic regression, this prediction function is the sigmoid function.

In particular, the function $h$ used to make predictions in logistic regression is the sigmoid function, which depends on the parameters $\theta$ and the feature vector $x^{(i)}$, where $i$ denotes the $i$th observation:

$$h(x^{(i)}, \theta) = \frac{1}{1+e^{-\theta^T x^{(i)}}}$$
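The formula above translates almost directly into NumPy. The sketch below is illustrative: the function names `sigmoid` and `predict_probability` and the example values are assumptions, not part of the course material.

```python
import numpy as np


def sigmoid(z):
    """Map any real number (or array) to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def predict_probability(x, theta):
    """Compute h(x, theta) = sigmoid(theta^T x) for one feature vector."""
    return sigmoid(np.dot(theta, x))


# Hypothetical feature vector: [bias, positive sum, negative sum].
x = np.array([1.0, 2.0, 4.0])
theta = np.zeros(3)  # untrained parameters

print(predict_probability(x, theta))  # 0.5, because theta^T x = 0
```

With all parameters at zero, the model is maximally uncertain and outputs 0.5 for every tweet. Training moves $\theta$ so that positive tweets are pushed toward 1 and negative tweets toward 0.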

Now that we have the notation for logistic regression, we can use it to train a weight vector $\theta$.

## Logistic Regression Training

In order to train the logistic regression classifier, we need to iterate until we find a set of parameters $\theta$ that minimizes the cost function.

If, for example, the loss function only depends on $\theta_1$ and $\theta_2$, below are the steps we need to take to train it:

- First, we need to initialize the parameter vector $\theta$
- We then use the logistic function to get values for each observation $h = h(X, \theta)$
- We then calculate the gradient $\nabla = \frac{1}{m} X^T(h - y)$, where $m$ is the number of observations
- Next, we update the parameters $\theta = \theta - \alpha\nabla$, where $\alpha$ is the learning rate
- Finally, we compute the loss function $J(\theta)$ and determine if more iterations are needed according to a stopping criterion or a maximum number of iterations
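The steps above can be sketched as a short gradient descent loop. This is a minimal illustration under a few assumptions: the article does not spell out $J(\theta)$, so the usual binary cross-entropy cost is assumed here, the loop simply runs for a fixed number of iterations, and the function names, default values, and toy data are made up for the example.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def compute_cost(h, y):
    """Binary cross-entropy, assumed here as the cost J(theta)."""
    eps = 1e-12  # avoids log(0)
    return float(-np.mean(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps)))


def gradient_descent(X, y, alpha=0.1, num_iters=1000):
    """Train theta with batch gradient descent; returns theta and the cost history."""
    m, n = X.shape
    theta = np.zeros(n)                    # initialize the parameters
    costs = []
    for _ in range(num_iters):
        h = sigmoid(X @ theta)             # predictions for every observation
        gradient = X.T @ (h - y) / m       # (1/m) X^T (h - y)
        theta = theta - alpha * gradient   # parameter update
        costs.append(compute_cost(h, y))   # cost before this update
    return theta, costs


# Toy feature matrix: [bias, positive sum, negative sum] per tweet.
X = np.array([[1.0, 3.0, 0.0],
              [1.0, 2.0, 1.0],
              [1.0, 0.0, 3.0],
              [1.0, 1.0, 2.0]])
y = np.array([1.0, 1.0, 0.0, 0.0])

theta, costs = gradient_descent(X, y)
print(theta)
print(costs[0], costs[-1])
```

On this toy data the cost starts at $\ln 2 \approx 0.693$, the value for a model that predicts 0.5 everywhere, and decreases as training proceeds.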

Now that we have the $\theta$ variable, we want to test how accurate the model is.

## Logistic Regression Testing

In this section, we'll look at how we can compute the accuracy of the model.

To test the model we will need $X_{val}$ and $Y_{val}$, which together form the validation set, and the trained parameters $\theta$.

First, we compute the sigmoid function for $X_{val}$ with parameters $\theta$, that is $h(X_{val}, \theta)$, and then check whether each value of $h$ is greater than or equal to a threshold, typically 0.5:

$$pred = h(X_{val}, \theta)\geq 0.5$$

After this process we will have a vector of zeros and ones indicating negative and positive examples.

With this predictions vector we can then compute the accuracy of the model against the validation sets. To do so, we compare the predictions with the true value of the observation.

After comparing each prediction with true labels we can get the total accuracy by summing the vector of comparisons and dividing by $m$:

$$\text{accuracy} = \frac{1}{m}\sum^m_{i=1} \left(pred^{(i)} == y^{(i)}_{val}\right)$$

This metric provides an estimate of how accurate the logistic regression model will be on unseen data.
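The testing procedure can be sketched as follows. The function name `compute_accuracy`, the validation data, and the parameter values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def compute_accuracy(X_val, y_val, theta, threshold=0.5):
    """Fraction of validation examples whose predicted label matches the true label."""
    h = sigmoid(X_val @ theta)             # predicted probabilities
    pred = (h >= threshold).astype(int)    # 1 = positive, 0 = negative
    return float(np.mean(pred == y_val))   # sum of matches divided by m


# Hypothetical validation features: [bias, positive sum, negative sum].
X_val = np.array([[1.0, 3.0, 0.0],
                  [1.0, 0.0, 2.0],
                  [1.0, 1.0, 2.0],
                  [1.0, 2.0, 1.0]])
y_val = np.array([1, 0, 1, 1])
theta = np.array([0.0, 1.0, -1.0])  # hypothetical trained parameters

accuracy = compute_accuracy(X_val, y_val, theta)
print(accuracy)  # 0.75
```

Here the third tweet is labeled positive but has a larger negative sum, so the model misclassifies it and gets 3 of the 4 examples right.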

## Summary: Sentiment Analysis with Logistic Regression

In this article, we first discussed how to preprocess text data for the purpose of sentiment analysis, particularly classifying tweets as either positive or negative.

We then looked at how to extract useful features from text data and how to use these features to train a logistic regression model.

Finally, we looked at how to test the accuracy of the model and estimate how well it will perform on unseen data.