Customer Review Analysis - Ambolt
Use case demo

Overview

In this guide we will show you how to automatically analyse how positive or negative customer reviews are.

Problem

You have a large dataset of written customer reviews, obtained for example from Twitter or Amazon, or from questionnaires and optional user forms. For each review, you would like to know whether it is generally favorable or not, but doing so manually would be time-consuming and expensive. This task is known as sentiment analysis.

Result

By following this guide, you will be able to automatically score how positive and how negative a given review is.

Compatible Emily version

This guide was made with Emily version 0.2.4.

Recommended theory

If you want to know more about the theory behind this guide, you can read up on terms such as sentiment analysis, TF-IDF, naive Bayes classification, tokenization, stemming, and stopwords.

Data

We will use a large dataset of user-written movie reviews from the website IMDB. Each review is labeled either positive or negative, depending on whether it is favorable or not.

How to get it

The dataset can be downloaded from Kaggle here.

Data samples

Here are some samples from the dataset, to give you an idea of the kind of data you will be working with.

Review: If you like original gut wrenching laughter you will like this movie. If you are young or old then you will love this movie, hell even my mom liked it.<br /><br />Great Camp!!!
Sentiment: positive

Review: Besides being boring, the scenes were oppressive and dark. The movie tried to portray some kind of moral, but fell flat with its message. What were the redeeming qualities?? On top of that, I don’t think it could make librarians look any more unglamorous than it did.
Sentiment: negative

Review: It’s terrific when a funny movie doesn’t make smile you. What a pity!! This film is very boring and so long. It’s simply painfull. The story is staggering without goal and no fun.<br /><br />You feel better when it’s finished.
Sentiment: negative

Review: I have seen this film at least 100 times and I am still excited by it, the acting is perfect and the romance between Joe and Jean keeps me on the edge of my seat, plus I still think Bryan Brown is the tops. Brilliant Film.
Sentiment: positive

Implementation

The project is implemented using the Emily platform.

Download or create project

The project can be downloaded here. Import the project with `emily import `.

 

If you want to create the project from scratch, type `emily build ml-api` to create a project with a blank machine learning template.

Data preparation

The dataset should be split into a training part and a test part to ensure that the accuracy of the model isn’t tested on reviews it has already seen during training. To do this, you can use the following python script. If you save the dataset as `dataset.csv` and run this script in the same folder as the dataset, you will get two new files: `train_data.csv` and `test_data.csv`.

import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv("dataset.csv")

# Hold out 20% of the reviews for testing; fixing the seed makes the split reproducible
train, test = train_test_split(data, test_size=0.2, random_state=42)

train.to_csv('train_data.csv', index=False)
test.to_csv('test_data.csv', index=False)

Emily API

We will now go through the code that needs to be implemented in the Emily project to get a fully functioning machine learning API.

model.py

We first define the machine learning model that we will use, which is a naive Bayes classifier. We do this by importing `MultinomialNB` from `sklearn.naive_bayes` and making the `Model` class inherit from `MultinomialNB`. We also instantiate a `TfidfVectorizer` from the `sklearn.feature_extraction.text` library, which implements term frequency-inverse document frequency (TF-IDF) calculations. TF-IDF is a measure of how impactful a word is in a collection of text files, such that frequently used words like “good” have less impact than less frequently used words like “wonderful”, even though they both express positive sentiment. The TF-IDF values will be used as input to the naive Bayes model.

from sklearn.naive_bayes import MultinomialNB
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
 
 
class Model(MultinomialNB):
 
   def __init__(self):
       super().__init__()  # Inherit methods from the super class which this class extends from
       self.vectorizer = TfidfVectorizer()

Next we implement the methods of the model class. The `forward` method makes a prediction for the given sample by calling the `predict` method inherited from the `MultinomialNB` class. The `save_model` and `load_model` methods respectively save and load a model using the `pickle` library. The `load_model` method also updates the current instantiation of the model with the newly loaded model.

   def forward(self, sample):
       return self.predict(sample)
 
   def save_model(self, save_path):
       with open(save_path, 'wb') as fp:
           pickle.dump(self, fp)
 
   def load_model(self, model_path):
       with open(model_path, 'rb') as fp:
           model = pickle.load(fp)
           self.__dict__.update(model.__dict__)
 
   def __call__(self, sample):
       return self.forward(sample)
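
To see the save and load methods in action outside the Emily project, here is a minimal standalone sketch of the same class; the tiny two-review corpus is made up purely for illustration:

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB


class Model(MultinomialNB):

    def __init__(self):
        super().__init__()
        self.vectorizer = TfidfVectorizer()

    def save_model(self, save_path):
        with open(save_path, 'wb') as fp:
            pickle.dump(self, fp)

    def load_model(self, model_path):
        with open(model_path, 'rb') as fp:
            model = pickle.load(fp)
            self.__dict__.update(model.__dict__)


# Fit on a toy corpus, save, then load into a fresh instance
model = Model()
X = model.vectorizer.fit_transform(["good movie", "bad movie"])
model.fit(X, ["positive", "negative"])

path = os.path.join(tempfile.mkdtemp(), "model.pickle")
model.save_model(path)

restored = Model()
restored.load_model(path)
print(restored.predict(restored.vectorizer.transform(["good movie"]))[0])  # positive
```

Because `load_model` copies the loaded model's `__dict__` into the fresh instance, the restored model carries both the fitted naive Bayes parameters and the fitted vectorizer.
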
trainer.py

The `Trainer` class is responsible for the training of a model. An important part of the training is to preprocess the text. To do this, we make use of the natural language toolkit library `nltk`, and download a list of stopwords that will be useful later.

from ml.model import Model
import pandas as pd
import nltk
from nltk.stem.porter import PorterStemmer
import re
from tqdm import tqdm  # Wraps iterables and prints a progress bar
import os
import pickle
 
nltk.download('stopwords')
 
 
class Trainer:
   """
   The Trainer class is used for training a model instance based on the Model class found in ml.model.py.
   In order to get started with training a model the following steps needs to be taken:
   1. Define the Model class in ml.model.py
   2. Prepare train data on which the model should be trained with by implementing the _read_train_data() function and
   the _preprocess_train_data() function
   """
 
   def __init__(self):
       # creates an instance of the Model class (see guidelines in ml.model.py)
        self.model = Model()

The `train` method first loads the training data and preprocesses it using the `_load_train_data` and `_preprocess_train_data` methods. The model cannot work with text directly, so the data is transformed into TF-IDF values, computed by the TF-IDF vectorizer. Finally, the model is fit to the training data and the labels, and the resulting model is saved.

    def train(self, request):
       """
       Starts the training of a model based on data loaded by the self._load_train_data function
       """
 
       # Unpack request
       dataset_path = request.dataset_path
       save_path = request.save_path
 
       # Read the dataset from the dataset_path
       train_data = self._load_train_data(dataset_path)
 
       # Preprocess the dataset
       preprocessed_train_data, labels = self._preprocess_train_data(
           train_data, './data/preprocessed.pickle')
 
       # Fit the model
       X = self.model.vectorizer.fit_transform(preprocessed_train_data)
       y = labels
       self.model.fit(X, y)
 
       # Save the trained model
       self.model.save_model(save_path)
 
       return True

The `_load_train_data` method simply uses the `pandas` library to read data from the given csv-file.

   def _load_train_data(self, dataset_path):
       with open(dataset_path) as fp:
           dataset = pd.read_csv(fp).values
 
       return dataset

The `_preprocess_train_data` method first checks if the data has already been preprocessed and saved, and simply returns the saved preprocessed data if so. This is because the steps taken to preprocess the data can be quite slow, so it is advantageous to not repeat it unless necessary. If the data has not already been preprocessed and saved, however, the method goes through each review one at a time and normalizes the text in the review. When all the text has been normalized, it is saved using the `pickle` library.

   def _preprocess_train_data(self, train_data, preprocess_data_path):
       if os.path.isfile(preprocess_data_path):
           with open(preprocess_data_path, 'rb') as fp:
               print(f'Loading {preprocess_data_path}...')
               return pickle.load(fp)
 
       data = []
       labels = []
       for review, sentiment in tqdm(train_data, '[Preprocessing training data...]'):
           data.append(self._normalize_text(review))
           labels.append(sentiment)
 
       preprocessed = (data, labels)
 
       with open(preprocess_data_path, 'wb') as fp:
           print(f'Writing preprocessed data to {preprocess_data_path}')
           pickle.dump(preprocessed, fp)
 
       return preprocessed

The `_normalize_text` method performs the normalization of the text included in the reviews. First the entire text is set to lower-case, and leftover HTML tags are removed. Then the `nltk` library is used to tokenize the text, which just means that the text is broken up into smaller parts called tokens, such as the sentence “This is a sentence” being split into “This”, “is”, “a”, and “sentence”. The rest of the method then works on these tokens instead of the entire text string. After this, the `_negate` and `_normalize_words` methods are called, and finally the tokens are joined back together to form a string, which is then returned.

   def _normalize_text(self, text):
       text = text.lower()
       text = text.replace("<br />", "")
       words = nltk.word_tokenize(text)
       words = self._negate(words)
       words = self._normalize_words(words)
       result_words = ' '.join(words)
       return result_words

The `_negate` method explicitly marks words that are negated, so that for example the word “good” has a different meaning if preceded by the word “not”. It looks for negating words, such as “isn’t” or “not”, and marks every word following such a negating word with “_NEG” until it sees a punctuation mark.

   def _negate(self, words):
       negated = []
       is_negated = False
       clause_punctuation_regex = re.compile(r"^[.:;!?]$")
       negating_word_regex = re.compile(r"(?:^(?:never|no|nothing|nowhere|noone|none|not|havent|hasnt|hadnt|cant|"
                                        r"couldnt|shouldnt|wont|wouldnt|dont|doesnt|didnt|isnt|arent|aint)$)|n't")
 
       for word in words:
           if clause_punctuation_regex.match(word):
               if is_negated:
                   is_negated = False
               continue
 
           elif negating_word_regex.match(word):
               is_negated = True
 
           negated.append(f'{word}{"_NEG" if is_negated else ""}')
       return negated
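
Running a standalone copy of this logic on a small token list shows the tagging in action (the sample tokens are made up for illustration):

```python
import re

# Standalone sketch of the _negate logic above, for illustration only
CLAUSE_PUNCTUATION = re.compile(r"^[.:;!?]$")
NEGATING_WORD = re.compile(r"(?:^(?:never|no|nothing|nowhere|noone|none|not|havent|hasnt|hadnt|cant|"
                           r"couldnt|shouldnt|wont|wouldnt|dont|doesnt|didnt|isnt|arent|aint)$)|n't")


def negate(words):
    negated = []
    is_negated = False
    for word in words:
        if CLAUSE_PUNCTUATION.match(word):
            # Punctuation ends the negated clause and is dropped
            is_negated = False
            continue
        elif NEGATING_WORD.match(word):
            is_negated = True
        negated.append(f'{word}{"_NEG" if is_negated else ""}')
    return negated


print(negate(["this", "movie", "is", "not", "good", ".", "great", "acting"]))
# ['this', 'movie', 'is', 'not_NEG', 'good_NEG', 'great', 'acting']
```

Note that "good" is tagged as `good_NEG` because it follows "not", while "great" after the period is left untouched.
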

Finally, the `_normalize_words` method removes so-called stopwords, which are words such as “the”, “a”, or “is”, that are so common that they carry little meaning on their own. The method also performs stemming on each word, which means that it reduces every word to its stem, so that for example “watch”, “watches”, “watching”, and “watched” are all reduced to the word “watch”, and therefore identified as the same word. This is all done using tools from the `nltk` package.

   def _normalize_words(self, words):
       stopwords = set(nltk.corpus.stopwords.words('english'))
       porter = PorterStemmer()
 
       normalized = []
       for word in words:
           if word in stopwords:
               continue
           normalized.append(porter.stem(word))
 
       return normalized
 
   def __call__(self, request):
       return self.train(request)
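
A quick check of the stemming behaviour, assuming `nltk` is installed:

```python
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

# Every inflection collapses to the same stem
stems = [porter.stem(w) for w in ["watch", "watches", "watching", "watched"]]
print(stems)  # ['watch', 'watch', 'watch', 'watch']
```
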
evaluator.py

Once a model has been trained, its performance should be evaluated. This is handled by the `Evaluator` class, which evaluates a given model on the test data.

from ml.model import Model
import pickle
from tqdm import tqdm
import os
import nltk
from nltk.stem.porter import PorterStemmer
import re
import pandas as pd
import numpy as np
 
 
class Evaluator:
   """
   The Evaluator class is used for evaluating a trained model instance.
   In order to get started with evaluating a model the following steps needs to be taken:
   1. Train a model following the steps in ml.trainer.py
   2. Prepare test data on which the model should be evaluated on by implementing the _read_test_data() function and
   the _preprocess_test_data function
   """
 
   def __init__(self):
       self.model_path = ""
       self.model = None

The main part of the `Evaluator` class is the `evaluate` method. This method first loads the model from the given model path, and then loads the test data from the given data path. Afterwards, the test data is preprocessed in exactly the same way as in the `Trainer` class. The test data is then transformed into TF-IDF values using the TF-IDF vectorizer from the model. Finally, the test data and the correct labels are passed to the model's `score` method, which returns the accuracy of the predictions. This accuracy is returned as the final evaluation score.

    def evaluate(self, request):
       """
       Evaluates a trained model located at 'model_path' based on test data from the self._load_test_data function
       """
 
       # Unpack request
       dataset_path = request.dataset_path
       model_path = request.model_path
 
       # Loads a trained instance of the Model class
       # If no model has been trained yet proceed to follow the steps in ml.trainer.py
       if model_path != self.model_path:
           self.model = Model()
           self.model.load_model(model_path)
           self.model_path = model_path
 
       # Read the dataset from the dataset_path
       test_data = self._load_test_data(dataset_path)
 
        # Preprocess dataset to prepare it for the evaluator.
        # Note: the preprocessed test data is cached in its own file so it
        # is not confused with the cached training data.
        test_dataset, labels = self._preprocess_test_data(
            test_data, './data/preprocessed_test.pickle')
 
       # Evaluate model
       test_dataset = self.model.vectorizer.transform(test_dataset)
 
       actual = np.array(labels).reshape(-1, 1)
       score = self.model.score(test_dataset, actual)
 
       return score
 
   def _load_test_data(self, dataset_path):
       with open(dataset_path) as fp:
           dataset = pd.read_csv(fp).values
 
       return dataset
 
   def _preprocess_test_data(self, test_data, preprocess_data_path):
       if os.path.isfile(preprocess_data_path):
           with open(preprocess_data_path, 'rb') as fp:
               print(f'Loading {preprocess_data_path}...')
               return pickle.load(fp)
 
       data = []
       labels = []
        for review, sentiment in tqdm(test_data, '[Preprocessing test data...]'):
           data.append(self._normalize_text(review))
           labels.append(sentiment)
 
       preprocessed = (data, labels)
 
       with open(preprocess_data_path, 'wb') as fp:
           print(f'Writing preprocessed data to {preprocess_data_path}')
           pickle.dump(preprocessed, fp)
 
       return preprocessed
 
   def _negate(self, words):
       negated = []
       is_negated = False
       clause_punctuation_regex = re.compile(r"^[.:;!?]$")
       negating_word_regex = re.compile(r"(?:^(?:never|no|nothing|nowhere|noone|none|not|havent|hasnt|hadnt|cant|"
                                        r"couldnt|shouldnt|wont|wouldnt|dont|doesnt|didnt|isnt|arent|aint)$)|n't")
 
       for word in words:
           if clause_punctuation_regex.match(word):
               if is_negated:
                   is_negated = False
               continue
 
           elif negating_word_regex.match(word):
               is_negated = True
 
           negated.append(f'{word}{"_NEG" if is_negated else ""}')
       return negated
 
   def _normalize_words(self, words):
       stopwords = set(nltk.corpus.stopwords.words('english'))
       porter = PorterStemmer()
 
       normalized = []
       for word in words:
           if word in stopwords:
               continue
           normalized.append(porter.stem(word))
 
       return normalized
 
   def _normalize_text(self, text):
       text = text.lower()
       text = text.replace("<br />", "")
       words = nltk.word_tokenize(text)
       words = self._negate(words)
       words = self._normalize_words(words)
       result_words = ' '.join(words)
       return result_words
 
   def __call__(self, request):
       return self.evaluate(request)
predictor.py

Now that the model has been trained and evaluated, we are ready to use the model to make predictions. This is done by the `Predictor` class.

from ml.model import Model
 
 
class Predictor:
   """
   The Predictor class is used for making predictions using a trained model instance based on the Model class
   defined in ml.model.py and the training steps defined in ml.trainer.py
   """
 
   def __init__(self):
       self.model_path = ""
       self.model = None

The `predict` method first loads the model from the given model path. The sample input is then preprocessed by transforming it into the corresponding TF-IDF values using the TF-IDF vectorizer from the model. The model then predicts the probability of each label class (positive/negative), and these probabilities are postprocessed so they can be returned in a suitable format.

    def predict(self, request):
       """
       Performs prediction on a sample using the model at the given path
       """
 
       # Unpack request
       sample = request.sample
       model_path = request.model_path
 
       # Loads a trained instance of the Model class
       # If no model has been trained yet proceed to follow the steps in ml.trainer.py
       if model_path != self.model_path:
           self.model = Model()
           self.model.load_model(model_path)
           self.model_path = model_path
 
       # Preprocess the inputted sample to prepare it for the model
       preprocessed_sample = self._preprocess(sample)
 
       # Forward the preprocessed sample into the model as defined in the __call__ function in the Model class
       prediction = self.model.predict_proba(preprocessed_sample)
 
       # Postprocess the prediction to prepare it for the client
       prediction = self._postprocess(prediction)
 
       return prediction

The `_preprocess` method transforms the input sample to its corresponding TF-IDF values through the TF-IDF vectorizer on the model which has been fit on the training data.

   def _preprocess(self, sample):
       return self.model.vectorizer.transform([sample])

Finally, the `_postprocess` method takes the prediction probabilities and puts them into a dictionary, which the API returns to the client as JSON, a standard format for exchanging information through an API.

   def _postprocess(self, prediction):
       return {
           'positive': prediction[0][1],
           'negative': prediction[0][0]
       }
 
   def __call__(self, request):
       return self.predict(request)

API test

You can now test that the API works as intended, using one of the methods described below. Before testing, make sure that you have started the API by running `python api.py` in a terminal inside the project in VS Code.

FastAPI docs

The FastAPI documentation can be accessed by going to http://localhost:4242/docs. Here you will see a list of all the API endpoints that are defined. Click on one of the endpoints, then click “Try it out” followed by “Execute” to test that endpoint.

curl

You can also test the API using the command-line tool curl.

 

Use the following command to test the train endpoint.

curl --header "Content-Type: application/json" --request POST --data "{\"dataset_path\":\"data/train_data.csv\",\"save_path\":\"data/model.pickle\"}" http://127.0.0.1:4242/api/train

You should get the following output.

{
    "result": true
}

Use the following command to test the evaluate endpoint.

curl --header "Content-Type: application/json" --request POST --data "{\"dataset_path\":\"data/test_data.csv\",\"model_path\":\"data/model.pickle\"}" http://127.0.0.1:4242/api/evaluate

You should get the following output.

{
    "result": 0.8416
}

Use the following command to test the predict endpoint.

curl --header "Content-Type: application/json" --request POST --data "{\"sample\":\"Great movie, loved every minute of it. The action scenes were great and the acting superb. Recommended for lovers of cinema\",\"model_path\":\"data/model.pickle\"}" http://127.0.0.1:4242/api/predict

You should get the following output.

{
    "result": {
        "positive": 0.6749371472538137,
        "negative": 0.3250628527461847
    }
}

Postman

Finally you can use the graphical tool Postman to test the API. For each request, make sure you choose “Body” and then “raw” as the parameter type.

 

To test the train endpoint, make a POST request to http://localhost:4242/api/train with the following parameters.

{
    "dataset_path": "data/train_data.csv",
    "save_path": "data/model.pickle"
}

You should get the following output.

{
    "result": true
}

To test the evaluate endpoint, make a POST request to http://localhost:4242/api/evaluate with the following parameters.

{
    "dataset_path": "data/test_data.csv",
    "model_path": "data/model.pickle"
}

You should get the following output.

{
    "result": 0.8416
}

To test the predict endpoint, make a POST request to http://localhost:4242/api/predict with the following parameters.

{
    "sample": "Great movie, loved every minute of it. The action scenes were great and the acting superb. Recommended for lovers of cinema",
    "model_path": "data/model.pickle"
}

You should get the following output.

{
    "result": {
        "positive": 0.6749371472538137,
        "negative": 0.3250628527461847
    }
}

Improving the model

Here are some suggestions for things you can try modifying in order to improve the model.

  • The TF-IDF vectorizer has various parameters you can try tuning.
  • Similarly, the naive Bayes model also has some parameters that can be tuned.
  • You can try limiting the number of features used in the model, so that only the most important words are used for predicting. This may help against overfitting.
  • Finally, and most importantly, you can vary many parts of the text preprocessing. This includes using a different list of stopwords or negating words, or using a different stemming algorithm.
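
As a starting point, the tuning knobs mentioned above can be set like this (the concrete values are arbitrary examples, not tuned recommendations):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Keep only the 20,000 most frequent terms and include bigrams,
# so that two-word phrases such as "not good" become features
vectorizer = TfidfVectorizer(max_features=20000, ngram_range=(1, 2))

# alpha is the additive smoothing parameter of the naive Bayes model;
# values below the default of 1.0 smooth less aggressively
model = MultinomialNB(alpha=0.5)
```

Whether any particular setting helps should be judged by re-running the evaluate endpoint on the held-out test data.
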

App

Lastly we will see how an app can call the API to give predictions to the user. You can download the app here. To use the app, first make sure that the API is running. Then open `index.html` in your favorite browser. You can now enter a review into the text box, and the bar below will update in real time showing how positive or negative the review is predicted to be.