Sentiment Analysis (Part 1): Finetuning DistilBERT for Text Classification and Hosting it on Hugging Face.
1. Introduction
1.1 Sentiment Analysis
a. Overview of Sentiment Analysis
Sentiment Analysis is a process of analyzing and extracting emotions, opinions, and attitudes expressed in a piece of text. It is a branch of Natural Language Processing (NLP) that aims to understand the sentiment of a given text, whether it is positive, negative, or neutral. It is widely used in various fields, such as marketing, finance, politics, and customer service, to gain insights into customer feedback and opinions.
b. Types of Sentiment Analysis
Sentiment Analysis involves the use of machine learning algorithms to classify the sentiment of the text into positive, negative, or neutral. There are three types of Sentiment Analysis:
- Fine-grained Sentiment Analysis: This type of Sentiment Analysis classifies the sentiment of the text into more granular categories, such as very positive, positive, neutral, negative, and very negative, allowing for more nuanced analysis.
- Binary Sentiment Analysis: This type of Sentiment Analysis classifies the sentiment of the text into only two categories, positive or negative.
- Emotion Detection: This type of Sentiment Analysis involves identifying the emotions expressed in a piece of text, such as happiness, sadness, anger, and more.
1.2 Hugging Face
a. Overview of Hugging Face
Hugging Face is a popular open-source platform for NLP whose transformers library offers a variety of pre-trained models for Sentiment Analysis and other NLP tasks, along with tools for fine-tuning these models on specific datasets. Its user-friendly interface, powerful API, and extensive documentation have made it popular in the NLP community.
b. Importance of Hugging Face in NLP
Hugging Face has played a significant role in making NLP more accessible to developers and researchers. Its pre-trained models, including those for Sentiment Analysis, have reduced the time and resources needed to train models from scratch, allowing for faster and more efficient experimentation. Hugging Face also provides a range of tools and APIs for fine-tuning these pre-trained models on specific datasets, making it easier to achieve state-of-the-art performance.
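As a quick illustration of that accessibility, here is a minimal sketch using the transformers pipeline API (the sample sentence is made up; when no model is specified, the pipeline falls back to a default sentiment checkpoint):
from transformers import pipeline
# with no model specified, the pipeline loads a default
# DistilBERT checkpoint fine-tuned for sentiment analysis
classifier = pipeline("sentiment-analysis")
print(classifier("I love this product!"))
# expected output shape: [{'label': 'POSITIVE', 'score': 0.99...}]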
2. Finetuning a Text Classification Model with Hugging Face
Proper data preparation is essential before finetuning a Sentiment Analysis model with Hugging Face. This involves selecting a relevant and diverse dataset, ensuring it is preprocessed and formatted correctly, and using techniques such as stop word removal and lemmatization to improve dataset quality, as sketched below. A large and diverse dataset is crucial for effective model training.
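For example, a minimal preprocessing sketch using NLTK might look like this (illustrative only; the dataset in this post is later fed to the model’s own tokenizer without these steps):
# illustrative NLTK preprocessing: stop word removal and lemmatization
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    # lowercase, drop stop words, and reduce each word to its lemma
    tokens = text.lower().split()
    tokens = [lemmatizer.lemmatize(t) for t in tokens if t not in stop_words]
    return ' '.join(tokens)

print(clean_text("The vaccines are working wonderfully"))  # e.g. 'vaccine working wonderfully'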
Why should you use Google Colab?
Google Colab is an excellent solution for NLP projects, especially for those without GPUs. It provides cloud-based computation, including GPUs and TPUs, for efficient training and running of computationally intensive NLP models. With its free and convenient platform, there’s no need for costly local infrastructure. Colab enables seamless collaboration and sharing, integrates with Google services, and offers pre-installed libraries. It’s an ideal choice for interactive NLP development, empowering users to overcome hardware limitations and access necessary computational resources.
Note: If your machine has a GPU or TPU, you can configure your Jupyter notebook to utilize these resources. Otherwise, I recommend using Google Colab as an alternative.
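You can confirm which device your runtime exposes with a quick check (a minimal sketch assuming PyTorch, which comes preinstalled on Colab):
import torch
# use the GPU if the runtime exposes one, otherwise fall back to the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f'Using device: {device}')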
2.1 Read, Split and Prepare data
To prepare data for training or testing a model, it’s important to first ensure the dataset is complete and clean. This will help inform the next steps, including data preprocessing.
Install the libraries and dependencies
We will install the libraries and dependencies that are not readily available in Google Colab.
# install dependencies (datasets is needed for load_dataset below)
!pip install transformers datasets simpletransformers nltk
Import modules
Let’s import some of the modules that will be used in the project.
import os
import pandas as pd
from datasets import load_dataset
from sklearn.model_selection import train_test_split
Read Dataset
Since we are using an external dataset, let’s use pandas’ read_csv() method to read the dataset into the notebook.
# Load the dataset and display some values
df = pd.read_csv('/content/Natural-Language-Processing-Project/zindi_challenge/data/Train.csv')
Explore Data
Exploring data involves analyzing and visualizing the data to gain insights and understanding of its structure, patterns, and relationships, which can help in making informed decisions and building effective models.
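A quick first look at the dataframe might be (a minimal sketch; the columns follow the Train.csv file loaded above):
# preview the dimensions, column types, and first few rows
print(df.shape)
print(df.dtypes)
df.head()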
i. Check for null values
# check null values
df.isnull().sum()
ii. Check the label counts.
# let's count the number of each label in the data
df['label'].value_counts()
0.000000 4908
1.000000 4053
-1.000000 1038
0.666667 1
Name: label, dtype: int64
From the output we can see that there are four values in the label column: 0, 1, -1, and 0.666667, which appears just once. Clearly 0.666667 is not a valid label and needs to be cleaned out.
iii. Drop null values and confirm
# drop rows containing null values
df.dropna(inplace=True)
# confirm there are no null values left
print(df.isnull().sum())
# check the label counts again
df['label'].value_counts()
tweet_id 0
safe_text 0
label 0
agreement 0
dtype: int64
0.0 4908
1.0 4053
-1.0 1038
Name: label, dtype: int64
After dropping the null values, we can see that the row with the label 0.666667 has been dropped. In this case there will be no further cleaning of the labels.
Split data
We will split the data into train and evaluation sets and confirm the split percentage. The train set will be used to train the model and the evaluation set will be used to evaluate it.
# Split the train data => {train, eval}
train, eval = train_test_split(df, test_size=0.2, random_state=42, stratify=df['label'])
print(f"new dataframe shapes: train is {train.shape}, eval is {eval.shape}")
new dataframe shapes: train is (7999, 4), eval is (2000, 4)
Save Datasets
Let’s save the split parts of the dataset to CSV files for future use.
# Save the split subsets
train.to_csv("train_subset.csv", index=False)
eval.to_csv("eval_subset.csv", index=False)
2.2 Loading and preprocessing data
Two modules need to be imported to prepare the data for the model: one to load the data into a format the model can understand, and one to preprocess it.
load_dataset: a function in Hugging Face’s datasets library that loads and gives access to various datasets for natural language processing tasks.
AutoTokenizer: a class in the transformers library that automatically selects and applies the appropriate tokenizer for a given pre-trained model.
Loading Dataset
The load_dataset function is used to load the train and evaluation subsets of the data.
dataset = load_dataset('csv',
                       data_files={'train': 'train_subset.csv',
                                   'eval': 'eval_subset.csv'},
                       encoding='ISO-8859-1')
Convert all labels to non-negative numbers and remove unwanted columns
This function changes -1 labels (Negative) to 0, 0 labels (Neutral) to 1 and 1 labels (Positive) to 2.
# create a function to convert the labels
def transform_labels(label):
    label = label['label']
    num = 0
    if label == -1:    # 'Negative'
        num = 0
    elif label == 0:   # 'Neutral'
        num = 1
    elif label == 1:   # 'Positive'
        num = 2
    return {'labels': num}
# Transform the labels and remove the columns we no longer need.
# 'safe_text' is kept for now because the tokenizer needs it in the next step.
remove_columns = ['tweet_id', 'label', 'agreement']
dataset = dataset.map(transform_labels, remove_columns=remove_columns)
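As a quick sanity check (the exact text shown will depend on your data), you can inspect one transformed example:
# each example should now contain 'safe_text' and a 'labels' field in {0, 1, 2}
print(dataset['train'][0])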
Using Tokenizers
Tokenization is the process of converting a sequence of text into individual units, or tokens, which can then be used for various natural language processing (NLP) tasks. Tokenizers split text into words, subwords, or characters. You can load the tokenizer that matches a particular pre-trained model using the AutoTokenizer class.
# let's load the tokenizer for the DistilBERT model
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
# let's tokenize the data for the model to be able to understand it
def tokenize_data(example):
    return tokenizer(example['safe_text'], padding='max_length', truncation=True)
Let’s use the map() method of the dataset to tokenize the whole dataset.
dataset = dataset.map(tokenize_data, batched=True)
After tokenizing the dataset, the features in the train data look like:
{'tweet_id': Value(dtype='string', id=None),
'safe_text': Value(dtype='string', id=None),
'label': Value(dtype='float64', id=None),
'agreement': Value(dtype='float64', id=None),
'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None),
'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None)}
2.3 Training and Hyperparameter tuning
To finetune a pre-trained Sentiment Analysis model using Hugging Face’s tools, the selected model architecture needs to be trained on the prepared dataset. The model is then optimized using hyperparameters to achieve the best performance and saved for deployment on Hugging Face’s servers. This finetuned model can be easily integrated into workflows and other applications using Hugging Face’s API for Sentiment Analysis tasks.
Choosing the right Model Architecture
Hugging Face provides pre-trained Sentiment Analysis models of varying complexity, and the choice of model is critical for performance given the available resources. It is recommended to start with simpler models like DistilBERT, and then progress to more complex models like BERT or RoBERTa if needed.
For this project, the chosen model to finetune is “distilbert-base-uncased”, from the DistilBERT model family.
from transformers import AutoModelForSequenceClassification
# Loading a pretrained model while specifying the number of labels in our dataset for fine-tuning
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=3)
Setting Training Arguments
In order to fine-tune a pre-trained model on a specific dataset using Hugging Face’s training tools, it’s important to configure the hyperparameters and settings, which is referred to as setting training arguments. These arguments, such as the number of epochs, learning rate, batch size, and evaluation metrics, play a significant role in improving the accuracy and performance of the model.
# let's set the training arguments
# the default batch size for training arguments
batch_size = 8
# set number of epochs
number_of_epochs = 7
# log once per epoch (i.e. every len(train) // batch_size steps)
logging_steps = len(dataset['train']) // batch_size
# total optimization steps, used to warm up the learning rate over the first 20%
steps = (len(dataset['train']) / batch_size) * number_of_epochs
warmup_steps = int(0.2 * steps)
from transformers import TrainingArguments
training_args = TrainingArguments(
    num_train_epochs=number_of_epochs,
    load_best_model_at_end=True,
    evaluation_strategy='steps',
    save_strategy='steps',
    learning_rate=2e-5,
    logging_steps=logging_steps,
    warmup_steps=warmup_steps,
    save_steps=1000,
    eval_steps=500,
    output_dir="fine-tuned-distilbert-base-uncased"
)
Training
In this particular project, instead of writing a custom training loop, we will utilize the Trainer object from the Hugging Face transformers library.
i. Let’s shuffle the data
# shuffle the datasets
train_dataset = dataset['train'].shuffle(seed=10)
eval_dataset = dataset['eval'].shuffle(seed=10)
ii. Let’s create a Trainer Instance
from transformers import Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset
)
Let’s call the train() method on the Trainer instance we created to start the training.
# Launch the learning process: training
trainer.train()
iii. Setting Metrics and Evaluating the Model
During model evaluation, selecting appropriate evaluation metrics such as accuracy and F1 score is important. With Hugging Face’s built-in evaluation metrics, they can be easily added to the Trainer object to track model progress and performance. Using both accuracy and F1 score is essential in this project because accuracy alone may not provide a complete picture of model performance on imbalanced datasets. The F1 score, which combines precision and recall, gives a better sense of the model’s ability to identify both positive and negative sentiment.
Let’s create our evaluation metrics
import numpy as np
from datasets import load_metric

def compute_metrics(eval_pred):
    # load the metrics to use
    load_accuracy = load_metric("accuracy")
    load_f1 = load_metric("f1")
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    # calculate the metrics using the predicted and true values
    accuracy = load_accuracy.compute(predictions=predictions, references=labels)
    f1 = load_f1.compute(predictions=predictions, references=labels, average="weighted")
    return {"accuracy": accuracy, "f1score": f1}
Let’s create another Trainer instance, this time to evaluate the model
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)
Let’s run evaluate() on the Trainer instance to evaluate the model.
# Launch the final evaluation
trainer.evaluate()
Output from evaluating the model:
{'eval_loss': 0.7857925891876221,
'eval_accuracy': {'accuracy': 0.779},
'eval_f1score': {'f1': 0.7772241749565737},
'eval_runtime': 31.3162,
'eval_samples_per_second': 63.865,
'eval_steps_per_second': 7.983}
After evaluating the model, we achieved an accuracy of 0.779 and a weighted F1 score of approximately 0.777.
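Before deploying, it can help to sanity-check the fine-tuned model on a raw tweet. A minimal sketch (the sample text is invented, and the label mapping mirrors transform_labels above):
import torch

text = 'The vaccine rollout has been going really well!'
# tokenize the text and move the tensors to the same device as the model
inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True)
inputs = {k: v.to(model.device) for k, v in inputs.items()}
with torch.no_grad():
    logits = model(**inputs).logits
pred = logits.argmax(dim=-1).item()
print({0: 'Negative', 1: 'Neutral', 2: 'Positive'}[pred])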
3. Deploying the Model to Hugging Face
Deploying the model to Hugging Face involves saving the fine-tuned model and uploading it to the Hugging Face Model Hub. This allows other users to easily access and use the model through Hugging Face’s API. Once the model is uploaded, it can be shared with others and integrated into various applications and workflows. The Hugging Face Model Hub also provides version control and allows for model updates and improvements over time.
3.1 Create a Model Repository on Hugging Face
Head to the Hugging Face homepage. Click on Sign Up and follow the instructions to create a Hugging Face account. Then click on your profile picture, navigate to Settings, and create an Access Token. This will later be used to log in to your account and access a model repository.
Now let’s create a model repository. Navigate to your profile, click “New Model”, and create your repository.
3.2 Push Finetuned Model to Hugging Face
Before you push your model to your repository from your Colab notebook, you have to run the following commands.
# first install git
!apt-get install git -y
# This is to help save and cache your access token
!git config --global credential.helper store
# to login to hugging face
!huggingface-cli login
After running the above commands, a prompt pops up requesting you to enter your access token. Go to your Hugging Face account, navigate to ‘Access Tokens’, copy your token, and paste it into the prompt. This gives you access to your repository. We can now push our model and tokenizer to Hugging Face by running the following in the notebook:
# push your model and tokenizer to Hugging Face; this call even creates a model card for you
trainer.push_to_hub()
Once you have run this code, you can check your model repository to confirm the changes. If you set the repository to ‘public’ when creating it, the model can then be used by anyone, as in the sketch below.
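Anyone can then load the model straight from the Hub. A minimal sketch (replace ‘your-username’ with your actual Hugging Face username; the repository name here simply mirrors the output_dir used earlier, but yours may differ):
from transformers import AutoModelForSequenceClassification, AutoTokenizer

repo_id = 'your-username/fine-tuned-distilbert-base-uncased'
model = AutoModelForSequenceClassification.from_pretrained(repo_id)
tokenizer = AutoTokenizer.from_pretrained(repo_id)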
It is best practice to train more than one model (DistilBERT, BERT, RoBERTa) when working on a project so you can choose the one that performs best. The accuracy and F1 score of the model can be improved by training it on different datasets and by tuning the hyperparameters.
The notebook containing the code is available on my GitHub page. I welcome your thoughts, feedback, and suggestions for improvement.