Telco Customer Churn Prediction Using Machine Learning

Bright Eshun
21 min read · Mar 18, 2023


1. Introduction

The Telco industry faces a significant challenge of customer churn, which has a severe impact on business revenue. In this project, we aim to predict customer churn using machine learning models. The project will utilize the CRISP-DM framework to guide the process, starting with understanding the business problem and data, preparing the data, building and evaluating machine learning models, and assessing their impact on the business. The ultimate goal of this project is to provide actionable insights that can help businesses prevent customer churn and improve their overall performance.

2. Business Understanding

In this section, we will explore the Telco industry and its challenges related to customer churn. This section is critical for establishing the context of the project, defining the problem statement, and formulating initial hypotheses.

2.1 Defining the business problem

The Telco industry faces business problems like high customer churn rates and increased competition, which can be addressed by identifying key factors and building a predictive model. The Telco Customer Churn Prediction project aims to provide insights and recommendations to reduce churn rates and improve business performance.

2.2 Project Objectives and Success Criteria

The following are the business objectives for this project:

  1. To build an accurate machine learning model that can predict customer churn in the Telco industry, with the goal of reducing customer churn rates and improving business revenue.
  2. To explore and compare the performance of various machine learning algorithms in predicting customer churn for a Telco company, and identify the most effective method for predicting customer churn.
  3. To identify the key factors that contribute to customer churn in the Telco industry and build a predictive model that can provide insights to the company on how to reduce customer churn.

The success criteria for the project will be determined as follows:

  1. The machine learning model should have a high accuracy rate in predicting customer churn, with a target accuracy of at least 75%.
  2. The model should have a high precision rate in identifying customers who are likely to churn, with a target precision of at least 75%.
  3. The model should have a high recall rate in identifying customers who have actually churned, with a target recall rate of at least 70%.
  4. The model should have a high F1 score and F2 score.

2.3 Hypothesis

Null Hypothesis: Online Security, Tech Support, and Total Charges have no effect on whether customers churn.
Alternate Hypothesis: Online Security, Tech Support, and Total Charges have an effect on whether customers churn.

2.4 Questions

The following are the questions I asked about the data:

  1. What is the distribution of tenure over the period, including the maximum and minimum?
  2. What are the average total charges for senior citizens and non-senior citizens?
  3. What are the average total charges for each gender?
  4. What is the number of monthly, yearly, and two-year contract customers who made payments using a particular method?
  5. What is the correlation between the churn column and the rest of the features?
  6. Do customers churn due to online security?
  7. What is the relationship between customer charges and churn?
  8. Does the payment method contribute to why customers churn?
  9. How does the length of a customer’s contract affect their likelihood of churning?
  10. What is the relationship between monthly charges, total charges, and churn?
  11. On average, what is the total amount customers paid to enjoy a particular internet service, and how did it affect customer churn?
  12. What is the average tenure of senior citizens and non-senior citizens under a particular contract, and how does it affect customer churn?

3. Data Understanding

In this phase, we collect and describe the data, and explore it to identify any quality issues. We verify the quality of the data and address any issues we find. Additionally, we analyze the data for patterns and trends, ensuring that it is complete and accurate. This process is crucial to ensure that the data used to build the machine learning models is of high quality and suitable for the project’s objectives.

3.1 Collecting and describing the data

  1. The data was gathered from an anonymous telecommunications company.
  2. Below are the columns in the data:
  • Gender — Whether the customer is a male or a female
  • SeniorCitizen — Whether a customer is a senior citizen or not
  • Partner — Whether the customer has a partner or not (Yes, No)
  • Dependents — Whether the customer has dependents or not (Yes, No)
  • Tenure — Number of months the customer has stayed with the company
  • PhoneService — Whether the customer has a phone service or not (Yes, No)
  • MultipleLines — Whether the customer has multiple lines or not (Yes, No, No phone service)
  • InternetService — Customer’s internet service provider (DSL, Fiber optic, No)
  • OnlineSecurity — Whether the customer has online security or not (Yes, No, No internet service)
  • OnlineBackup — Whether the customer has online backup or not (Yes, No, No internet service)
  • DeviceProtection — Whether the customer has device protection or not (Yes, No, No internet service)
  • TechSupport — Whether the customer has tech support or not (Yes, No, No internet service)
  • StreamingTV — Whether the customer has streaming TV or not (Yes, No, No internet service)
  • StreamingMovies — Whether the customer has streaming movies or not (Yes, No, No internet service)
  • Contract — The contract term of the customer (Month-to-month, One year, Two year)
  • PaperlessBilling — Whether the customer has paperless billing or not (Yes, No)
  • PaymentMethod — The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
  • MonthlyCharges — The amount charged to the customer monthly
  • TotalCharges — The total amount charged to the customer
  • Churn — Whether the customer churned or not (Yes or No)

3.2 Exploring the data and identifying data quality issues

This involves examining the data’s statistical properties, identifying missing values or outliers, and evaluating the accuracy and completeness of the data. By carefully exploring and understanding the data, we can identify and address any quality issues that may affect the accuracy and reliability of the model.

Importing Packages

Let’s import the libraries that will be used for the project.

# Data handling
import pandas as pd
import numpy as np

# Visualisation (Matplotlib, Plotly, Seaborn, etc.)
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px

# Feature Processing (Scikit-learn processing, etc. )
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.metrics import confusion_matrix , classification_report, f1_score, accuracy_score,\
precision_score, recall_score, fbeta_score, make_scorer
from sklearn.model_selection import train_test_split

# Other packages
import os

import warnings
warnings.filterwarnings("ignore")

Loading Datasets

Let’s load our datasets.

# For CSV, use pandas.read_csv
data=pd.read_csv('Telco-Customer-Churn.csv')

Exploratory Data Analysis

  1. Explore data
  2. Check for general properties and for null values
  3. Visualize and Analyze data

Let’s display the first 5 rows of the dataset.

#displaying max columns
# pd.set_option('display.max_columns', None)
data.head()

Let’s check for general information on the dataset.

#checking data information
data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 customerID 7043 non-null object
1 gender 7043 non-null object
2 SeniorCitizen 7043 non-null int64
3 Partner 7043 non-null object
4 Dependents 7043 non-null object
5 tenure 7043 non-null int64
6 PhoneService 7043 non-null object
7 MultipleLines 7043 non-null object
8 InternetService 7043 non-null object
9 OnlineSecurity 7043 non-null object
10 OnlineBackup 7043 non-null object
11 DeviceProtection 7043 non-null object
12 TechSupport 7043 non-null object
13 StreamingTV 7043 non-null object
14 StreamingMovies 7043 non-null object
15 Contract 7043 non-null object
16 PaperlessBilling 7043 non-null object
17 PaymentMethod 7043 non-null object
18 MonthlyCharges 7043 non-null float64
19 TotalCharges 7043 non-null object
20 Churn 7043 non-null object
dtypes: float64(1), int64(2), object(18)
memory usage: 1.1+ MB
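
Step 2 of the checklist above also calls for a null-value check; a quick sketch, including a duplicate check:

# check for missing values and duplicated rows
print(data.isnull().sum())
print('Duplicated rows:', data.duplicated().sum())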

Let’s visualize and analyze the data to answer our questions.

What is the Distribution of Tenure over the period, the maximum and minimum?

# plot a histogram for the distribution of tenure
bins = np.arange(0, data['tenure'].max() + 1, 1)
plt.hist(data=data, x='tenure', bins=bins)
plt.title('The Distribution of Tenure')
plt.xlabel('Tenure')
plt.show()
A graph representing the distribution of tenure over the period.

Customer tenure mainly falls between 5 and 70 months. Some customers were new and had used the service for just one month, while others had stayed for as long as 72 months; these were the most loyal customers.

What are the Average Total Charges for Senior Citizens and Non-Senior Citizens?

# TotalCharges is read in as text; coerce it to numeric first (see section 4.1.1)
data['TotalCharges'] = pd.to_numeric(data['TotalCharges'], errors='coerce')

# group by senior citizen and get the mean total charges
senior_mean = data.groupby(['SeniorCitizen'])['TotalCharges'].mean()
senior_mean

# plot a bar chart of average total charges for senior and non-senior citizens
plt.figure(figsize=(8, 5))
senior_mean.plot(kind='bar')
plt.ylabel('Total Charges')
plt.title('Average Total Charges by Senior Citizenry');
A graph representing the average total charges by senior citizenry

The bar plot showed a clear difference between the two groups: on average, senior citizens had accumulated noticeably higher total charges than non-senior citizens.

What are the Average Total Charges for each Gender?

# group by gender and get mean for total charges
gender_mean = data.groupby(['gender'])['TotalCharges'].mean()
gender_mean


gender
Female 2283.190985
Male 2283.407861
Name: TotalCharges, dtype: float64


# plot average total charges vs gender
plt.figure(figsize=(8, 5))
gender_mean.plot(kind='bar')
plt.ylabel('Total Charges')
plt.title('Average Total Charges by Gender');
A graph representing the average total charges by gender.

Surprisingly, the average total charges for males and females were very similar. The average total charges for males were $2,283.41, while the average total charges for females were $2,283.19, a difference of only about $0.22.

What is the number of Monthly, Yearly and Two-Year contract customers who made payment using a particular method?

# group the data frame by PaymentMethod and Contract and count occurrences
ct_counts = data.groupby(['PaymentMethod', 'Contract']).size()
ct_counts = ct_counts.reset_index(name='count')  # turn the counts into a column
# pivot into a table so it can be plotted as a heatmap
ct_counts = ct_counts.pivot(index='Contract', columns='PaymentMethod', values='count')


# plot a heatmap for the contract and payment method columns
plt.figure(figsize=(8, 5))
sns.heatmap(ct_counts,annot = True, fmt = 'd')
A heatmap showing how many customers under a particular contract used each payment method.

Most customers were on month-to-month contracts, and many of them paid by electronic check, mailed check, or bank transfer. Notably, 1,850 of the 2,365 customers who paid by electronic check were on a month-to-month contract, making it the most common payment method among the 3,875 month-to-month customers.
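
The counts behind the heatmap can be verified numerically; a quick sketch:

# a crosstab reproduces the heatmap's counts as a table
pd.crosstab(data['Contract'], data['PaymentMethod'])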

What is the correlation between the Churn column and the rest of the features?

data_copy = data.copy() # work on a copy of the data frame

def categorical(data):
    '''Convert object columns into numerical category codes.'''
    for col in data.columns:
        if data[col].dtype == 'object':
            data[col] = data[col].astype('category')
            data[col] = data[col].cat.codes

categorical(data_copy)
# get the correlation between every feature and the churn column
churn_corr_df = data_copy.corr()[['Churn']].sort_values(by='Churn', ascending=False)



color = sns.color_palette()[0] # get the first color in the seaborn color palette
plt.figure(figsize=(8,5))
# create a barplot for the churn-correlation
sns.barplot(data=churn_corr_df, x=churn_corr_df.index, y='Churn', color=color)
plt.title('Correlation between Churn and the rest of the Features')
plt.ylabel('Churn Correlation')
plt.xlabel('Column Features')
plt.xticks(rotation=90)
plt.show();
The correlation between the other features and Churn.

Among the various features, MultipleLines, PhoneService, Gender, StreamingTV, StreamingMovies, and InternetService showed the least correlation with churn. Conversely, TechSupport, OnlineSecurity, Tenure, and Contract had a stronger negative correlation with churn. MonthlyCharges, PaperlessBilling, SeniorCitizen, and PaymentMethod had a stronger positive correlation with churn.

Do customers churn due to online security?

# plot a stacked countplot for the OnlineSecurity and Churn columns
fig = px.histogram(data, x="OnlineSecurity", color="Churn", title="Customer Churn Based on OnlineSecurity")
fig.update_layout(width=700, height=500, bargap=0.1)
fig.show()
A graph representing online security count against churn
  • Regarding OnlineSecurity, customers churned whether or not they had online security. Unsurprisingly, though, customers without online security churned the most, more than twice as much as customers with online security and customers with no internet service combined.

What is the relationship between the charges and customers' churn?

plt.figure(figsize=[16, 5])
plt.subplot(1,2, 1)
sns.kdeplot(data=data, x="TotalCharges", hue="Churn", multiple='stack')
plt.title('Distribution of Total Charges by Churn')

plt.subplot(1,2, 2)
sns.kdeplot(data=data, x="MonthlyCharges", hue="Churn", multiple='stack')
plt.title('Distribution of Monthly Charges by Churn');
plt.show()
Two density plots representing the distribution of Monthly charges against churn and total charges against churn

Customers with low total charges churned far more than those with high total charges. Monthly charges showed the opposite pattern: churn was concentrated among customers with higher monthly charges, which is consistent with the positive correlation observed earlier. These differences could reflect the level or quality of services each group receives.

Does the payment method contribute to why customers churn?

# plot a stacked countplot for the PaymentMethod and Churn columns

fig = px.histogram(data, x="PaymentMethod", color="Churn", title="Customer Churn Based on Payment Method")
fig.update_layout(width=700, height=500, bargap=0.1)
fig.show()
A graph representing counts of payment methods against churn.

In the payment method section, we found that the customers churning the most were those who used electronic checks, followed by those who used mailed checks. These payment methods are slower than bank transfers and credit card payments, which had the fewest churning customers and are faster in terms of payment processing.
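
To quantify this, a sketch of the churn proportion within each payment method (the same pattern works for any categorical column):

# proportion of churners within each payment method
data.groupby('PaymentMethod')['Churn'].value_counts(normalize=True).unstack()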

How does the length of a customer’s contract affect their likelihood of churning?

# plot a stacked countplot for the Contract and Churn columns
fig = px.histogram(data, x="Contract", color="Churn", title="Customer Churn Based on Contract")
fig.update_layout(width=700, height=500, bargap=0.1)
fig.show()
A graph representing counts of contract types against churn.

The majority of customers opted for the month-to-month contract. Surprisingly, more monthly customers churned than customers who signed up for a yearly contract. The month-to-month option provided customers with the flexibility to try the service and then leave if they were not satisfied with the telco’s service.

What is the Relationship between Monthly Charges, Total Charges and Churn?

# plot a scatter graph of monthly charges against total charges, split by churn
cat_markers = [['No', 'o'],
               ['Yes', 's']]
plt.figure(figsize=(8, 5))
for cat, marker in cat_markers:
    df = data[data['Churn'] == cat]
    plt.scatter(data=df, y='TotalCharges', x='MonthlyCharges', marker=marker)
plt.legend(['No', 'Yes']);
A correlation between monthly and total charges against churn.

The graph above indicates a strong positive correlation between monthly charges and total charges. Interestingly, churn was concentrated among customers with low total charges, typically newer customers, while the majority of customers with high total charges chose to stay with the company despite their higher costs.

What are the average total amounts customers paid in order to enjoy a particular Internet Service and how did it affect customer Churn?

# group by InternetService, gender, Churn and get mean
inter_gender_df = data.groupby(['InternetService', 'gender', 'Churn'])['TotalCharges'].mean()
inter_gender_df = inter_gender_df.reset_index()


g = sns.FacetGrid(data=inter_gender_df, col='Churn', height=6)
g.map(sns.barplot, 'InternetService', 'TotalCharges', 'gender')
plt.legend()
A graph showing the average total amounts customers paid in order to enjoy a particular internet service and how it affects customer Churn.

Among the customers who churned, both males and females paid around $2,500 on average for DSL. Females paid slightly more than males for fiber optic, and both paid slightly above $500 in total despite having no internet service.

Among the customers who did not churn, surprisingly, males paid a bit more than females for the fiber optic service.

What is the average Tenure of Senior Citizens and Non-Senior Citizens under a particular contract and how does it affect customer Churn?

# group by SeniorCitizen, Contract and Churn and get the mean tenure
sn_gender_df = data.groupby(['SeniorCitizen', 'Contract', 'Churn'])['tenure'].mean()
sn_gender_df = sn_gender_df.reset_index()
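
The plotting code was omitted here; a minimal sketch that mirrors the FacetGrid plot used for the previous question:

# mean tenure per contract, split by churn, with senior citizenry as the hue
g = sns.FacetGrid(data=sn_gender_df, col='Churn', height=6)
g.map(sns.barplot, 'Contract', 'tenure', 'SeniorCitizen')
plt.legend()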
A graph showing the average tenure of senior citizens and non-senior citizens under a particular contract and how it affects customer Churn

For both senior and non-senior citizens, customers who churned on a two-year contract had a much higher average tenure than those who churned on a month-to-month or one-year contract.

3.3 Data Quality

The following observations were made about data quality:

  1. The TotalCharges was an object data type instead of float.
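
A quick way to surface this issue is to coerce the column to numeric and count the entries that fail to parse; a minimal sketch:

# entries that cannot be parsed as numbers (blank strings) become NaN
coerced = pd.to_numeric(data['TotalCharges'], errors='coerce')
print('Non-numeric TotalCharges entries:', coerced.isna().sum())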

4. Data Preparation

In this section we will identify and handle missing or erroneous data, remove duplicates, and convert data types as necessary. Proper data preparation is critical for accurate and meaningful analysis, as the quality of the output depends heavily on the quality of the input data.

4.1 Preparing and Cleaning Data

4.1.1 Converting Data Types

  1. The TotalCharges column had the object datatype and was converted into a numerical datatype. Its blank entries were converted into null values.
  2. The customerID column was dropped.
  3. The Churn column, which is the target for the machine learning model, was changed from “Yes” and “No” to numerical values. “Yes” was replaced with 1 and “No” was replaced with 0.
# convert TotalCharges to numeric; blank entries become NaN
data['TotalCharges'] = pd.to_numeric(data['TotalCharges'], errors='coerce')

# drop the customer ID column
data = data.drop(columns=['customerID'])

# change the values in the churn column into numerical data
data['Churn'] = (data['Churn'] == 'Yes').astype(int)
data['Churn'].unique()

array([0, 1])

4.1.2 Splitting Data into Train and Test

The data was split into train and test sets: 80 percent of the data will be used to train our machine learning models and 20 percent will be used to evaluate them.

# splitting data into 80% train and 20% test
train, test = train_test_split(data, test_size=0.2, random_state=42)


# check and confirm the shape of the train and test data
train.shape, test.shape

((5634, 20), (1409, 20))

After the split, the features and targets were created for both the train and test sets.

# create features and targets from the train and test
X_train = train.drop(columns=['Churn'])
y_train = train['Churn'].copy()

X_test = test.drop(columns=['Churn'])
y_test = test['Churn'].copy()

4.1.3 Handling missing values

The SimpleImputer class from the sklearn.impute module was used to impute the missing values in both numerical and categorical features.

  1. Separate the numerical columns from the categorical columns.
  2. The missing values in the numerical columns were imputed with a constant value of 0.
  3. The missing values in the categorical columns were imputed with the most frequent value.
# select the categorical columns from the train and test data for encoding
train_cat_cols = X_train.select_dtypes(include=['object', 'category']).columns
test_cat_cols = X_test.select_dtypes(include=['object', 'category']).columns


# impute the numerical columns with a constant value of 0
num = SimpleImputer(strategy="constant", fill_value=0)

# impute the categorical columns with the most frequent value
cat = SimpleImputer(strategy="most_frequent")

4.1.4 Handling Categorical columns

OneHotEncoder was used to encode the categorical columns. It creates one binary indicator column per category, which suits nominal features like these.

cat_encoder = OneHotEncoder()
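
As a quick illustration (a sketch on the Contract column), the encoder learns one binary indicator per category:

# fitting on Contract learns its three categories, each of which
# becomes a binary column on transform
demo_encoder = OneHotEncoder().fit(data[['Contract']])
print(demo_encoder.categories_)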

4.1.5 Feature Scaling

The StandardScaler will be used to standardize the numerical features, rescaling each to zero mean and unit variance so they share a comparable range.

# scale the numerical features
num_scaler = StandardScaler()

4.2 Feature Engineering

I assumed the total charges were calculated by multiplying monthly charges by tenure. After checking, I realised there were variations: the product of those columns did not always match the TotalCharges column. This could mean monthly charges have not stayed the same since customers signed up for the service.

The tenure column was also binned into a categorical column of three-month groups.

# let's create a new feature called Monthly Variations
X_train['Monthly Variations'] = X_train.loc[:, 'TotalCharges'] - (X_train.loc[:, 'tenure'] * X_train.loc[:, 'MonthlyCharges'])
X_test['Monthly Variations'] = X_test.loc[:, 'TotalCharges'] - (X_test.loc[:, 'tenure'] * X_test.loc[:, 'MonthlyCharges'])


# change tenure to a categorical
labels =['{0}-{1}'.format(i, i+2) for i in range(0, 73, 3)]
X_train['tenure_group'] = pd.cut(X_train['tenure'], bins=(range(0, 78, 3)), right=False, labels=labels)
X_test['tenure_group'] = pd.cut(X_test['tenure'], bins=(range(0, 78, 3)), right=False, labels=labels)

print(labels)

['0-2', '3-5', '6-8', '9-11', '12-14', '15-17', '18-20', '21-23', '24-26', '27-29', '30-32', '33-35', '36-38', '39-41', '42-44', '45-47', '48-50', '51-53', '54-56', '57-59', '60-62', '63-65', '66-68', '69-71', '72-74']

Pipelines and ColumnTransformers

  1. Separate pipelines were created for the categorical features and the numerical features.
  2. Both pipelines were then combined in a ColumnTransformer to create a full pipeline.
  3. The full pipeline was fit on the train data and used to transform both the train and test data.
  4. The result was a sparse matrix, which was then changed into a pandas dataframe (see the sketch after the code below).
# create variables to hold the numerical and categorical columns
# (recomputed here so that the features engineered above are included)
num_attribs = X_train.select_dtypes(include=['number']).columns.tolist()
cat_attribs = X_train.select_dtypes(include=['object', 'category']).columns.tolist()


# create a numerical pipeline to impute the missing values in the
# numerical columns and standardize them
num_pipeline = Pipeline([('imputer', SimpleImputer(strategy="constant", fill_value=0)),
                         ('std_scaler', StandardScaler())])

# create a categorical pipeline to impute the missing values in the
# categorical columns and encode them (unseen categories are ignored
# so that transforming the test data cannot fail)
cat_pipeline = Pipeline([('imputer', SimpleImputer(strategy="most_frequent")),
                         ('cat_encoder', OneHotEncoder(handle_unknown='ignore'))])


# create a full pipeline by combining the numerical and categorical pipelines
full_pipeline = ColumnTransformer([("numerical", num_pipeline, num_attribs),
                                   ("categorical", cat_pipeline, cat_attribs)],
                                  remainder='passthrough')


# fit the pipeline on the train features, then apply the same (already
# fitted) transformation to the test features to avoid data leakage
X_train_prepared = full_pipeline.fit_transform(X_train)
X_test_prepared = full_pipeline.transform(X_test)
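
Later sections refer to the prepared data as X_train_ and X_test_ and index columns by name, so here is a minimal sketch of the sparse-matrix-to-dataframe conversion mentioned in step 4 above (assuming scikit-learn >= 1.1, where pipelines support get_feature_names_out):

import scipy.sparse as sp

def to_dataframe(matrix, index):
    # the ColumnTransformer may return a sparse matrix; densify if needed
    dense = matrix.toarray() if sp.issparse(matrix) else matrix
    return pd.DataFrame(dense, columns=full_pipeline.get_feature_names_out(), index=index)

X_train_ = to_dataframe(X_train_prepared, X_train.index)
X_test_ = to_dataframe(X_test_prepared, X_test.index)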

4.3 Balancing Train Data

After counting the values in the target labels, I found 4,138 zeros and 1,496 ones. This is known as imbalanced data, and it can bias our model since the two classes are not equally represented.

The SMOTE sampler was used to oversample the minority label so that both labels have equal counts.

This new training data, containing synthetic samples, will be used to train our models.

# Count the number of unique values in the target
y_train.value_counts()

0 4138
1 1496
Name: Churn, dtype: int64

# import the SMOTE technique to oversample the minority
from imblearn.over_sampling import RandomOverSampler, SMOTE


# Create an instance of SMOTE and fit it on the train feature and targets
sm = SMOTE(sampling_strategy='minority')
X_train_, y_train = sm.fit_resample(X_train_, y_train)


# let's confirm the increase in rows after oversampling
len(X_train_), len(y_train)

(8276, 8276)



# Confirm values counts for the targets
y_train.value_counts()


0 4138
1 4138
Name: Churn, dtype: int64

5. Modeling

In this section, we will explore various machine learning algorithms to predict customer churn based on the prepared data. We will evaluate the performance of each model and select the best one for deployment.

5.1 Building and validating models

The following are steps to be taken to build and evaluate the models:

  1. Instantiate the classifier.
  2. Train a new model.
  3. Check the performance on test data. Get the accuracy, recall score and confusion matrix for the test data.
  4. Save the accuracy, recall, precision, F1 and F2 scores in a dataframe.

Let's create a function that plots the confusion matrix, prints the classification report, and returns the accuracy, precision, recall, F1 score and F2 score.

def evaluate_model(model, test, y_true):
    '''Compute the relevant metrics, print the classification report
    and plot the confusion matrix.'''
    pred = model.predict(test)
    F1 = f1_score(y_true, pred)
    accuracy = accuracy_score(y_true, pred)
    precision = precision_score(y_true, pred)
    recall = recall_score(y_true, pred)
    F2 = fbeta_score(y_true, pred, beta=2.0)

    print("classification report : \n", classification_report(y_true, pred))
    cf = confusion_matrix(y_true, pred)
    print("Confusion matrix report : \n", pd.DataFrame(cf, index=['Negatives', 'Positives']))
    sns.heatmap(cf, annot=True)

    return accuracy, precision, recall, F1, F2, pred

Let's create a function that plots the feature importances as a graph and returns the most important features.

def get_features_importance(model, train_features, number_of_important_features=None, cut_off_weight=0):
    '''Plot the feature-importance graph for a tree model and return the
    names and weights of the most important features.'''
    fi = model.feature_importances_
    features_df = pd.DataFrame(fi, index=train_features.columns, columns=['Weight Of Importance'])
    features_df = features_df.sort_values(by='Weight Of Importance', ascending=False)
    features_df = features_df[features_df['Weight Of Importance'] > cut_off_weight]
    if number_of_important_features is None:
        number_of_important_features = len(features_df)
    features_fi_cols = features_df.index.tolist()[:number_of_important_features]

    # plot the importance weight of every feature
    plt.figure(figsize=(6, 20))
    plt.barh(train_features.columns, fi)
    plt.show()
    return features_fi_cols, features_df

Decision Tree Classifier

Create Model


from sklearn.tree import DecisionTreeClassifier
# let's create a decision tree model
dtree = DecisionTreeClassifier(criterion='gini', random_state=100, min_samples_leaf=8, max_depth=6 )

Train Model

# Use the .fit method to train the model
dtree.fit(X_train_, y_train)

DecisionTreeClassifier(max_depth=6, min_samples_leaf=8, random_state=100)

Evaluate Model

accuracy_dtree, precision_tree, recall_dtree, F1_dtree, F2_dtree, dtree_pred = evaluate_model(dtree, X_test_, y_test)



classification report :
precision recall f1-score support

0 0.90 0.74 0.82 1036
1 0.52 0.77 0.62 373

accuracy 0.75 1409
macro avg 0.71 0.76 0.72 1409
weighted avg 0.80 0.75 0.76 1409

Confusion matrix report :
0 1
Negatives 771 265
Positives 84 289

Random Forest Classifier

Create Model

from sklearn.ensemble import RandomForestClassifier
# let's create a random forest model
rfc = RandomForestClassifier(criterion='gini', random_state=100, min_samples_leaf=8, max_depth=6, n_estimators=100)

Train Model

# Use the .fit method to train the model
rfc.fit(X_train_, y_train)

RandomForestClassifier(max_depth=6, min_samples_leaf=8, random_state=100)

Evaluate Model

accuracy_rfc, precision_rfc, recall_rfc, F1_rfc, F2_rfc, rfc_pred = evaluate_model(rfc, X_test_, y_test)
classification report : 
precision recall f1-score support

0 0.90 0.78 0.84 1036
1 0.55 0.76 0.64 373

accuracy 0.78 1409
macro avg 0.73 0.77 0.74 1409
weighted avg 0.81 0.78 0.78 1409

Confusion matrix report :
0 1
Negatives 807 229
Positives 88 285

Logistic Regression Classifier

Create Model

from sklearn.linear_model import LogisticRegression
lgr_model = LogisticRegression()

Train Model

# Use the .fit method
lgr_model.fit(X_train_, y_train)

LogisticRegression()

Evaluate Model

accuracy_lgr, precision_lgr, recall_lgr, F1_lgr, F2_lgr, lgr_pred = evaluate_model(lgr_model, X_test_, y_test)
classification report : 
precision recall f1-score support
0 0.90 0.74 0.82 1036
1 0.52 0.77 0.62 373
accuracy 0.75 1409
macro avg 0.71 0.76 0.72 1409
weighted avg 0.80 0.75 0.76 1409
Confusion matrix report :
0 1
Negatives 771 265
Positives 84 289

Support Vector Machine

Create Model

# Create the SVM model
from sklearn import svm
svm_model = svm.SVC()

Train Model

# Use the .fit method to train the model
svm_model.fit(X_train_, y_train)

Evaluate Model

accuracy_svm, precision_svm, recall_svm, F1_svm, F2_svm, svm_pred = evaluate_model(svm_model, X_test_, y_test)
classification report : 
precision recall f1-score support
0 0.90 0.74 0.82 1036
1 0.52 0.77 0.62 373
accuracy 0.75 1409
macro avg 0.71 0.76 0.72 1409
weighted avg 0.80 0.75 0.76 1409
Confusion matrix report :
0 1
Negatives 771 265
Positives 84 289

Gradient Boosting Classifier

Create Model

from sklearn.ensemble import GradientBoostingClassifier
gb_model = GradientBoostingClassifier()

Train Model

# Use the .fit method
gb_model.fit(X_train_, y_train)

GradientBoostingClassifier()

Evaluate Model

accuracy_gb, precision_gb, recall_gb, F1_gb, F2_gb, gb_pred = evaluate_model(gb_model, X_test_, y_test)
classification report : 
precision recall f1-score support

0 0.89 0.84 0.87 1036
1 0.62 0.71 0.66 373

accuracy 0.81 1409
macro avg 0.75 0.78 0.76 1409
weighted avg 0.82 0.81 0.81 1409

Confusion matrix report :
0 1
Negatives 874 162
Positives 109 264

5.2 Comparing the Models

9 different algorithms were trained. The tree models were then retrained using only their most important features (a sketch of this step follows), giving 11 models in total to choose from. The best models were selected and fine-tuned.
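
A sketch of that retraining step, using the get_features_importance helper defined earlier (the 0.01 importance cut-off is an assumption):

# keep only the features whose importance weight exceeds the cut-off,
# then retrain and re-evaluate the random forest on that subset
important_cols, _ = get_features_importance(rfc, X_train_, cut_off_weight=0.01)
rfc_top = RandomForestClassifier(criterion='gini', random_state=100,
                                 min_samples_leaf=8, max_depth=6, n_estimators=100)
rfc_top.fit(X_train_[important_cols], y_train)
results_top = evaluate_model(rfc_top, X_test_[important_cols], y_test)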

A dataframe containing the models and their various metric scores.

From the above dataframe we can see that the Gradient Boosting model has the best accuracy, precision and F2 scores. The Logistic Regression, Support Vector Machine and Random Forest classifier models have good F1 and accuracy scores.

Unlike the F1 score, which gives equal weight to precision and recall, the F2 score also combines precision and recall but places more emphasis on recall than precision.
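
For reference, F_beta = (1 + beta^2) * precision * recall / (beta^2 * precision + recall); with beta = 2, recall is weighted four times as heavily as precision.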

I want the model to be accurate at identifying customers who churn, hence I used the F2 score to select the best models for fine-tuning.

5.3 Tuning Model Parameters for Better Performance

The selected models for hyperparameter tuning are Gradient Boosting, Logistic Regression, Support Vector Machine and Random Forest classifier models.

  • Create a dictionary of hyperparameters.
  • Provide the hyperparameter dictionary and the chosen model to BayesSearchCV, along with the F2 scorer (defined in the sketch below).
  • Fit the Bayes object on the training data and get the results of the Bayes search.
  • Check the performance on test data. Get the accuracy, recall score and confusion matrix for the test data.
  • Save the accuracy, recall, precision, F1 and F2 scores in a dataframe.
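
The f2_scorer passed to BayesSearchCV is not defined in the listings below; a minimal sketch using make_scorer (imported earlier), together with the BayesSearchCV import:

# BayesSearchCV comes from scikit-optimize (skopt)
from skopt import BayesSearchCV

# F2 scorer: fbeta_score with beta=2 weights recall above precision
f2_scorer = make_scorer(fbeta_score, beta=2)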

Hyperparameter tuning for Gradient Boosting

Parameters

# These are the parameter ranges to be tested
param_grid_gb = {
    'min_samples_split': [200, 85, 108, 500, 800],
    'loss': ['deviance', 'exponential'],
    'learning_rate': np.arange(0.1, 1, 0.1),
    'min_samples_leaf': [60, 40, 35],
    'n_estimators': [100, 150, 300],
    'max_features': ['auto', 'sqrt', 'log2'],
    'max_depth': [8, 9, 12, 6],
    'ccp_alpha': [0.3, 0.5]
}

Creating an instance of BayesSearchCV and training the model

# create a BayesSearchCV to fine-tune the Gradient Boosting classifier

gb_bayes_search = BayesSearchCV(gb_model, param_grid_gb, scoring=f2_scorer, cv=5, return_train_score=True)


gb_bayes_search.fit(X_train_, y_train)

BayesSearchCV(cv=5, estimator=GradientBoostingClassifier(),
return_train_score=True, scoring=make_scorer(fbeta_score, beta=2),
search_spaces={'ccp_alpha': [0.3, 0.5],
'learning_rate': array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]),
'loss': ['deviance', 'exponential'],
'max_depth': [8, 9, 12, 6],
'max_features': ['auto', 'sqrt', 'log2'],
'min_samples_leaf': [60, 40, 35],
'min_samples_split': [200, 85, 108, 500, 800],
'n_estimators': [100, 150, 300]})

Evaluating the performance of the best estimator

# get the best estimator
gb_bayes_search.best_estimator_

GradientBoostingClassifier(ccp_alpha=0.42409884332442466, learning_rate=0.4,
loss='exponential', max_depth=8, max_features='log2',
min_samples_leaf=60, min_samples_split=108)

accuracy_gb_tuned, precision_gb_tuned, recall_gb_tuned, F1_gb_tuned, F2_gb_tuned, gb_pred_tuned\
= evaluate_model(gb_bayes_search.best_estimator_, X_test_, y_test)

classification report :
precision recall f1-score support

0 0.00 0.00 0.00 1036
1 0.26 1.00 0.42 373

accuracy 0.26 1409
macro avg 0.13 0.50 0.21 1409
weighted avg 0.07 0.26 0.11 1409

Confusion matrix report :
0 1
Negatives 0 1036
Positives 0 373

The tuned Gradient Boosting model degenerated into predicting every customer as a churner, most likely because the high ccp_alpha values in the search space pruned its trees too aggressively, leaving it unusable in practice.

Hyperparameter tuning for Logistic Regression

Parameters

# These are the parameter ranges to be tested
from skopt import BayesSearchCV
param_grid = [{'penalty': ['l1', 'l2'],
               'C': [0.001, 0.009, 0.01, 5, 6.4, 7, 10, 25],
               'intercept_scaling': [1, 3.7, 2, 4.3, 7, 10],
               'max_iter': [5, 50, 100, 150, 200, 400, 300, 500],
               'class_weight': ['balanced'],
               'solver': ['saga', 'liblinear'],
               'tol': 10.0 ** -np.arange(1, 7),
               'l1_ratio': np.arange(0, 1, 0.2)}]

Creating an instance of BayesSearchCV and training the model

# create a BayesSearchCV to fine-tune the logistic regression model
logistic_bayes_search = BayesSearchCV(lgr_model, param_grid, scoring=f2_scorer, cv=5, return_train_score=True, n_iter=50)


# train the model
logistic_bayes_search.fit(X_train_, y_train)


BayesSearchCV(cv=5, estimator=LogisticRegression(), return_train_score=True,
scoring=make_scorer(fbeta_score, beta=2),
search_spaces=[{'C': [0.001, 0.009, 0.01, 5, 6.4, 7, 10, 25],
'class_weight': ['balanced'],
'intercept_scaling': [1, 3.7, 2, 4.3, 7, 10],
'l1_ratio': array([0. , 0.2, 0.4, 0.6, 0.8]),
'max_iter': [5, 50, 100, 150, 200, 400, 300, 500],
'penalty': ['l1', 'l2'],
'solver': ['saga', 'liblinear'],
'tol': array([1.e-01, 1.e-02, 1.e-03, 1.e-04, 1.e-05, 1.e-06])}])

Evaluating the performance of the best estimator

# get the best estimator from the Bayes search
logistic_bayes_search.best_estimator_

LogisticRegression(C=7.0, class_weight='balanced', intercept_scaling=4.3,
l1_ratio=0.0, max_iter=5, solver='liblinear', tol=0.001)


accuracy_lgr_tuned, precision_lgr_tuned, recall_lgr_tuned, F1_lgr_tuned, F2_lgr_tuned, lgr_pred_tuned\
= evaluate_model(logistic_bayes_search.best_estimator_, X_test_, y_test)




classification report :
precision recall f1-score support

0 0.92 0.73 0.82 1036
1 0.53 0.83 0.65 373

accuracy 0.76 1409
macro avg 0.73 0.78 0.73 1409
weighted avg 0.82 0.76 0.77 1409

Confusion matrix report :
0 1
Negatives 761 275
Positives 64 309

5.4 Comparing Fine-tuned Models

A dataframe of the fine-tuned models with their metric scores.

From the dataframe above, Logistic Regression had the highest F2 score. Although it fell short of the precision target in our success criteria, it is still our best model.

5.5 Saving the Model

The model and the ColumnTransformer object were saved using pickle.

# Save the model and the columntransformer
import pickle
filename = 'logistic_reg_class_model.pkl'
filename_2 = 'full_pipeline.pkl'
pickle.dump(logistic_bayes_search.best_estimator_, open(filename, 'wb'))
pickle.dump(full_pipeline, open(filename_2, 'wb'))
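
To score new customers later, the saved artifacts can be reloaded; a minimal sketch (new_data is a hypothetical dataframe with the same raw columns as the training features):

# reload the saved pipeline and model, then predict churn for new customers
loaded_pipeline = pickle.load(open('full_pipeline.pkl', 'rb'))
loaded_model = pickle.load(open('logistic_reg_class_model.pkl', 'rb'))

new_prepared = loaded_pipeline.transform(new_data)  # new_data is hypothetical
churn_predictions = loaded_model.predict(new_prepared)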

6. Evaluation

6.1 Evaluating model performance against success criteria

The logistic regression model came out as the best model after fine-tuning.

  1. It achieved an accuracy of 76%.
  2. It achieved a recall score of 83%.
  3. It achieved an F2 score of 74%.
  4. It achieved a precision score of 53%, which was lower than what I intended to achieve.

Overall, I can say this project was a success, although the target precision was not achieved. Precision was traded for a higher recall score: I needed the model to correctly identify customers in the class of interest (churn) more than I needed it to be precise.

You can find the original project on GitHub. All feedback and corrections are welcome.
