Sepsis Prediction in Patients: Insights from Machine Learning and Clinical Data

Bright Eshun
19 min read · Jun 10, 2023

1. Introduction

Sepsis is a life-threatening condition frequently encountered in intensive care units (ICUs). It arises from a dysregulated immune response to infection and carries a high mortality rate. Early detection is crucial, and combining machine learning with clinical data analysis shows promise for predicting sepsis onset. Machine learning algorithms can use clinical data to pick up subtle indicators that precede sepsis, which could allow healthcare professionals to intervene earlier and reduce its severity. This article explores sepsis prediction in ICUs, focusing on insights from machine learning and clinical data analysis and their implications for patient care.

2. Business Understanding

This project aims to develop a sepsis prediction model using machine learning techniques, leveraging clinical data analysis, to improve early detection and intervention in intensive care units, ultimately enhancing patient outcomes and healthcare management.

2.1 Defining the business problem

The business problem addressed in this project is the accurate and early prediction of sepsis in ICUs. Timely detection of sepsis is vital for effective intervention and improved patient outcomes. By developing a reliable sepsis prediction model, healthcare professionals can proactively identify at-risk patients and initiate appropriate treatments, potentially reducing mortality rates and optimizing resource allocation in ICUs.

2.2 Project Objectives and Success Criteria

The following are the Project Objectives for this project:

  1. Develop a sepsis prediction model using machine learning techniques.
  2. Validate the model’s performance using appropriate evaluation metrics.
  3. Promote the adoption of machine learning and clinical data analysis in sepsis prediction.

Success Criteria:

The success criteria for the project will be determined as follows:

  1. The sepsis prediction machine learning model should achieve a high accuracy rate, with a target accuracy of at least 75%, to effectively predict sepsis in ICU patients.
  2. The model should demonstrate a high precision rate in identifying patients at risk of sepsis, with a target precision of at least 75%, ensuring that the majority of predicted cases are true positive and minimizing false positives.
  3. The model should exhibit a high recall rate in identifying patients who actually develop sepsis, with a target recall of 75% to 80% depending on how balanced the data is, minimizing false negatives and capturing a significant proportion of true positive cases.

Assumptions:

  1. It was assumed that the blood pressure used was the diastolic type.

2.3 Hypothesis

Null Hypothesis: There is no relationship between high Body Mass Index and sepsis.
Alternate Hypothesis: There is a relationship between high Body Mass Index and sepsis.

2.4 Questions

The following are the questions I asked about the data:

  1. How many patients are underweight, have a healthy weight, overweight, obese, and severely obese?
  2. What is the distribution of ages for patients captured in the data?
  3. How many patients fall under the categories of Normal, Elevated, and High Blood Pressure?
  4. Is Body Mass Index affected by Age?
  5. Is Blood Pressure affected by Age?
  6. What is the relationship between Age and Body Mass Index?
  7. How many patients have a tendency to develop sepsis? Which age group is more prone to developing sepsis?
  8. Does having insurance enhance patients’ chances of developing sepsis?
  9. Is body mass directly correlated with a patient’s tendency to develop sepsis?
  10. Are the blood parameters associated with sepsis?

3. Data Understanding

During this phase, we analyze and explore the available clinical data to gain insights into the variables and their relationships, enabling us to better understand the data’s characteristics and potential for sepsis prediction.

3.1 Data Collection and Sources

The data used in this study was generously provided by The Johns Hopkins University, a renowned institution located at Johns Hopkins Road, Laurel, MD 20707. The dataset is a modified version of a publicly available data source, and its usage is subject to copyright restrictions.

3.2 Description of the data used in the study

The clinical data used in this study consists of various attributes related to patients in an intensive care unit (ICU). These attributes provide valuable insights into the patients’ health status and help in predicting the onset of sepsis, a critical condition that requires timely intervention.

The dataset includes the following attributes:

  1. ID: Each patient is assigned a unique identification number, allowing for individual tracking and analysis.
  2. PRG (Plasma glucose): This attribute represents the plasma glucose levels of the patients. Glucose levels can serve as an indicator of metabolic health and can provide insights into the patients’ overall condition.
  3. PL (Blood Work Result-1): This attribute indicates the first blood work result measured in mu U/ml. Blood work results are essential for evaluating the patients’ biochemical profiles and identifying any abnormalities.
  4. PR (Blood Pressure): The blood pressure of patients, measured in mm Hg, is captured in this attribute. Blood pressure is a crucial vital sign that can indicate the patients’ cardiovascular health and potential risks.
  5. SK (Blood Work Result-2): This attribute represents the second blood work result measured in mm. Similar to Blood Work Result-1, this attribute provides additional information about the patients’ blood chemistry and overall health.
  6. TS (Blood Work Result-3): The third blood work result, measured in mu U/ml, is captured in this attribute. Blood work results, especially when analyzed collectively, can offer insights into the patients’ organ function and potential abnormalities.
  7. M11 (Body mass index): Body mass index (BMI) is a measure of weight in relation to height and provides an indication of patients’ body composition and potential risks associated with obesity or malnutrition.
  8. BD2 (Blood Work Result-4): This attribute represents the fourth blood work result measured in mu U/ml. Including multiple blood work results allows for a more comprehensive assessment of the patients’ physiological state.
  9. Age: The age of the patients, measured in years, provides information about their demographic characteristics and potential age-related factors that may contribute to sepsis risk.
  10. Insurance: This attribute indicates whether a patient holds a valid insurance card, which can be relevant in analyzing the influence of insurance coverage on sepsis prediction and access to healthcare resources.
  11. Sepsis: The target attribute represents whether a patient in the ICU will develop sepsis (Positive) or not (Negative). This attribute is the main focus of the study and serves as the basis for sepsis prediction.

3.3 Exploratory Data Analysis

The Exploratory Data Analysis (EDA) phase is a crucial step in understanding the dataset and gaining insights into the variables and their relationships. In this phase, we will thoroughly examine the data, visualize its patterns and distributions, and uncover any interesting trends or anomalies. By performing EDA, we aim to discover key features, identify potential data issues, and make informed decisions about data preprocessing and modeling strategies. This comprehensive analysis will lay the foundation for further exploration and the development of accurate and effective predictive models for sepsis prediction in intensive care units.

3.3.1 Overview of Dataset

Importing Packages

Let’s import the libraries that will be used for the project.

# Data handling
import pandas as pd
import numpy as np

# Visualization (Matplotlib, Plotly, Seaborn, etc.)
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px

# Feature Processing (Scikit-learn processing, etc. )
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.metrics import confusion_matrix , classification_report, f1_score, accuracy_score,\
precision_score, recall_score, fbeta_score, make_scorer, roc_auc_score
from sklearn.model_selection import train_test_split, GridSearchCV
from skopt import BayesSearchCV
from sklearn.utils import class_weight
# models
from sklearn import svm
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

import pickle
import os
import warnings
warnings.filterwarnings("ignore")

Loading dataset

# For CSV, use pandas.read_csv
df = pd.read_csv('../datasets/Paitients_Files_Train.csv')

df.head(5)

Check general properties

# display the datatypes
df.dtypes

ID object
PRG int64
PL int64
PR int64
SK int64
TS int64
M11 float64
BD2 float64
Age int64
Insurance int64
Sepssis object
dtype: object

Check the shape of the data

# check the shape of the data
df.shape

(599, 11)

Check for null values in the data

# check for null values
df.isnull().sum()

ID 0
PRG 0
PL 0
PR 0
SK 0
TS 0
M11 0
BD2 0
Age 0
Insurance 0
Sepssis 0
dtype: int64

Check for column names

# check columns 
df.columns

Index(['ID', 'PRG', 'PL', 'PR', 'SK', 'TS', 'M11', 'BD2', 'Age', 'Insurance',
'Sepssis'],
dtype='object')

Check for duplicates

# check the number of duplicates
df.duplicated().sum()

0

Distribution patterns for numerical columns

The distributions of Blood Work Result-4, Blood Work Result-2, Age and Blood Work Result-3 are skewed to the right. Blood Work Result-1, Body Mass Index and Blood Pressure have almost symmetrical distributions.

Class balance

The dataset is imbalanced: there are almost twice as many negative sepsis cases as positive ones.
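To make the imbalance concrete, the class shares and a set of "balanced" class weights can be computed as sketched below. This is a minimal illustration on a hypothetical toy target (6 negatives, 3 positives, already encoded as 0 and 1), not on the article's data; the weights it prints are not the ones used later in the tuning step.

```python
import numpy as np
import pandas as pd
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical toy target (not the article's data): 6 negatives, 3 positives.
target = pd.Series([0] * 6 + [1] * 3, name="Sepssis")

# Share of each class in the target column.
class_share = target.value_counts(normalize=True)

# "balanced" weights follow n_samples / (n_classes * class_count).
classes = np.array([0, 1])
weights = compute_class_weight(class_weight="balanced", classes=classes, y=target.to_numpy())
class_weights = dict(zip(classes.tolist(), weights.tolist()))

print(class_share)
print(class_weights)
```

A dictionary of this shape can be passed to the class_weight parameter of many scikit-learn classifiers so that mistakes on the minority class cost more during training.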

3.3.2 Data quality

Data quality is essential for reliable results; thorough examination of the dataset for missing values, outliers, and inconsistencies ensures optimal performance of prediction models, maintaining integrity and validity.

The following observations were made about data quality:

  1. The Insurance column is stored as an integer (int64) data type instead of a category/object.
  2. The column names are not descriptive.

3.3.3 Hypothesis testing

import scipy.stats as stats

# Select the BMI and sepsis columns from the dataset.
# The BMI column is named 'M11' until the columns are renamed in section 3.3.4.
bmi_col = 'Body Mass Index' if 'Body Mass Index' in df.columns else 'M11'
bmi = df[bmi_col]
sepsis = (df['Sepssis'] == 'Positive').astype(int)

# Perform correlation analysis
correlation, p_value = stats.pearsonr(bmi, sepsis)

# Print the correlation coefficient and p-value
print("Correlation coefficient:", correlation)
print("P-value:", p_value)

# Compare the p-value with a 5% significance level
if p_value > 0.05:
    print('Fail to reject the null hypothesis.')
else:
    print('Reject the null hypothesis')

Correlation coefficient: 0.31589377926855083
P-value: 2.3972519626653513e-15
Reject the null hypothesis

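Because the p-value is far below 0.05, we reject the null hypothesis: there is a statistically significant, moderate positive correlation (r ≈ 0.32) between body mass index and sepsis in this dataset.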
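When one variable is binary, as the sepsis label is here, the Pearson coefficient is the same quantity as the point-biserial correlation. The sketch below illustrates this on synthetic data (generated with an arbitrary seed, not taken from the dataset); it is meant only to show that the two SciPy functions agree.

```python
import numpy as np
from scipy import stats

# Synthetic illustration only: a binary outcome and a continuous measure
# whose mean is shifted upward for the positive group.
rng = np.random.default_rng(0)
outcome = rng.integers(0, 2, size=200)
bmi = 28 + 4 * outcome + rng.normal(0, 6, size=200)

# Pearson correlation with a 0/1 variable ...
r_pearson, p_pearson = stats.pearsonr(bmi, outcome)
# ... matches the point-biserial correlation (binary variable first).
r_pb, p_pb = stats.pointbiserialr(outcome, bmi)

print(r_pearson, r_pb)
```

A significant correlation of this kind indicates association only; it does not show that a higher BMI causes sepsis.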

3.3.4 Visualize and Analyze data

We will answer the questions raised about our dataset.

Rename column names to descriptive names

df = df.rename(columns={'PRG': 'Plasma Glucose', 'PL': 'Blood Work Result-1',
'PR': 'Blood Pressure','SK': 'Blood Work Result-2',
'TS': 'Blood Work Result-3', 'M11': 'Body Mass Index',
'BD2': 'Blood Work Result-4'})

Based on the assumptions made, some functions were used to categorize the Blood Pressure and Body Mass Index.

# function to create a new column 'Bmi'
def create_bmi_range(row):
    if (row['Body Mass Index'] <= 18.5):
        return 'Under Weight'
    elif (row['Body Mass Index'] > 18.5) and (row['Body Mass Index'] <= 24.9):
        return 'Healthy Weight'
    elif (row['Body Mass Index'] > 24.9) and (row['Body Mass Index'] <= 29.9):
        return 'Over Weight'
    elif (row['Body Mass Index'] > 29.9) and (row['Body Mass Index'] < 40):
        return 'Obesity'
    elif row['Body Mass Index'] >= 40:
        return 'Severe Obesity'


# create a function to create a new column called blood pressure ranges
def blood_pressure_ranges(row):
    if row['Blood Pressure'] < 80:
        return 'normal'
    elif row['Blood Pressure'] >= 80 and row['Blood Pressure'] <= 89:
        return 'elevated'
    elif row['Blood Pressure'] >= 90:
        return 'high'
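The article does not show how these functions are applied. One plausible way, sketched here under that assumption, is to map a categorisation function over the column and count the resulting groups. The helper bmi_category is a condensed, hypothetical copy of the thresholds in create_bmi_range, and the BMI values are made up for illustration.

```python
import pandas as pd


def bmi_category(bmi):
    """Condensed copy of the thresholds used in create_bmi_range (hypothetical helper)."""
    if bmi <= 18.5:
        return 'Under Weight'
    if bmi <= 24.9:
        return 'Healthy Weight'
    if bmi <= 29.9:
        return 'Over Weight'
    if bmi < 40:
        return 'Obesity'
    return 'Severe Obesity'


# Made-up values for illustration only (not taken from the dataset).
toy = pd.DataFrame({'Body Mass Index': [17.0, 22.4, 27.3, 33.6, 35.1, 43.1]})

# Create the 'Bmi' column and count the patients in each category.
toy['Bmi'] = toy['Body Mass Index'].map(bmi_category)
bmi_counts = toy['Bmi'].value_counts()
print(bmi_counts)
```

The row-wise functions in the article would be applied in a similar spirit with DataFrame.apply and axis=1, since they expect a whole row rather than a single value.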

How many patients fell under weight, healthy weight, overweight, obese and severe obesity?

  • Based on the graph above, it can be observed that the majority of patients fall under the obesity category. The next highest category is overweight, followed by the healthy weight category. The underweight category has the lowest number of patients.

Is body mass directly correlated with a patient's tendency to get sepsis?

  • Based on the graph above, it can be observed that a higher proportion of patients with obesity and severe obesity had sepsis compared to patients within the healthy weight and underweight categories. This suggests a potential association between body mass ranges and the likelihood of developing sepsis.

How do Blood Pressure and Plasma Glucose relate to sepsis?

  • For both positive and negative sepsis cases, there is almost no correlation between Plasma Glucose and Blood Pressure.

“Refer to notebook for the rest of the questions.”

4. Data Preparation

In this section we identify and handle missing or erroneous data, remove duplicates, and convert data types as necessary. Proper data preparation is critical for accurate and meaningful analysis, as the quality of the output depends heavily on the quality of the input data.

4.1 Data preprocessing

Data preprocessing techniques are applied to prepare the dataset for analysis, including handling missing values, normalizing features, encoding variables, and scaling the data. This involves examining statistical properties, identifying outliers, and evaluating data accuracy and completeness. By understanding and addressing data quality issues, we ensure the accuracy and reliability of the model.

4.1.1 Converting Data Types

  1. The Insurance column was converted into a category data type
  2. The ID column was dropped.
  3. The Sepsis column (named Sepssis in the dataset), which contains the target values for the machine learning model, was changed from “Positive” and “Negative” to numerical values: “Positive” was replaced with 1 and “Negative” with 0.

# Drop the ID column
df_ = df.drop(columns=['ID'])
# convert the Insurance into a categorical column
df_['Insurance'] = df['Insurance'].astype('category')
#  change values in Sepsis column into numerical data
df_['Sepssis'] = (df_['Sepssis'] == 'Positive').astype(bool).astype(int)
df_['Sepssis'].unique()

array([1, 0])

4.1.2 Splitting Data into Train and Test

The data was split into train and test sets: 80 percent of the data is used to train the machine learning models and 20 percent is used to evaluate them.

# Split the data into 80% train and 20% test with a fixed random_state
# (no stratify argument is passed, so class proportions may differ slightly between the two sets)
train, test = train_test_split(df_, test_size=0.2, random_state=42)
print(f'Train: {train.shape}, Test: {test.shape}')

Train: (479, 10), Test: (120, 10)
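Because the classes are imbalanced, a stratified split is a common alternative that keeps the positive rate the same in both sets. The sketch below shows the idea on a hypothetical toy frame; it is not the split used in the article, and switching to it would change the reported scores.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 35 of them positive.
toy = pd.DataFrame({'feature': np.arange(100), 'Sepssis': [1] * 35 + [0] * 65})

# stratify keeps the class proportions (almost) identical in both splits.
train_s, test_s = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy['Sepssis']
)

train_rate = train_s['Sepssis'].mean()
test_rate = test_s['Sepssis'].mean()
print(f'Positive rate - train: {train_rate:.2f}, test: {test_rate:.2f}')
```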

The target feature and train features were separated.

# create features and targets from the train data
X_train = train.drop(columns=['Sepssis'])
y_train = train['Sepssis'].copy()

# create features and targets from test data
X_test = test.drop(columns=['Sepssis'])
y_test = test['Sepssis'].copy()

4.1.3 Feature Processing and Feature Engineering

Feature processing involves the manipulation or transformation of individual features in the dataset. It applies various techniques to modify the raw features or derive new features from the existing ones.

Create new features

# get the product of all the numerical columns

X_train['All-Product'] = X_train['Blood Work Result-4'] * X_train['Blood Work Result-1']*\
X_train['Blood Work Result-2']* X_train['Blood Work Result-3'] * X_train['Plasma Glucose']\
*X_train['Blood Pressure'] * X_train['Age']* X_train['Body Mass Index']

The code calculates the product of all the numerical columns and stores it in a new column called ‘All-Product’. The same step is applied to the testing dataset.
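The same row-wise product can be written more compactly with DataFrame.prod. The sketch below uses a hypothetical three-column toy frame rather than the full feature set, purely to show the call.

```python
import pandas as pd

# Hypothetical toy rows for illustration only.
toy = pd.DataFrame({
    'Plasma Glucose': [6, 1],
    'Blood Work Result-1': [148, 85],
    'Age': [50, 31],
})

# Multiply the selected columns together, row by row.
numeric_cols = ['Plasma Glucose', 'Blood Work Result-1', 'Age']
toy['All-Product'] = toy[numeric_cols].prod(axis=1)
print(toy)
```

One property of a product feature worth keeping in mind is that a zero in any single column makes the whole product zero for that row.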

Feature engineering involves creating new features or transforming existing features to enhance the predictive power of the machine learning models. The aim is to capture relevant information and patterns from the data that may improve the model’s performance and predictive accuracy.

Create new features


# bin the product of all numerical features into fixed-width ranges
blood_max = X_train['All-Product'].max()
bin_max = 3500000000000
# create a new column 'All-Product_range'
all_labels =['{0}-{1}'.format(i, i+500000000000) for i in range(0, round(blood_max),500000000000)]
X_train['All-Product_range'] = pd.cut(X_train['All-Product'], bins=(range(0, bin_max, 500000000000)), right=False, labels=all_labels)

print(all_labels)



# get the min and max of the ages
age_min = df['Age'].min()
age_max = df['Age'].max()

# create a new column 'Age Group'
age_labels =['{0}-{1}'.format(i, i+20) for i in range(0, age_max,20)]
X_train['Age Group'] = pd.cut(X_train['Age'], bins=(range(0, 120, 20)), right=False, labels=age_labels)

print(age_labels)



# get the max of the BMI
bmi_max = df['Body Mass Index'].max()
# create a new column 'BMI_range'
labels =['{0}-{1}'.format(i, i+30) for i in range(0, round(bmi_max),30)]
X_train['BMI_range'] = pd.cut(X_train['Body Mass Index'], bins=(range(0, 120, 30)), right=False, labels=labels)

print(labels)
print(bmi_max)




# get the max of blood pressure
bp_max = df['Blood Pressure'].max()
# create a new column 'BP_range'
labels =['{0}-{1}'.format(i, i+50) for i in range(0, round(bp_max),50)]
X_train['BP_range'] = pd.cut(X_train['Blood Pressure'], bins=(range(0, 200, 50)), right=False, labels=labels)
X_test['BP_range'] = pd.cut(X_test['Blood Pressure'], bins=(range(0, 200, 50)), right=False, labels=labels)

print(labels)


# get the max of plasma glucose
pg_max = df['Plasma Glucose'].max()
# create a new column 'PG_range'
labels =['{0}-{1}'.format(i, i+7) for i in range(0, round(pg_max),7)]
X_train['PG_range'] = pd.cut(X_train['Plasma Glucose'], bins=(range(0, 28, 7)), right=False, labels=labels)

The binning code does the following:

  1. It creates a new column called ‘All-Product_range’ by categorizing the ‘All-Product’ values into bins based on a specified range.
  2. It creates a new column called ‘Age Group’ by categorizing the ‘Age’ values into bins based on a specified range.
  3. Another new column called ‘BMI_range’ is created by categorizing the ‘Body Mass Index’ values into bins based on a specified range.
  4. Similarly, a new column called ‘BP_range’ is created by categorizing the ‘Blood Pressure’ values into bins based on a specified range.
  5. Finally, a new column called ‘PG_range’ is created by categorizing the ‘Plasma Glucose’ values into bins based on a specified range.
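pd.cut requires exactly one label per bin, so the number of labels must be one less than the number of bin edges. A small sketch, using made-up ages, that derives the labels from the edges themselves so the two cannot drift apart:

```python
import pandas as pd

# Made-up ages for illustration only.
ages = pd.Series([5, 20, 39, 40, 81], name='Age')

# Bin edges 0, 20, ..., 100 give five half-open bins [0, 20), [20, 40), ...
edges = list(range(0, 120, 20))
age_labels = [f'{lo}-{hi}' for lo, hi in zip(edges[:-1], edges[1:])]

# right=False means each bin includes its left edge and excludes its right edge.
age_group = pd.cut(ages, bins=edges, right=False, labels=age_labels)
print(age_group.tolist())
# → ['0-20', '20-40', '20-40', '40-60', '80-100']
```

Deriving the labels from the edges is a design choice rather than the article's method; the article builds labels from the column maximum, which works as long as that maximum happens to produce the same number of labels as bins.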

4.1.4 Handling missing values

Missing values in both numerical and categorical features are imputed using the SimpleImputer class from the sklearn.impute module. The training data contains no null values (see the null check above), so the imputers act as a safeguard for unseen data.

  1. Let’s separate the numerical columns from the categorical columns.
  2. The missing values in both numerical and categorical columns were imputed with the most frequent value.

# select the categorical columns from train and test data for encoding
train_cat_cols = X_train.select_dtypes(include=['object', 'category']).columns
test_cat_cols = X_test.select_dtypes(include=['object', 'category']).columns
# check that the train categorical columns are the same as the test categorical columns
train_cat_cols == test_cat_cols

array([ True, True, True, True, True, True])

# impute the numerical columns with the most frequent value
num = SimpleImputer(strategy="most_frequent")

# impute the categorical columns with the most frequent value
cat = SimpleImputer(strategy="most_frequent")

4.1.5 Handling Categorical columns

The categorical columns were encoded using the OneHotEncoder method. This encoding technique transforms categorical variables into binary vectors, allowing them to be included in the machine learning model’s numerical calculations.

cat_encoder = OneHotEncoder()

4.1.6 Feature Scaling

To standardize the numerical features and bring them to a common scale, the StandardScaler method will be applied.

# scale the numerical features
num_scaler = StandardScaler()

4.1.7 Pipelines and ColumnTransformers

  1. Separate pipelines were created for the categorical features and numerical features.
  2. Both pipelines were then combined in a ColumnTransformer to create a preprocessor pipeline.
  3. The preprocessor pipeline was used to fit and transform the train data and to transform the test data.
  4. The result was a NumPy array, which was then converted into a pandas DataFrame.

# create variables to hold numerical and categorical columns
# (train_num_cols is not defined elsewhere in the article; it is derived here from the numeric dtypes)
train_num_cols = X_train.select_dtypes(include='number').columns
num_attribs = list(train_num_cols)
cat_attribs = list(train_cat_cols)
# create a numerical pipeline to impute missing values and standardize the numerical columns
num_pipeline = Pipeline([('imputer',SimpleImputer(strategy="most_frequent")),('std_scaler', StandardScaler())])

# create a categorical pipeline to impute missing values and encode the categorical columns
cat_pipeline = Pipeline([('imputer',SimpleImputer(strategy="most_frequent")),('cat_encoder', OneHotEncoder(handle_unknown='ignore'))])
# create the full preprocessor by combining the numerical and categorical pipelines
preprocessor = ColumnTransformer([("numerical",num_pipeline, num_attribs), ("categorical",cat_pipeline, cat_attribs)])
# fit the preprocessor on the train features, then reuse the fitted preprocessor on the test features
X_train_prepared = preprocessor.fit_transform(X_train)
X_test_prepared = preprocessor.transform(X_test)

Note: Methods applied on the train data were also applied on the test data.

4.2 Feature selection

Feature selection is crucial for effective prediction models; by identifying the most relevant features and leveraging domain knowledge, we can improve accuracy, reduce complexity, and focus on key predictors of sepsis in the dataset.

Several columns were dropped from the training and testing datasets. The columns dropped include ‘Blood Pressure’, ‘Age’, ‘Body Mass Index’, ‘Plasma Glucose’, ‘All-Product’, ‘Blood Work Result-3’, and ‘Blood Work Result-2’. ‘Blood Work Result-3’, and ‘Blood Work Result-2’ had very low correlation with the target.


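# correlate only the numeric columns (the categorical Insurance column is excluded)
df_corr = df_.corr(numeric_only=True)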
df_corr['Sepssis'].sort_values(ascending=False)

Sepssis 1.000000
Blood Work Result-1 0.449719
Body Mass Index 0.315894
Age 0.210234
Plasma Glucose 0.207115
Blood Work Result-4 0.181561
Blood Work Result-3 0.145892
Blood Work Result-2 0.075585
Blood Pressure 0.061086
Name: Sepssis, dtype: float64

‘Blood Pressure’, ‘Age’, ‘Body Mass Index’, ‘Plasma Glucose’ and ‘All-Product’ were replaced with their binned (categorical) counterparts.

# drop columns 
X_train = X_train.drop(columns=['Blood Pressure', 'Age', 'Body Mass Index','Plasma Glucose', 'All-Product', 'Blood Work Result-3', 'Blood Work Result-2'])
X_test = X_test.drop(columns=['Blood Pressure', 'Age', 'Body Mass Index', 'Plasma Glucose', 'All-Product', 'Blood Work Result-3', 'Blood Work Result-2'])

5. Modeling

In this section, we aim to explore and evaluate machine learning algorithms for sepsis prediction. Our goal is to select the most effective model by assessing metrics such as accuracy, precision, area under curve and recall. This process ensures the deployment of a model that accurately detects sepsis and improves patient outcomes.

5.1 Building and validating models

The following are steps to be taken to build and evaluate the models:

  1. Instantiate the classifier.
  2. Train a new model.
  3. Check the performance on test data. Get the accuracy, F1, precision, recall and area under curve scores.
  4. Save the accuracy, recall, precision and area under curve scores in a dataframe.

Let’s create a function that returns the accuracy, F1 score, precision, recall and AUC score of a model on the test data.

def evaluate_model(model, x_test, y_test):
    # predict class labels for the test features
    pred = model.predict(x_test)
    # compute the AUC from the predicted probability of the positive class
    auc_score = roc_auc_score(y_test, model.predict_proba(x_test)[:, 1])
    accuracy = accuracy_score(y_test, pred)
    f1 = f1_score(y_test, pred)
    precision = precision_score(y_test, pred)
    recall = recall_score(y_test, pred)
    return accuracy, f1, precision, recall, auc_score

Logistic Regression Classifier

Create Model

# create model
lgr_model = LogisticRegression()

Train Model

# Use the .fit method
lgr_model.fit(X_train_df, y_train)

LogisticRegression()

Evaluate Model

accuracy, f1, precision, recall, auc_score = evaluate_model(lgr_model, X_test_df, y_test)
accuracy, f1, precision, recall, auc_score


(0.825, 0.7469879518072289, 0.775, 0.7209302325581395, 0.8091211114466929)

Random Forest Classifier

Create Model

# create a randon forest model
rfc = RandomForestClassifier()

Train Model

# Use the .fit method to train the model
rfc.fit(X_train_df, y_train)

RandomForestClassifier()

Evaluate Model

accuracy, f1, precision, recall, auc_score = evaluate_model(rfc, X_test_df, y_test)
accuracy, f1, precision, recall, auc_score

(0.7083333333333334, 0.6153846153846155, 0.5833333333333334, 0.6511627906976745, 0.7680459075807913)

[Image: results from training the models]

5.2 Comparing the Models

Definition of Metrics

Accuracy: Accuracy measures the overall correctness of the model’s predictions by calculating the ratio of correctly predicted instances to the total number of instances.

F1 Score: The F1 score is the harmonic mean of precision and recall. It accounts for both false positives and false negatives and provides a single metric that balances the two.

Precision: Precision measures the proportion of correctly predicted positive instances out of the total instances predicted as positive. It focuses on the model’s ability to minimize false positives, providing insights into the model’s precision in identifying true positive cases.

Recall: Recall, also known as sensitivity or true positive rate, measures the proportion of correctly predicted positive instances out of the total actual positive instances. It emphasizes the model’s ability to minimize false negatives, providing insights into the model’s ability to identify all positive cases.

AUC Score: The Area Under the Curve (AUC) score is a performance metric used for binary classification models. It represents the model’s ability to distinguish between positive and negative instances by calculating the area under the receiver operating characteristic (ROC) curve. A higher AUC score indicates better performance in terms of distinguishing between classes.
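The definitions above can be checked by hand from the four cells of a confusion matrix. The sketch below uses a small made-up set of labels and predictions (not the article's results) and also computes the F2 score, which weights recall more heavily than precision.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             fbeta_score, precision_score, recall_score)

# Made-up labels and predictions for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the cells in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
# F-beta with beta = 2: (1 + 2**2) * P * R / (2**2 * P + R)
f2 = 5 * precision * recall / (4 * precision + recall)

print(accuracy, precision, recall, f1, f2)
```

The hand-computed values agree with accuracy_score, precision_score, recall_score, f1_score and fbeta_score(beta=2) from scikit-learn on the same inputs.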

Interpretation of results

In the results table, the machine learning algorithms are compared on these metrics. The Logistic Regression classifier achieved an accuracy of 0.825, an F1 score of 0.747, a precision of 0.775, a recall of 0.721 and an AUC score of 0.809. The Support Vector and Random Forest classifiers also performed well. These three models will be fine-tuned and compared.

5.3 Tuning Model Parameters for Better Performance

The selected models for hyperparameter tuning are Logistic Regression, Support Vector Machine and Random Forest classifier models.

  • Create a dictionary of hyperparameters.
  • Provide the hyperparameter dictionary and the chosen model in GridSearchCV.
  • Fit the Grid object on training data and get the results of GridSearchCV.
  • Check the performance on test data. Get the accuracy, recall score and area under curve scores for test data.
  • Save the accuracy, recall, precision, F1 score and area under curve in a dataframe.
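The steps above can be sketched end to end on synthetic data. Everything prefixed with demo_ or suffixed with _demo is hypothetical: the data comes from make_classification, the grid is deliberately tiny, and refitting on recall is an arbitrary choice for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV

# Synthetic, mildly imbalanced data for illustration only.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, weights=[0.65, 0.35], random_state=0
)

# Several metrics are tracked; refit names the one used to pick the best model.
demo_scoring = {
    'accuracy': 'accuracy',
    'recall': 'recall',
    'f2_score': make_scorer(fbeta_score, beta=2),
}

# liblinear supports both the l1 and l2 penalties, so every combination is valid.
demo_grid = {'C': [0.1, 1, 10], 'penalty': ['l1', 'l2']}

demo_search = GridSearchCV(
    LogisticRegression(solver='liblinear'),
    demo_grid,
    scoring=demo_scoring,
    cv=5,
    refit='recall',
)
demo_search.fit(X_demo, y_demo)

best_params = demo_search.best_params_
mean_f2 = demo_search.cv_results_['mean_test_f2_score']
print(best_params)
```

When a grid mixes solvers and penalties, it can help to check that each combination is supported by the solver, since unsupported pairs fail to fit and contribute no usable score.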

Hyperparameter tuning for Logistic Regression

Set up metrics

# make these metrics scorers
accuracy_score_ = make_scorer(accuracy_score)
f1_score_ = make_scorer(f1_score)
precision_score_ = make_scorer(precision_score)
recall_ = make_scorer(recall_score)
# the F2 score weights recall more heavily than precision
f2_scorer_ = make_scorer(fbeta_score, beta=2)

scoring = {
    'accuracy': accuracy_score_,
    'f1_score': f1_score_,
    'precision': precision_score_,
    'f2_score': f2_scorer_,
    'recall': recall_
}

Parameters

# class weights that penalize errors on the positive (minority) class more heavily
weight3 = {0: 0.75, 1: 1.8}
class_weights = [weight3]
param_grid = [{'penalty': ['l1', 'l2'],
               'C': [0.001, 1, 10, 50, 80, 100], 'intercept_scaling': [1, 0.4, 5, 10],
               'max_iter': [100, 500, 1000, 8, 18], 'class_weight': class_weights,
               'solver': ['saga', 'liblinear', 'newton-cholesky'],
               'l1_ratio': np.arange(0, 1, 0.2), 'random_state': [126, 140, 156]}]

Creating an instance of GridSearchCV and training the model

# create a gridsearchcv to finetune the logistic regression model
logistic_grid_search = GridSearchCV(lgr_model,param_grid, scoring=scoring, cv=10, return_train_score=True, refit='accuracy')
# train the model
logistic_grid_search.fit(X_train_df, y_train)

GridSearchCV(cv=10, estimator=LogisticRegression(),
param_grid=[{'C': [0.001, 1, 10, 50, 80, 100],
'class_weight': [{0: 0.75, 1: 1.8}],
'intercept_scaling': [1, 0.4, 5, 10],
'l1_ratio': array([0. , 0.2, 0.4, 0.6, 0.8]),
'max_iter': [100, 500, 1000, 8, 18],
'penalty': ['l1', 'l2'],
'random_state': [126, 140, 156],
'solver': ['saga', 'liblinear', 'newton-cholesky']}],
refit='accuracy', return_train_score=True,
scoring={'accuracy': make_scorer(accuracy_score),
'f1_score': make_scorer(f1_score),
'f2_score': make_scorer(fbeta_score, beta=2),
'precision': make_scorer(precision_score),
'recall': make_scorer(recall_score)})

Get the best estimator

# get the best parameter from the grid
best_lgr = logistic_grid_search.best_estimator_
best_lgr


LogisticRegression(C=1, class_weight={0: 0.75, 1: 1.8}, l1_ratio=0.0,
max_iter=8, penalty='l1', random_state=140,
solver='liblinear')

Evaluating the performance of the best estimator

accuracy, f1, precision, recall, auc_score = evaluate_model(best_lgr, X_test_df, y_test)
accuracy, f1, precision, recall, auc_score

(0.7083333333333334, 0.6666666666666667, 0.5645161290322581, 0.813953488372093, 0.8091211114466929)

5.4 Comparing Fine-tuned Models

From the graph above, Logistic Regression had the highest F2 score. Although the model didn’t achieve the accuracy and precision targets in our success criteria, it is still our best model.

5.5 Save the model

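The model and the ColumnTransformer object were saved using pickle.

# Save the model and the ColumnTransformer
import pickle
filename = 'logistic_reg_class_model.pkl'
filename_2 = 'full_pipeline.pkl'
# use context managers so the files are closed after writing
with open(filename, 'wb') as model_file:
    pickle.dump(logistic_grid_search.best_estimator_, model_file)
# 'preprocessor' is the fitted ColumnTransformer created in section 4.1.7
with open(filename_2, 'wb') as pipeline_file:
    pickle.dump(preprocessor, pipeline_file)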
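Saved objects are only useful if they can be loaded back and produce the same predictions. The round trip below is a minimal sketch on a hypothetical one-feature model written to a temporary directory; the file name demo_model.pkl is made up for the example.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical tiny model for illustration only.
X_demo = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_demo = np.array([0, 0, 0, 1, 1, 1])
demo_model = LogisticRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'demo_model.pkl')
    # Write the fitted model to disk ...
    with open(model_path, 'wb') as f:
        pickle.dump(demo_model, f)
    # ... and read it back as a new object.
    with open(model_path, 'rb') as f:
        restored_model = pickle.load(f)

# The restored model should reproduce the original predictions.
same_predictions = bool(
    (restored_model.predict(X_demo) == demo_model.predict(X_demo)).all()
)
print(same_predictions)
```

Pickle files should only be loaded from trusted sources, and they are generally expected to be loaded with the same library versions that created them.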

6. Evaluation

6.1 Evaluating model performance against success criteria

The logistic regression model came out as the best model after fine-tuning.

The model performance falls short of the success criteria set for the project.

  1. The accuracy achieved is 70.83%, which is below the target of 75%.
  2. The F1 score is 0.667, indicating a moderate balance between precision and recall.
  3. The precision achieved is 0.564, which is below the target of 75%, indicating a relatively high number of false positives.
  4. The recall achieved is 0.814, exceeding the target of 75% and capturing a significant proportion of true positive cases.
  5. The AUC score is 0.809, indicating a good overall performance of the model.
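The reported scores can be tied back to concrete counts. The confusion matrix is not shown in the article, so the counts below are an inference: they are the values for a 120-row test set that reproduce the reported accuracy, precision, recall and F1 score. Treat them as a reconstruction rather than the author's output.

```python
# Counts inferred from the reported scores (an assumption, not shown in the article):
# 120 test rows, of which 43 are positive.
tp, fp, fn, tn = 35, 27, 8, 50

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * tp / (2 * tp + fp + fn)

# Targets from the success criteria in section 2.2.
targets = {'accuracy': 0.75, 'precision': 0.75, 'recall': 0.75}
achieved = {'accuracy': accuracy, 'precision': precision, 'recall': recall}
criteria_met = {name: achieved[name] >= target for name, target in targets.items()}

print(achieved)
print(criteria_met)
# → {'accuracy': False, 'precision': False, 'recall': True}
```

Under this reconstruction, 27 of the 62 positive predictions are false alarms, which is what pulls precision down, while only 8 of the 43 actual sepsis cases are missed.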

7. Model Interpretation

Gaining insights into how a model makes predictions and understanding its decision-making process is crucial for model interpretation. By analyzing feature importance, feature effects, and model explanations, we can gain valuable insights into the inner workings of the model.

For sample data



import shap

shap.initjs()

explainer = shap.Explainer(best_lgr.predict, X_train_df)
shap_values = explainer(X_test_df)

def features_explainer(index, type='condensed'):
    # return a force plot ('condensed') or a waterfall plot for one row of the test data
    if type == 'condensed':
        return shap.plots.force(shap_values[index])
    elif type == 'waterfall':
        return shap.plots.waterfall(shap_values[index])
    else:
        return 'Select a valid type of plot: "condensed" or "waterfall"'

features_explainer(0, type='waterfall')

From the SHAP plots, the most important feature for predicting the first sample/row of the test data is Blood Work Result-1, followed by the x3_0–30 category (BMI values between 0 and 30).

For Test Data

# Let's plot the feature importance for the logistic regression model
shap.plots.bar(shap_values)

Based on the predictions of the entire test data, the top five important features for predicting the test data are Blood work results-1, x3_0–30 (bmi values between 0 and 30), Blood work results-1, x3_30–60, and Insurance_1 in descending order of importance.

Limitations and Future directions

Future Directions for the Project:

1. Addressing Data Imbalance:

  • Employ techniques like oversampling or undersampling to tackle the data imbalance issue.

2. Exploring Alternative Algorithms:

  • Explore different machine learning algorithms to find the ones that better capture the complexity of sepsis prediction.
  • Fine-tune the parameters of these algorithms to optimize their performance.

3. Feature Engineering and Incorporating Domain Knowledge:

  • Engage in further feature engineering to identify and engineer new features that have a strong relationship with sepsis.
  • Incorporate domain knowledge to ensure the inclusion of relevant features that contribute to sepsis prediction.

4. Incorporating Additional Relevant Features:

  • Include additional features that can enhance the predictive power of the model.
  • Identify and incorporate features that align with domain expertise and have a significant impact on sepsis prediction.
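As an illustration of the first direction, random oversampling of the minority class can be sketched with sklearn.utils.resample. The toy frame below is hypothetical, and this is only one of several possible techniques; it is not something the article applied.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical toy training data: 3 positives, 6 negatives.
toy_train = pd.DataFrame({
    'Blood Work Result-1': [148, 85, 183, 89, 137, 116, 78, 115, 197],
    'Sepssis': [1, 0, 1, 0, 1, 0, 0, 0, 0],
})

majority = toy_train[toy_train['Sepssis'] == 0]
minority = toy_train[toy_train['Sepssis'] == 1]

# Sample the minority class with replacement until it matches the majority size.
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)

# Combine and shuffle the rows.
balanced_train = pd.concat([majority, minority_upsampled]).sample(frac=1, random_state=42)
class_counts = balanced_train['Sepssis'].value_counts()
print(class_counts)
```

Resampling of this kind is normally applied to the training split only, so that the test data still reflects the original class proportions.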

By addressing these limitations and pursuing these future directions, the project can make significant advancements in predicting sepsis. This, in turn, will lead to improved patient care and outcomes in the context of sepsis management.

Resources

Scikit-Learn Documentation

You can find the original project on GitHub. All feedback and corrections are welcome.
