# Sepsis Prediction in Patients: Insights from Machine Learning and Clinical Data

## 1. Introduction

Sepsis, a critical condition in ICUs, poses a significant threat with dysregulated immune response and high mortality rates. Early detection is crucial, and integrating machine learning and clinical data analysis shows promise in predicting sepsis onset. Machine learning algorithms leverage clinical data to identify subtle indicators preceding sepsis manifestation. By harnessing these capabilities, healthcare professionals can potentially intervene earlier, mitigating the severity of sepsis. The objective of this article is to explore sepsis prediction in ICUs, focusing on insights from machine learning and clinical data analysis, and their implications for patient care.

## 2. Business Understanding

This project aims to develop a sepsis prediction model using machine learning techniques, leveraging clinical data analysis, to improve early detection and intervention in intensive care units, ultimately enhancing patient outcomes and healthcare management.

**2.1 Defining the business problem**

The business problem addressed in this project is the accurate and early prediction of sepsis in ICUs. Timely detection of sepsis is vital for effective intervention and improved patient outcomes. By developing a reliable sepsis prediction model, healthcare professionals can proactively identify at-risk patients and initiate appropriate treatments, potentially reducing mortality rates and optimizing resource allocation in ICUs.

**2.2 Project Objectives and Success Criteria**

The following are the Project Objectives for this project:

- Develop a sepsis prediction model using machine learning techniques.
- Validate the model’s performance using appropriate evaluation metrics.
- Promote the adoption of machine learning and clinical data analysis in sepsis prediction.

Success Criteria:

The success criteria for the project will be determined as follows:

- The sepsis prediction machine learning model should achieve a high accuracy rate, with a target accuracy of at least 75%, to effectively predict sepsis in ICU patients.
- The model should demonstrate a high precision rate in identifying patients at risk of sepsis, with a target precision of at least 75%, ensuring that the majority of predicted cases are true positives and minimizing false positives.
- The model should exhibit a high recall rate in identifying patients who have actually developed sepsis, with a target recall of 75–80% depending on how balanced the data is, minimizing false negatives and capturing a significant proportion of true positive cases.
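These criteria can be checked directly with scikit-learn's metric functions. A minimal sketch with hypothetical labels and predictions (the values below are illustrative, not from the project):

```python
# Sketch: checking the success criteria with scikit-learn metrics.
# y_true and y_pred are hypothetical labels (1 = sepsis, 0 = no sepsis).
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)    # 6 of 8 correct = 0.75
precision = precision_score(y_true, y_pred)  # 3 of 4 predicted positives correct = 0.75
recall = recall_score(y_true, y_pred)        # 3 of 4 actual positives found = 0.75

meets_criteria = accuracy >= 0.75 and precision >= 0.75 and recall >= 0.75
print(accuracy, precision, recall, meets_criteria)
```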

**Assumptions**

- It was assumed that the blood pressure used was the diastolic type.

**2.3 Hypothesis**

**Null Hypothesis**: There is no relationship between high Body Mass Index and sepsis.

**Alternate Hypothesis**: There is a relationship between high Body Mass Index and sepsis.

**2.4 Questions**

The following are the questions I asked about the data:

- How many patients are underweight, at a healthy weight, overweight, obese, or severely obese?
- What is the distribution of ages for patients captured in the data?
- How many patients fall under the categories of Normal, Elevated, and High Blood Pressure?
- Is Body Mass Index affected by Age?
- Is Blood Pressure affected by Age?
- What is the relationship between Age and Body Mass Index?
- How many patients have a tendency to develop sepsis? Which age group is more prone to developing sepsis?
- Does having insurance affect patients’ chances of developing sepsis?
- Is body mass directly correlated with a patient’s tendency to develop sepsis?
- Are the blood parameters associated with sepsis?

## 3. Data Understanding

During this phase, we analyze and explore the available clinical data to gain insights into the variables and their relationships, enabling us to better understand the data’s characteristics and potential for sepsis prediction.

**3.1 Data Collection and Sources**

The data used in this study was generously provided by The Johns Hopkins University, a renowned institution located at Johns Hopkins Road, Laurel, MD 20707. The dataset is a modified version of a publicly available data source, and its usage is subject to copyright restrictions.

**3.2 Description of the data used in the study**

The clinical data used in this study consists of various attributes related to patients in an intensive care unit (ICU). These attributes provide valuable insights into the patients’ health status and help in predicting the onset of sepsis, a critical condition that requires timely intervention.

The dataset includes the following attributes:

- **ID**: Each patient is assigned a unique identification number, allowing for individual tracking and analysis.
- **PRG** (Plasma glucose): The plasma glucose levels of the patients. Glucose levels can serve as an indicator of metabolic health and provide insights into the patients’ overall condition.
- **PL** (Blood Work Result-1): The first blood work result, measured in mu U/ml. Blood work results are essential for evaluating the patients’ biochemical profiles and identifying any abnormalities.
- **PR** (Blood Pressure): The blood pressure of patients, measured in mm Hg. Blood pressure is a crucial vital sign that can indicate the patients’ cardiovascular health and potential risks.
- **SK** (Blood Work Result-2): The second blood work result, measured in mm. Similar to Blood Work Result-1, this attribute provides additional information about the patients’ blood chemistry and overall health.
- **TS** (Blood Work Result-3): The third blood work result, measured in mu U/ml. Blood work results, especially when analyzed collectively, can offer insights into the patients’ organ function and potential abnormalities.
- **M11** (Body mass index): Body mass index (BMI) is a measure of weight in relation to height and provides an indication of patients’ body composition and potential risks associated with obesity or malnutrition.
- **BD2** (Blood Work Result-4): The fourth blood work result, measured in mu U/ml. Including multiple blood work results allows for a more comprehensive assessment of the patients’ physiological state.
- **Age**: The age of the patients, measured in years, provides information about their demographic characteristics and potential age-related factors that may contribute to sepsis risk.
- **Insurance**: Indicates whether a patient holds a valid insurance card, which can be relevant in analyzing the influence of insurance coverage on sepsis prediction and access to healthcare resources.
- **Sepsis**: The target attribute, representing whether a patient in the ICU will develop sepsis (Positive) or not (Negative). This attribute is the main focus of the study and serves as the basis for sepsis prediction.

**3.3 Exploratory Data Analysis**

The Exploratory Data Analysis (EDA) phase is a crucial step in understanding the dataset and gaining insights into the variables and their relationships. In this phase, we will thoroughly examine the data, visualize its patterns and distributions, and uncover any interesting trends or anomalies. By performing EDA, we aim to discover key features, identify potential data issues, and make informed decisions about data preprocessing and modeling strategies. This comprehensive analysis will lay the foundation for further exploration and the development of accurate and effective predictive models for sepsis prediction in intensive care units.

*3.3.1 Overview of Dataset*

Importing Packages

Let’s import the libraries that will be used for the project.

```python
# Data handling
import pandas as pd
import numpy as np

# Visualisation (Matplotlib, Plotly, Seaborn, etc.)
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px

# Feature processing (Scikit-learn processing, etc.)
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.metrics import confusion_matrix, classification_report, f1_score, accuracy_score, \
    precision_score, recall_score, fbeta_score, make_scorer, roc_auc_score
from sklearn.model_selection import train_test_split, GridSearchCV
from skopt import BayesSearchCV
from sklearn.utils import class_weight

# Models
from sklearn import svm
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

import pickle
import os
import warnings

warnings.filterwarnings("ignore")
```

Loading dataset

```python
# For CSV, use pandas.read_csv
df = pd.read_csv('../datasets/Paitients_Files_Train.csv')
df.head(5)
```

Check general properties

```python
# display the datatypes
df.dtypes
```

```
ID            object
PRG            int64
PL             int64
PR             int64
SK             int64
TS             int64
M11          float64
BD2          float64
Age            int64
Insurance      int64
Sepssis       object
dtype: object
```

Check the shape of the data

```python
# check the shape of the data
df.shape
```

```
(599, 11)
```

Check for null values in the data

```python
# check for null values
df.isnull().sum()
```

```
ID           0
PRG          0
PL           0
PR           0
SK           0
TS           0
M11          0
BD2          0
Age          0
Insurance    0
Sepssis      0
dtype: int64
```

Check for column names

```python
# check columns
df.columns
```

```
Index(['ID', 'PRG', 'PL', 'PR', 'SK', 'TS', 'M11', 'BD2', 'Age', 'Insurance',
       'Sepssis'],
      dtype='object')
```

Check for duplicates

```python
# check the number of duplicates
df.duplicated().sum()
```

```
0
```

Distribution patterns for numerical columns

The distributions of Blood Work Result-4, Blood Work Result-2, Age, and Blood Work Result-3 are skewed to the right. Blood Work Result-1, Body Mass Index, and Blood Pressure have nearly symmetrical distributions.

Class balance

The dataset is imbalanced: the number of negative sepsis cases is almost twice the number of positive cases.
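The imbalance can be quantified with `value_counts`. A minimal sketch on a toy frame with the same column name (the exact counts below are illustrative, chosen only to mirror the roughly 2:1 split described above):

```python
# Sketch: inspecting class balance on a toy frame shaped like this dataset
# (column name 'Sepssis' matches the source; the counts are illustrative).
import pandas as pd

toy = pd.DataFrame({'Sepssis': ['Negative'] * 390 + ['Positive'] * 209})

counts = toy['Sepssis'].value_counts()
ratios = toy['Sepssis'].value_counts(normalize=True)
print(counts)
print(ratios.round(2))  # roughly Negative 0.65, Positive 0.35
```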

**3.3.2 Data quality**

Data quality is essential for reliable results; thorough examination of the dataset for missing values, outliers, and inconsistencies ensures optimal performance of prediction models, maintaining integrity and validity.

The following observations were made about data quality:

- The **Insurance** column is stored as a numeric data type instead of a *category/object* type.
- Column names are not descriptive.

**3.3.3 Hypothesis testing**

```python
import scipy.stats as stats

# Select the BMI and sepsis columns from the dataset
bmi = df['Body Mass Index']
sepsis = (df['Sepssis'] == 'Positive').astype(bool).astype(int)

# Perform correlation analysis
correlation, p_value = stats.pearsonr(bmi, sepsis)

# Print the correlation coefficient and p-value
print("Correlation coefficient:", correlation)
print("P-value:", p_value)

if p_value > 0.05:
    print('Fail to reject the null hypothesis.')
else:
    print('Reject the null hypothesis')
```

```
Correlation coefficient: 0.31589377926855083
P-value: 2.3972519626653513e-15
Reject the null hypothesis
```

Since the p-value is far below 0.05, we reject the null hypothesis: there is a statistically significant positive correlation between body mass index and sepsis.

**3.3.4 Visualize and Analyze data**

We will answer the questions raised about our dataset.

Rename column names to descriptive names

```python
# rename columns to descriptive names
df = df.rename(columns={'PRG': 'Plasma Glucose', 'PL': 'Blood Work Result-1',
                        'PR': 'Blood Pressure', 'SK': 'Blood Work Result-2',
                        'TS': 'Blood Work Result-3', 'M11': 'Body Mass Index',
                        'BD2': 'Blood Work Result-4'})
```

Based on the assumptions made, the following functions were used to categorize the Blood Pressure and Body Mass Index values.

```python
# function to create a new column of BMI ranges
def create_bmi_range(row):
    if row['Body Mass Index'] <= 18.5:
        return 'Under Weight'
    elif (row['Body Mass Index'] > 18.5) and (row['Body Mass Index'] <= 24.9):
        return 'Healthy Weight'
    elif (row['Body Mass Index'] > 24.9) and (row['Body Mass Index'] <= 29.9):
        return 'Over Weight'
    elif (row['Body Mass Index'] > 29.9) and (row['Body Mass Index'] < 40):
        return 'Obesity'
    elif row['Body Mass Index'] >= 40:
        return 'Severe Obesity'

# function to create a new column of blood pressure ranges
def blood_pressure_ranges(row):
    if row['Blood Pressure'] < 80:
        return 'normal'
    elif (row['Blood Pressure'] >= 80) and (row['Blood Pressure'] <= 89):
        return 'elevated'
    elif row['Blood Pressure'] >= 90:
        return 'high'
```

**How many patients are underweight, healthy weight, overweight, obese, or severely obese?**

- Based on the graph above, it can be observed that the majority of patients fall under the obesity category. The next highest category is overweight, followed by the healthy weight category. The underweight category has the lowest number of patients.

**Is body mass directly correlated with a patient's tendency to get sepsis?**

- Based on the graph above, it can be observed that a higher proportion of patients with obesity and severe obesity had sepsis compared to patients within the healthy weight and underweight categories. This suggests a potential association between body mass ranges and the likelihood of developing sepsis.

**How do Blood Pressure and Plasma Glucose affect Sepsis?**

- For both positive and negative sepsis cases, there is almost no correlation between Plasma Glucose and Blood Pressure.

*“Refer to notebook for the rest of the questions.”*

## 4. Data Preparation

In this section we will identify and handle missing or erroneous data, remove duplicates, and convert data types as necessary. Proper data preparation is critical for accurate and meaningful analysis, as the quality of the output depends heavily on the quality of the input data.

**4.1 Data preprocessing**

Data preprocessing techniques are applied to prepare the dataset for analysis, including handling missing values, normalizing features, encoding variables, and scaling the data. This involves examining statistical properties, identifying outliers, and evaluating data accuracy and completeness. By understanding and addressing data quality issues, we ensure the accuracy and reliability of the model.

*4.1.1 Converting Data Types*

- The **Insurance** column was converted into a category data type.
- The **ID** column was dropped.
- The **Sepsis** column, which contains the target values for the machine learning model, was changed from text to numerical values: “*Positive*” was replaced with 1 and “*Negative*” with 0.

```python
# Drop the ID column
df_ = df.drop(columns=['ID'])

# convert the Insurance column into a categorical column
df_['Insurance'] = df['Insurance'].astype('category')

# change values in the Sepsis column into numerical data
df_['Sepssis'] = (df_['Sepssis'] == 'Positive').astype(bool).astype(int)
df_['Sepssis'].unique()
```

```
array([1, 0])
```

*4.1.2 Splitting Data into Train and Test*

The data was split into train and test sets: 80 percent of the data will be used to train our machine learning models and 20 percent will be used to evaluate them.

Split data into 80% train and 20% test

```python
# split data into 80% train and 20% test, with a fixed random_state for reproducibility
train, test = train_test_split(df_, test_size=0.2, random_state=42)
print(f'Train: {train.shape}, Test: {test.shape}')
```

```
Train: (479, 10), Test: (120, 10)
```

The target feature and train features were separated.

```python
# create features and targets from the train data
X_train = train.drop(columns=['Sepssis'])
y_train = train['Sepssis'].copy()

# create features and targets from the test data
X_test = test.drop(columns=['Sepssis'])
y_test = test['Sepssis'].copy()
```

*4.1.3 Feature Processing and Feature Engineering*

**Feature processing** involves the manipulation or transformation of individual features in the dataset, applying various techniques to modify the raw features or derive new features from the existing ones.

Create new features

```python
# get the product of all the numerical columns
X_train['All-Product'] = X_train['Blood Work Result-4'] * X_train['Blood Work Result-1'] * \
    X_train['Blood Work Result-2'] * X_train['Blood Work Result-3'] * X_train['Plasma Glucose'] * \
    X_train['Blood Pressure'] * X_train['Age'] * X_train['Body Mass Index']
```

- The code calculates the product of all the numerical columns and creates a new column called ‘All-Product’ in the training and testing datasets.

**Feature engineering** involves creating new features or transforming existing features to enhance the predictive power of the machine learning models. The aim is to capture relevant information and patterns from the data that may improve the model’s performance and predictive accuracy.

Create new features

```python
# get the categories from the product of all numerical features
blood_max = X_train['All-Product'].max()
bin_max = 3500000000000

# create a new column 'All-Product_range'
all_labels = ['{0}-{1}'.format(i, i + 500000000000) for i in range(0, round(blood_max), 500000000000)]
X_train['All-Product_range'] = pd.cut(X_train['All-Product'], bins=range(0, bin_max, 500000000000),
                                      right=False, labels=all_labels)
print(all_labels)

# get the min and max of the ages
age_min = df['Age'].min()
age_max = df['Age'].max()

# create a new column 'Age Group'
age_labels = ['{0}-{1}'.format(i, i + 20) for i in range(0, age_max, 20)]
X_train['Age Group'] = pd.cut(X_train['Age'], bins=range(0, 120, 20), right=False, labels=age_labels)
print(age_labels)

# get the max of the BMI and create a new column 'BMI_range'
bmi_max = df['Body Mass Index'].max()
labels = ['{0}-{1}'.format(i, i + 30) for i in range(0, round(bmi_max), 30)]
X_train['BMI_range'] = pd.cut(X_train['Body Mass Index'], bins=range(0, 120, 30), right=False, labels=labels)
print(labels)
print(bmi_max)

# get the max of blood pressure and create a new column 'BP_range'
bp_max = df['Blood Pressure'].max()
labels = ['{0}-{1}'.format(i, i + 50) for i in range(0, round(bp_max), 50)]
X_train['BP_range'] = pd.cut(X_train['Blood Pressure'], bins=range(0, 200, 50), right=False, labels=labels)
X_test['BP_range'] = pd.cut(X_test['Blood Pressure'], bins=range(0, 200, 50), right=False, labels=labels)
print(labels)

# get the max of plasma glucose and create a new column 'PG_range'
pg_max = df['Plasma Glucose'].max()
labels = ['{0}-{1}'.format(i, i + 7) for i in range(0, round(pg_max), 7)]
X_train['PG_range'] = pd.cut(X_train['Plasma Glucose'], bins=range(0, 28, 7), right=False, labels=labels)
```

- The code creates a new column called ‘All-Product_range’ by categorizing the ‘All-Product’ values into bins based on a specified range.
- The code creates a new column called ‘Age Group’ by categorizing the ‘Age’ values into bins based on a specified range.
- Another new column called ‘BMI_range’ is created by categorizing the ‘Body Mass Index’ values into bins based on a specified range.
- Similarly, a new column called ‘BP_range’ is created by categorizing the ‘Blood Pressure’ values into bins based on a specified range.
- Finally, a new column called ‘PG_range’ is created by categorizing the ‘Plasma Glucose’ values into bins based on a specified range.

*4.1.4 Handling missing values*

The missing values in both numerical and categorical features were imputed using the SimpleImputer class from the sklearn.impute module.

- Let’s separate the numerical columns from the categorical columns.
- The missing values in both numerical and categorical columns were imputed with the most frequent value.
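Before wiring the imputers into pipelines, here is a minimal sketch of how SimpleImputer fills gaps (toy arrays, not the project’s data):

```python
# Sketch: SimpleImputer fills missing values column by column.
import numpy as np
from sklearn.impute import SimpleImputer

# numerical column: the NaN is replaced with the column mean, (1 + 3 + 4) / 3
num = np.array([[1.0], [np.nan], [3.0], [4.0]])
mean_imputer = SimpleImputer(strategy="mean")
print(mean_imputer.fit_transform(num).ravel())

# categorical column: the NaN is replaced with the most frequent value, 'a'
cat = np.array([['a'], ['b'], ['a'], [np.nan]], dtype=object)
mode_imputer = SimpleImputer(strategy="most_frequent")
print(mode_imputer.fit_transform(cat).ravel())
```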

```python
# select the categorical columns from train and test data for encoding
train_cat_cols = X_train.select_dtypes(include=['object', 'category']).columns
test_cat_cols = X_test.select_dtypes(include=['object', 'category']).columns

# confirm the train categorical columns are the same as the test categorical columns
train_cat_cols == test_cat_cols
```

```
array([ True,  True,  True,  True,  True,  True])
```

```python
# impute the numerical columns
num = SimpleImputer(strategy="mean")

# impute the categorical columns
cat = SimpleImputer(strategy="most_frequent")
```

*4.1.5 Handling Categorical columns*

The categorical columns were encoded using the OneHotEncoder method. This encoding technique transforms categorical variables into binary vectors, allowing them to be included in the machine learning model’s numerical calculations.

```python
cat_encoder = OneHotEncoder()
```
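As a quick illustration of the binary vectors OneHotEncoder produces, here is a minimal sketch on a toy BMI-range column (not the project’s data; the categories are ordered alphabetically by the encoder):

```python
# Sketch: OneHotEncoder turns each category into a binary indicator column.
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()
bmi_range = [['Under Weight'], ['Obesity'], ['Healthy Weight'], ['Obesity']]
encoded = enc.fit_transform(bmi_range).toarray()

print(enc.categories_)  # ['Healthy Weight', 'Obesity', 'Under Weight']
print(encoded)          # each row is a binary vector with a single 1
```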

*4.1.6 Feature Scaling*

To standardize the numerical features and bring them to a common scale, the StandardScaler method will be applied.

```python
# scale the numerical features
num_scaler = StandardScaler()
```
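A minimal sketch of the standardization StandardScaler applies, z = (x − mean) / std, on toy ages (not the project’s data):

```python
# Sketch: StandardScaler transforms each column to mean 0 and unit variance.
import numpy as np
from sklearn.preprocessing import StandardScaler

ages = np.array([[20.0], [40.0], [60.0]])
scaled = StandardScaler().fit_transform(ages)
print(scaled.ravel())  # approximately [-1.2247, 0.0, 1.2247]
```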

*4.1.7 Pipelines and ColumnTransformers*

- Separate pipelines were created for the categorical features and the numerical features.
- Both pipelines were then combined in a ColumnTransformer to create a preprocessor pipeline.
- The preprocessor pipeline was used to fit-transform the train data and transform the test data.
- The result was a NumPy array, which was then converted into a pandas DataFrame.

```python
# create variables to hold the numerical and categorical columns
num_attribs = list(train_num_cols)
cat_attribs = list(train_cat_cols)

# create a numerical pipeline to impute the missing values and standardize the numerical columns
num_pipeline = Pipeline([('imputer', SimpleImputer(strategy="most_frequent")),
                         ('std_scaler', StandardScaler())])

# create a categorical pipeline to impute the missing values and encode the categorical columns
cat_pipeline = Pipeline([('imputer', SimpleImputer(strategy="most_frequent")),
                         ('cat_encoder', OneHotEncoder(handle_unknown='ignore'))])

# create a full preprocessor by combining the numerical and categorical pipelines
preprocessor = ColumnTransformer([("numerical", num_pipeline, num_attribs),
                                  ("categorical", cat_pipeline, cat_attribs)])

# use the created preprocessor to transform the train features
X_train_prepared = preprocessor.fit_transform(X_train)
```

Note: Methods applied on the train data were also applied on the test data.

**4.2 Feature selection**

Feature selection is crucial for effective prediction models; by identifying the most relevant features and leveraging domain knowledge, we can improve accuracy, reduce complexity, and focus on key predictors of sepsis in the dataset.

Several columns were dropped from the training and testing datasets: **‘Blood Pressure’, ‘Age’, ‘Body Mass Index’, ‘Plasma Glucose’, ‘All-Product’, ‘Blood Work Result-3’, and ‘Blood Work Result-2’**. **‘Blood Work Result-3’** and **‘Blood Work Result-2’** had very low correlation with the target.

```python
df_corr = df_.corr()
df_corr['Sepssis'].sort_values(ascending=False)
```

```
Sepssis                1.000000
Blood Work Result-1    0.449719
Body Mass Index        0.315894
Age                    0.210234
Plasma Glucose         0.207115
Blood Work Result-4    0.181561
Blood Work Result-3    0.145892
Blood Work Result-2    0.075585
Blood Pressure         0.061086
Name: Sepssis, dtype: float64
```

**‘Blood Pressure’, ‘Age’, ‘Body Mass Index’, ‘Plasma Glucose’, and ‘All-Product’** were replaced with their categorical features.

```python
# drop columns
X_train = X_train.drop(columns=['Blood Pressure', 'Age', 'Body Mass Index', 'Plasma Glucose',
                                'All-Product', 'Blood Work Result-3', 'Blood Work Result-2'])
X_test = X_test.drop(columns=['Blood Pressure', 'Age', 'Body Mass Index', 'Plasma Glucose',
                              'All-Product', 'Blood Work Result-3', 'Blood Work Result-2'])
```

## 5. Modeling

In this section, we aim to explore and evaluate machine learning algorithms for sepsis prediction. Our goal is to select the most effective model by assessing metrics such as accuracy, precision, area under curve and recall. This process ensures the deployment of a model that accurately detects sepsis and improves patient outcomes.

**5.1 Building and validating models**

The following are steps to be taken to build and evaluate the models:

- Instantiate the classifier.
- Train a new model.
- Check the performance on test data. Get the accuracy, recall and area under curve scores.
- Save the accuracy, recall, precision, and area under curve scores in a dataframe.

Let’s create a function that evaluates a model on the test data and returns the accuracy, F1, precision, recall, and AUC scores.

```python
def evaluate_model(model, x_test, y_test):
    pred = model.predict(x_test)
    auc_score = roc_auc_score(y_test, model.predict_proba(x_test)[:, 1])
    accuracy = accuracy_score(y_test, pred)
    f1 = f1_score(y_test, pred)
    precision = precision_score(y_test, pred)
    recall = recall_score(y_test, pred)
    return accuracy, f1, precision, recall, auc_score
```

**Logistic Regression Classifier**

Create Model

```python
# create model
lgr_model = LogisticRegression()
```

Train Model

```python
# use the .fit method to train the model
lgr_model.fit(X_train_df, y_train)
```

```
LogisticRegression()
```

Evaluate Model

```python
accuracy, f1, precision, recall, auc_score = evaluate_model(lgr_model, X_test_df, y_test)
accuracy, f1, precision, recall, auc_score
```

```
(0.825, 0.7469879518072289, 0.775, 0.7209302325581395, 0.8091211114466929)
```

**Random Forest Classifier**

Create Model

```python
# create a random forest model
rfc = RandomForestClassifier()
```

Train Model

```python
# use the .fit method to train the model
rfc.fit(X_train_df, y_train)
```

```
RandomForestClassifier()
```

Evaluate Model

```python
accuracy, f1, precision, recall, auc_score = evaluate_model(rfc, X_test_df, y_test)
accuracy, f1, precision, recall, auc_score
```

```
(0.7083333333333334, 0.6153846153846155, 0.5833333333333334, 0.6511627906976745, 0.7680459075807913)
```

**5.2 Comparing the Models**

*Definition of Metrics*

**Accuracy**: Accuracy measures the overall correctness of the model’s predictions by calculating the ratio of correctly predicted instances to the total number of instances.

**F1 Score**: The F1 score is a measure of a model’s accuracy, combining both precision and recall. It considers both false positives and false negatives and provides a single metric that balances both precision and recall.

**Precision**: Precision measures the proportion of correctly predicted positive instances out of the total instances predicted as positive. It focuses on the model’s ability to minimize false positives, providing insights into the model’s precision in identifying true positive cases.

**Recall**: Recall, also known as sensitivity or true positive rate, measures the proportion of correctly predicted positive instances out of the total actual positive instances. It emphasizes the model’s ability to minimize false negatives, providing insights into the model’s ability to identify all positive cases.

**AUC Score**: The Area Under the Curve (AUC) score is a performance metric used for binary classification models. It represents the model’s ability to distinguish between positive and negative instances by calculating the area under the receiver operating characteristic (ROC) curve. A higher AUC score indicates better performance in terms of distinguishing between classes.
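As a minimal sketch, `roc_auc_score` computes this directly from true labels and predicted probabilities (the values below are illustrative, not from the project):

```python
# Sketch: AUC from toy labels and predicted probabilities.
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]  # hypothetical predicted probabilities of the positive class

# 3 of the 4 (negative, positive) pairs are ranked correctly -> AUC = 0.75
print(roc_auc_score(y_true, y_score))
```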

*Interpretation of results*

In the results table, different machine learning algorithms are evaluated based on these metrics. The Logistic Regression Classifier achieved an accuracy of **0.825**, an F1 score of **0.747**, precision of **0.775**, recall of **0.721**, and an AUC score of **0.809**. The Support Vector and Random Forest Classifiers also performed well. These three models will be fine-tuned and compared.

**5.3 Tuning Model Parameters for Better Performance**

The selected models for hyperparameter tuning are Logistic Regression, Support Vector Machine and Random Forest classifier models.

- Create a dictionary of hyperparameters.
- Provide the hyperparameter dictionary and the chosen model in GridSearchCV.
- Fit the Grid object on training data and get the results of GridSearchCV.
- Check the performance on test data. Get the accuracy, recall score and area under curve scores for test data.
- Save the accuracy, recall, precision, F1 score, and area under curve in a dataframe.

**Hyperparameter tuning for Logistic Regression**

Set up metrics

```python
# make these metrics scorers
accuracy_score_ = make_scorer(accuracy_score)
f1_score_ = make_scorer(f1_score)
precision_score_ = make_scorer(precision_score)
recall_ = make_scorer(recall_score)
f2_scorer_ = make_scorer(fbeta_score, beta=2)

scoring = {
    'accuracy': accuracy_score_,
    'f1_score': f1_score_,
    'precision': precision_score_,
    'f2_score': f2_scorer_,
    'recall': recall_
}
```

Parameters

```python
# class weights chosen to counter the class imbalance
weight3 = {0: 0.75, 1: 1.8}
class_weights = [weight3]

param_grid = [{'penalty': ['l1', 'l2'],
               'C': [0.001, 1, 10, 50, 80, 100], 'intercept_scaling': [1, 0.4, 5, 10],
               'max_iter': [100, 500, 1000, 8, 18], 'class_weight': class_weights,
               'solver': ['saga', 'liblinear', 'newton-cholesky'],
               'l1_ratio': np.arange(0, 1, 0.2), 'random_state': [126, 140, 156]}]
```

Creating an instance of GridSearchCV and training the model

```python
# create a GridSearchCV to fine-tune the logistic regression model
logistic_grid_search = GridSearchCV(lgr_model, param_grid, scoring=scoring, cv=10,
                                    return_train_score=True, refit='accuracy')

# train the model
logistic_grid_search.fit(X_train_df, y_train)
```

```
GridSearchCV(cv=10, estimator=LogisticRegression(),
             param_grid=[{'C': [0.001, 1, 10, 50, 80, 100],
                          'class_weight': [{0: 0.75, 1: 1.8}],
                          'intercept_scaling': [1, 0.4, 5, 10],
                          'l1_ratio': array([0. , 0.2, 0.4, 0.6, 0.8]),
                          'max_iter': [100, 500, 1000, 8, 18],
                          'penalty': ['l1', 'l2'],
                          'random_state': [126, 140, 156],
                          'solver': ['saga', 'liblinear', 'newton-cholesky']}],
             refit='accuracy', return_train_score=True,
             scoring={'accuracy': make_scorer(accuracy_score),
                      'f1_score': make_scorer(f1_score),
                      'f2_score': make_scorer(fbeta_score, beta=2),
                      'precision': make_scorer(precision_score),
                      'recall': make_scorer(recall_score)})
```

Get the best estimator

```python
# get the best estimator from the grid
best_lgr = logistic_grid_search.best_estimator_
best_lgr
```

```
LogisticRegression(C=1, class_weight={0: 0.75, 1: 1.8}, l1_ratio=0.0,
                   max_iter=8, penalty='l1', random_state=140,
                   solver='liblinear')
```

Evaluating the performance of the best estimator

```python
accuracy, f1, precision, recall, auc_score = evaluate_model(best_lgr, X_test_df, y_test)
accuracy, f1, precision, recall, auc_score
```

```
(0.7083333333333334, 0.6666666666666667, 0.5645161290322581, 0.813953488372093, 0.8091211114466929)
```

**5.4 Comparing Fine-tuned Models**

From the graph above, the Logistic Regression model had the highest F2 score. Although the model didn’t achieve the accuracy and precision targets in our success criteria, it is still our best model.
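The F2 score used here is the F-beta score with beta = 2, which weights recall more heavily than precision. A quick sketch of the formula, using the tuned model’s reported precision (0.5645) and recall (0.8140) from the evaluation above:

```python
# Sketch: the F-beta formula, with beta = 2 emphasizing recall over precision.
def fbeta(precision, recall, beta=2.0):
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

p, r = 0.5645, 0.8140  # tuned logistic regression's precision and recall
print(round(fbeta(p, r), 3))  # about 0.748
```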

**5.5 Save the model**

The model and the ColumnTransformer object were saved using pickle.

```python
# Save the model and the ColumnTransformer
import pickle

filename = 'logistic_reg_class_model.pkl'
filename_2 = 'full_pipeline.pkl'

pickle.dump(logistic_grid_search.best_estimator_, open(filename, 'wb'))
pickle.dump(preprocessor, open(filename_2, 'wb'))
```

## 6. Evaluation

**6.1 Evaluating model performance against success criteria**

The logistic regression model came out as the best model after fine-tuning.

The model performance falls short of the success criteria set for the project.

- The accuracy achieved is 70.83%, which is below the target of 75%.
- The F1 score is 0.667, indicating a moderate balance between precision and recall.
- The precision achieved is 0.564, which is below the target of 75%, indicating a relatively high number of false positives.
- The recall achieved is 0.814, exceeding the target of 75% and capturing a significant proportion of true positive cases.
- The AUC score is 0.809, indicating a good overall performance of the model.

## 7. Model Interpretation

Gaining insights into how a model makes predictions and understanding its decision-making process is crucial for model interpretation. By analyzing feature importance, feature effects, and model explanations, we can gain valuable insights into the inner workings of the model.

*For sample data*

```python
import shap

shap.initjs()
explainer = shap.Explainer(best_lgr.predict, X_train_df)
shap_values = explainer(X_test_df)
```

```python
def features_explainer(index, type='condensed'):
    if type == 'condensed':
        return shap.plots.force(shap_values[index])
    elif type == 'waterfall':
        return shap.plots.waterfall(shap_values[index])
    else:
        return 'Select a valid type of plot: "condensed" or "waterfall"'

features_explainer(0, type='waterfall')
```

From the SHAP plots, the most important feature for predicting the first sample/row of the test data is **Blood Work Result-1**, followed by the **x_0–30** category (BMI values between 0 and 30).

*For Test Data*

```python
# plot the feature importance for the logistic regression model
shap.plots.bar(shap_values)
```

Based on the predictions over the entire test data, the top important features are Blood Work Result-1, x3_0–30 (BMI values between 0 and 30), x3_30–60, and Insurance_1, in descending order of importance.

## Limitations and Future directions

Future Directions for the Project:

1. Addressing Data Imbalance:

- Employ techniques like oversampling or undersampling to tackle the data imbalance issue.

2. Exploring Alternative Algorithms:

- Explore different machine learning algorithms to find the ones that better capture the complexity of sepsis prediction.
- Fine-tune the parameters of these algorithms to optimize their performance.

3. Feature Engineering and Incorporating Domain Knowledge:

- Engage in further feature engineering to identify and engineer new features that have a strong relationship with sepsis.
- Incorporate domain knowledge to ensure the inclusion of relevant features that contribute to sepsis prediction.

4. Incorporating Additional Relevant Features:

- Include additional features that can enhance the predictive power of the model.
- Identify and incorporate features that align with domain expertise and have a significant impact on sepsis prediction.
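The oversampling technique mentioned in the first direction can be sketched with scikit-learn’s `resample` on a toy frame (the `imbalanced-learn` library’s `RandomOverSampler` is a common alternative; neither is used in this project):

```python
# Sketch: random oversampling of the minority class to balance a toy dataset.
import pandas as pd
from sklearn.utils import resample

toy = pd.DataFrame({'feature': range(9), 'Sepssis': [0] * 6 + [1] * 3})

minority = toy[toy['Sepssis'] == 1]
majority = toy[toy['Sepssis'] == 0]

# sample the minority class with replacement until it matches the majority size
upsampled = resample(minority, replace=True, n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, upsampled])
print(balanced['Sepssis'].value_counts())  # 6 of each class
```

Oversampling should be applied only to the training split, after the train/test split, so the test set still reflects the real class distribution.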

By addressing these limitations and pursuing these future directions, the project can make significant advancements in predicting sepsis. This, in turn, will lead to improved patient care and outcomes in the context of sepsis management.

**Resources**

You can find the original project on **GitHub**. All feedback and corrections are welcome.