Prediction of diabetes risk at an early stage

8 min read · Dec 17, 2020

World photo created by jcomp — www.freepik.com

In this article, we will develop two machine learning models for predicting diabetes risk at an early stage. We will use the early-stage diabetes risk prediction dataset, which can be found on the UCI Machine Learning Repository. Diabetes is a lifelong disease in which the body cannot produce enough insulin or cannot use it properly. Early detection of the disease can help people avoid chronic complications.

I live in sub-Saharan Africa, where obtaining testing kits for common diseases like malaria and typhoid is still hard. As a result, many people, especially those with low incomes (who form the majority), do not even think about taking tests for diseases like Type 2 diabetes, which is now on the rise in the region. According to the Lancet, the rate of undiagnosed diabetes is high in most countries of sub-Saharan Africa, and individuals who are unaware they have the disorder are at very high risk of chronic complications. Diabetes can be controlled and even avoided through small changes in certain habits, especially when it is detected early. The number of adults estimated to be living with diabetes in sub-Saharan Africa (SSA) in 2017 was 15.5 (9.8–27.8) million, with a regional prevalence of about 6% and associated healthcare costs of USD 3.3 billion. This number is expected to grow by over 162% in the next 25 years. These figures point to a need for low-cost screening, especially screening based on symptoms.

In this article, we will develop two machine learning models: a Support Vector Machine and a Random Forest. Let’s dive into the data.

Preprocessing and Exploration

Preprocessing is crucial in data science: it is where we deal with null values, remove irrelevant columns, clean the data, and create features that help the model perform better.

Let’s start by importing the libraries needed for this stage of the project. The “%matplotlib inline” magic renders plots directly in the notebook. We also use seaborn’s “set_style” with “darkgrid” so that all plots get a gray background with grid lines.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("darkgrid")
%matplotlib inline

After loading the data into the notebook, we found that it has 520 rows and 17 columns.

data = pd.read_csv("diabetes_data_upload.csv")
data.shape
data.info()
data.head()

(520, 17)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 520 entries, 0 to 519
Data columns (total 17 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Age                 520 non-null    int64 
 1   Gender              520 non-null    object
 2   Polyuria            520 non-null    object
 3   Polydipsia          520 non-null    object
 4   sudden weight loss  520 non-null    object
 5   weakness            520 non-null    object
 6   Polyphagia          520 non-null    object
 7   Genital thrush      520 non-null    object
 8   visual blurring     520 non-null    object
 9   Itching             520 non-null    object
 10  Irritability        520 non-null    object
 11  delayed healing     520 non-null    object
 12  partial paresis     520 non-null    object
 13  muscle stiffness    520 non-null    object
 14  Alopecia            520 non-null    object
 15  Obesity             520 non-null    object
 16  class               520 non-null    object
dtypes: int64(1), object(16)
memory usage: 69.2+ KB

The data is clean and has no nulls, so we can proceed to further exploration.
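A quick way to back up the "no nulls" observation is to count missing values per column and look at the class balance. The sketch below uses a tiny hypothetical frame (`toy`) in place of the real CSV, so the values are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset (same column style, made-up rows).
toy = pd.DataFrame({
    "Age": [40, 58, 41, 45],
    "Polyuria": ["No", "No", "Yes", "No"],
    "class": ["Positive", "Positive", "Positive", "Negative"],
})

# Number of missing values in each column; all zeros means no nulls.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# How many rows fall into each target class.
class_counts = toy["class"].value_counts()
print(class_counts)
```

On the real data, the same two calls would be applied to `data` instead of `toy`.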

All the columns in the dataset are of type object except the Age column, so we’ll look at Age first and then proceed to the rest.

Age

# histplot is the replacement for distplot(kde=False), which newer seaborn releases deprecate
sns.histplot(data["Age"], color = "r")
plt.title("Distribution of the Individuals' Ages")
plt.show()

Since most of the data is categorical, we used seaborn plots such as catplot, violinplot, and boxplot to visualize the relationships between the variables.

We used “Age” as the y-axis variable on all the plots and the target variable “class” as the hue, to show whether the people who had or lacked each symptom tested positive or negative. Below are some of the plots we made; the rest can be found on my GitHub.

Polyphagia

sns.catplot(x="Polyphagia", y="Age", hue="class", data=data)

We also used bar plots to get a clear plot that shows the number of people that had symptoms of Polyphagia.

In these graphs, the one on the right shows the number of people that exhibited symptoms of Polyphagia, while the one on the left shows those that had no symptoms of the condition. From the graphs, it’s clear that most of the people that had Polyphagia tested positive for diabetes.
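The counts behind bar plots like these can also be tabulated directly. The following is a minimal sketch using `pd.crosstab` on a small made-up frame (`toy`); the numbers are not from the real dataset.

```python
import pandas as pd

# Made-up rows for illustration; the real data has 520 rows.
toy = pd.DataFrame({
    "Polyphagia": ["Yes", "Yes", "Yes", "No", "No", "No"],
    "class": ["Positive", "Positive", "Negative",
              "Negative", "Negative", "Positive"],
})

# Rows: symptom present or not. Columns: test result. Cells: number of people.
counts = pd.crosstab(toy["Polyphagia"], toy["class"])
print(counts)

# Share of each symptom group that tested positive.
positive_share = counts["Positive"] / counts.sum(axis=1)
print(positive_share)
```

A table like this makes it easy to compare the positive share between the "Yes" and "No" groups without reading values off a chart.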

Delayed Healing

data[data["delayed healing"] == "No"]["class"].value_counts().plot.barh()
plt.show()
data[data["delayed healing"] == "Yes"]["class"].value_counts().plot.barh()
plt.show()

These graphs show the class counts for the people that had symptoms of delayed healing and for those that didn’t. The two graphs follow almost the same trend, which indicates that this symptom doesn't have a large bearing on whether a person turns out to be positive or negative.

Alopecia

This is the one case where most people with a symptom turned out to be negative.

Summary of Symptoms:

1. Symptoms that are more common in the people that turned out negative: Alopecia

2. Symptoms that are more common in the people that turned out positive: Polyuria, Polydipsia, sudden weight loss, Polyphagia, Irritability (however, few positive people show signs of irritability), partial paresis, visual blurring

3. Symptoms that are common in both people that turned out positive and negative (the counts reflect the population, so the symptoms that fall here have more positive cases because there are more positive cases in the population): delayed healing (has an equal number of positive and negative cases; however, among the people that turned out negative it is common in those above 45), Genital thrush, muscle stiffness, Obesity, Itching. These are the columns that I also dropped.

To make all the categorical data numerical, we carried out dummy encoding. Dummy encoding creates one new column for each category of a column, so every two-category column (such as Yes/No) becomes two columns.

df = pd.get_dummies(data)
df.head()

df = df.drop(columns = [
    'Gender_Female', 'Polyuria_No', 'Polydipsia_No', 'sudden weight loss_No',
    'weakness_No', 'Polyphagia_No', 'Genital thrush_No', 'visual blurring_No',
    'Itching_No', 'Irritability_No', 'delayed healing_No', 'partial paresis_No',
    'muscle stiffness_No', 'Alopecia_No', 'Obesity_No', 'class_Negative',
])
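For columns with exactly two categories, the same result can be sketched more compactly with the `drop_first=True` option of `pd.get_dummies`, which drops the first category of each column in sorted order (here `No`, `Female`, and `Negative`). The frame `toy` below is a hypothetical miniature of the dataset, not the real data.

```python
import pandas as pd

# Hypothetical miniature of the dataset.
toy = pd.DataFrame({
    "Age": [40, 58, 41],
    "Gender": ["Male", "Female", "Male"],
    "Polyuria": ["No", "Yes", "Yes"],
    "class": ["Positive", "Negative", "Positive"],
})

# Full dummy encoding: one column per category.
full = pd.get_dummies(toy)
print(list(full.columns))

# drop_first=True keeps a single indicator per two-category column.
compact = pd.get_dummies(toy, drop_first=True)
print(list(compact.columns))
```

The explicit drop list has the advantage of showing exactly which columns go away, while `drop_first=True` avoids typing every name.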

We used correlations to quantify the relationships and to back up what we had seen in the visualizations during the earlier exploration. High correlation doesn't mean causation, but it does show that a relationship exists. We did this last, as the final check on which columns to drop.

df.corr("pearson")

We dropped all the columns that had been suggested for dropping earlier, along with those whose correlation was weaker than 0.1 in absolute value (between -0.1 and 0.1).

dfa = df.drop(columns = ["Gender_Male","Genital thrush_Yes", "Itching_Yes", "delayed healing_Yes", "Obesity_Yes"], axis = 1)
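One way to make the correlation cut-off explicit is to compute each feature's Pearson correlation with the target and list those that fall below the threshold. This is a sketch on made-up 0/1 data, assuming the cut-off refers to correlation with `class_Positive`; the column names `useful` and `noise` are hypothetical.

```python
import pandas as pd

# Made-up binary features and target.
toy = pd.DataFrame({
    "useful": [1, 1, 1, 0, 0, 0, 0, 1],
    "noise": [1, 0, 1, 0, 1, 0, 1, 0],
    "class_Positive": [1, 1, 1, 1, 0, 0, 0, 0],
})

# Pearson correlation of every feature with the target column.
target_corr = toy.corr(method="pearson")["class_Positive"].drop("class_Positive")
print(target_corr)

# Features whose absolute correlation with the target is below the threshold.
threshold = 0.1
weak_features = target_corr[target_corr.abs() < threshold].index.tolist()
print(weak_features)

# Drop the weak features, keeping the rest and the target.
reduced = toy.drop(columns=weak_features)
```

Using the absolute value matters here, since a strongly negative correlation is just as informative as a strongly positive one.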

Modeling the data

We will develop two machine learning models: a Support Vector Machine and a Random Forest.

Modeling 1: Support Vector Machine

First, we separated the features from the labels (target variable)

feat = dfa.drop(columns=['class_Positive'],axis=1)
label = dfa["class_Positive"]

Then, we split the data into a training set and a test set. The training set is used to fit the model, while the test set is held back and used only to check how well the fitted model predicts data it has not seen.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(feat, label, test_size=0.15)
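Because the split is random, passing `random_state` makes it reproducible, and `stratify` keeps the class proportions similar in both sets. The sketch below uses synthetic arrays rather than the article's `feat` and `label`, and the seed value 42 is an arbitrary assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 samples, 3 binary features, 60% positive labels.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 3))
y = np.array([1] * 60 + [0] * 40)

# Fixed seed for repeatability; stratify on y to preserve the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())
```

With a fixed seed, rerunning the notebook gives the same split and therefore the same test accuracy.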

Next, we standardize the features. Scaling puts all features on a comparable range, which matters for models like the SVM and also helps when some values are so large that they make computation difficult.

from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
X_train = sc_x.fit_transform(X_train)
# Reuse the statistics learned from the training set; do not refit on the test set
X_test = sc_x.transform(X_test)
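The usual pattern is to learn the scaling statistics from the training set only and reuse them on the test set, so that no information from the test set leaks into preprocessing. A minimal sketch with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data.
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[20.0], [40.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train only
test_scaled = scaler.transform(test)        # reuses the training mean and std

print(scaler.mean_, scaler.scale_)
print(test_scaled.ravel())
```

The test value 20.0 maps to 0 because it equals the training mean, which would not be the case if the scaler were refitted on the test set.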

Finally, we got to the machine learning step, starting with the necessary import. My biggest surprise about machine learning is that the code needed to create the model is only three lines. However, a lot of work goes into getting to those three lines, as you have seen so far.

from sklearn.svm import SVC
support_vector_classifier = SVC(kernel='rbf')
support_vector_classifier.fit(X_train,y_train)
y_pred_svc = support_vector_classifier.predict(X_test)

To evaluate our model, we imported the confusion matrix function, which takes the true labels and the predicted labels as inputs and returns a matrix of counts.

from sklearn.metrics import confusion_matrix
cm_support_vector_classifier = confusion_matrix(y_test,y_pred_svc)
print(cm_support_vector_classifier,end='\n\n')

[[34  0]
 [ 0 44]]

In scikit-learn’s confusion matrix, rows are the true labels and columns are the predicted labels, ordered by label value. With our 0/1 labels, the first row and first column hold the number of negatives that our model predicted correctly (also called ‘True Negatives’), and the second row and second column hold the number of positives that our model predicted correctly (also called ‘True Positives’). The sum of these two numbers is the number of correct predictions the model made.

numerator = cm_support_vector_classifier[0][0] + cm_support_vector_classifier[1][1]
denominator = sum(cm_support_vector_classifier[0]) + sum(cm_support_vector_classifier[1])
acc_svc = (numerator/denominator) * 100
print("Accuracy : ",round(acc_svc,2),"%")

Accuracy : 100.0 %
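For reference, scikit-learn orders the confusion matrix with true labels as rows and predicted labels as columns, sorted by label value, so with 0/1 labels the cells read `[[TN, FP], [FN, TP]]`. The sketch below uses made-up labels to show this layout and checks the hand-computed accuracy against `accuracy_score`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels: 1 = positive, 0 = negative.
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 1, 1, 1, 1, 1, 0])

cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)

# Correct predictions sit on the diagonal of the matrix.
manual_accuracy = np.trace(cm) / cm.sum() * 100
library_accuracy = accuracy_score(y_true, y_pred) * 100
print(round(manual_accuracy, 2), round(library_accuracy, 2))
```

Both routes give the same number; `accuracy_score` simply saves the indexing.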

Since train_test_split shuffles the data randomly, running the model again would give a different accuracy. To get a more stable estimate, we use cross-validation: we segment the data into parts, train on all but one part, test on the remaining part, and repeat so that each part is used for testing once. The sklearn library has a provision for it.

Therefore, we import cross_val_score

from sklearn.model_selection import cross_val_score
cross_val_svc = cross_val_score(estimator = SVC(kernel = 'rbf'), X =X_train, y = y_train, cv = 10, n_jobs = -1)
print("Cross Validation Accuracy : ",round(cross_val_svc.mean() * 100, 2),"%")

Cross-Validation Accuracy: 93.21 %
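When scaling is part of the workflow, one option is to wrap the scaler and the classifier in a pipeline so that each cross-validation fold fits the scaler on its own training portion. This is a sketch on synthetic data from `make_classification`, not the article's dataset, and the fold settings and seeds are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary classification data (stand-in for the real features).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Scaler and model are fitted together inside every fold.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Shuffled, stratified folds with a fixed seed for repeatable scores.
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=folds)

print("Cross Validation Accuracy : ", round(scores.mean() * 100, 2), "%")
```

Looking at the spread of `scores` across folds, not just the mean, gives a sense of how much the accuracy depends on the particular split.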

Modeling 2: Random Forest

from sklearn.ensemble import RandomForestClassifier
random_forest_classifier = RandomForestClassifier()
random_forest_classifier.fit(X_train,y_train)
y_pred_rfc = random_forest_classifier.predict(X_test)

Confusion Matrix

cm_random_forest_classifier = confusion_matrix(y_test,y_pred_rfc)
print(cm_random_forest_classifier,end="\n\n")

[[32  2]
 [ 0 44]]

numerator = cm_random_forest_classifier[0][0] + cm_random_forest_classifier[1][1]
denominator = sum(cm_random_forest_classifier[0]) + sum(cm_random_forest_classifier[1])
acc_rfc = (numerator/denominator) * 100
print("Accuracy : ",round(acc_rfc,2),"%")

Accuracy :  97.44 %
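Accuracy alone hides which kind of mistake the model makes. As a worked example, the counts from the Random Forest matrix printed above can be unpacked into precision and recall, assuming scikit-learn's `[[TN, FP], [FN, TP]]` layout with `1` as the positive class.

```python
import numpy as np

# Counts copied from the Random Forest confusion matrix reported above.
cm = np.array([[32, 2],
               [0, 44]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that were found

print(round(accuracy * 100, 2), round(precision * 100, 2), round(recall * 100, 2))
```

Under that assumption, both errors on this test split are false positives: two negative cases were flagged as positive, and no positive case was missed.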

Cross-validation

cross_val_rfc = cross_val_score(estimator=RandomForestClassifier(), X=X_train, y=y_train, cv=10, n_jobs=-1)
print("Cross Validation Accuracy : ",round(cross_val_rfc.mean() * 100, 2),"%")

Cross-Validation Accuracy: 96.16 %

Both models exhibited high accuracy. The rest of the notebook can be found on my GitHub.

Conclusion

Many people in sub-Saharan Africa only find out that they have diabetes after it has reached a chronic stage. This happens largely because most people don't have access to a personal doctor who can periodically monitor them for such ailments. Models like these can be deployed in apps on mobile phones, which are now common in the region, to help people detect the disease before it is too late.

Written by Simon Kirabo