Applying Machine Learning on Diabetes Dataset

Abhi Bothera
9 min read · Dec 21, 2020


This article shows how diabetes-related data can be used to predict whether a person has diabetes. More specifically, it focuses on how machine learning can be used to predict diseases such as diabetes. Along the way, it covers data exploration, data cleaning, feature selection, model selection, and model evaluation, and applies each of them in practice.

What is Diabetes?

Diabetes is a disease that occurs when the blood glucose level becomes too high, which eventually leads to other health problems such as heart disease and kidney disease. It is largely associated with the consumption of highly processed food and poor eating habits, among other factors. According to the WHO, the number of people with diabetes has been increasing over the years.


  • Jupyter Notebook.
  • Anaconda (Scikit Learn, Numpy, Pandas, Matplotlib, Seaborn)
  • Python 3.+
  • Basic comprehension of supervised machine learning methods: especially classification.

Step 0 — Data Preparation

For a data scientist, the most tedious task is acquiring and preparing a data set. Even though data is abundant these days, it is still hard to find a suitable data set for the problem you are trying to tackle. If no suitable data set can be found, you may have to create your own.

In this tutorial, we will not create our own data set; instead, we will use an existing one called the “Pima Indians Diabetes Database”, provided by the UCI Machine Learning Repository (a well-known repository of machine learning data sets). We will carry out the machine learning workflow on this diabetes data set.

Step 1 — Data Exploration

When encountering a data set, we should first analyze it and “get to know” it. This step is necessary to become familiar with the data, to gain some understanding of the potential features, and to see whether data cleaning is needed.

First, we import the necessary libraries and the data set into the Jupyter notebook. We can then observe the columns of the data set.

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

diabetes = pd.read_csv('datasets/diabetes.csv')

The data set can be examined using the pandas’ head() method.
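As a minimal sketch of that inspection, the snippet below calls head() on a tiny hand-made frame so that it runs without the CSV file. The column names mirror the data set, but the frame `toy` and its values are placeholders invented for illustration, not real records.

```python
import pandas as pd

# Toy stand-in for diabetes.csv: illustrative values only, not real records.
toy = pd.DataFrame({
    "Glucose": [148, 85, 183, 89],
    "BMI": [33.6, 26.6, 23.3, 28.1],
    "Outcome": [1, 0, 1, 0],
})

# head(n) returns the first n rows; without an argument it returns the first 5.
first_rows = toy.head(3)
print(first_rows)
```

On the real data, the equivalent call would be diabetes.head().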


Dimensions of the data set can be found using the pandas DataFrame’s ‘shape’ attribute.

print("Diabetes data set dimensions : {}".format(diabetes.shape))

We can see that the data set contains 768 rows and 9 columns. ‘Outcome’ is the column we are going to predict; it says whether the patient is diabetic or not: 1 means the person is diabetic and 0 means the person is not. Out of the 768 persons, 500 are labeled 0 (non-diabetic) and 268 are labeled 1 (diabetic).
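One way to obtain such per-class counts is to group by the label column and take the group sizes. The sketch below assumes a toy ‘Outcome’ column with invented labels; it is not necessarily the call used to produce the counts quoted above.

```python
import pandas as pd

# Toy labels invented for illustration; the real column is diabetes['Outcome'].
toy = pd.DataFrame({"Outcome": [0, 1, 0, 0, 1, 0]})

# size() counts the rows in each group, i.e. the number of samples per class.
class_counts = toy.groupby("Outcome").size()
print(class_counts)
```

Checking the class balance early is useful because accuracy can be misleading when one class dominates.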


Data visualization is an essential part of data science. It helps both to understand the data and to explain it to another person. Python has several useful visualization libraries, such as Matplotlib and Seaborn.

In this tutorial, we will use pandas’ visualization, which is built on top of matplotlib, to find the data distribution of the features.

The following code draws a separate set of histograms for each of the two outcomes.

diabetes.groupby('Outcome').hist(figsize=(9, 9))

Step 2 — Data Cleaning

The next phase of the machine learning workflow is data cleaning. It is considered one of the crucial steps of the workflow, since it can make or break the model. There is a saying in machine learning, “Better data beats fancier algorithms”, which suggests that better data gives you better resulting models.

Various factors that are to be considered in the data cleaning process are as follows:

  1. Irrelevant or Duplicate observations.
  2. Same category occurring multiple times, i.e, bad labeling of data.
  3. Unexpected outliers.
  4. Missing or null data points.

Missing or Null Datapoints

We can find any missing or null data points in the data set with a pandas function such as isnull().


It can be observed that there are no data points missing in the data set.
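A minimal sketch of such a check follows, assuming isnull() combined with sum() as the counting idiom. The frame `toy` is invented and deliberately contains one missing value so that the output is not all zeros; on the real data set every count would be zero, as noted above.

```python
import numpy as np
import pandas as pd

# Toy frame with one deliberately missing Glucose value (illustrative only).
toy = pd.DataFrame({
    "Glucose": [148.0, np.nan, 183.0],
    "BMI": [33.6, 26.6, 23.3],
})

# isnull() marks each cell as True/False; sum() then counts the True cells per column.
null_counts = toy.isnull().sum()
# A single yes/no answer for the whole frame.
any_null = bool(toy.isnull().values.any())
print(null_counts)
print(any_null)
```

Note that this check only finds values pandas treats as missing (NaN/None); placeholder values such as 0 pass through unnoticed.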

Unexpected Outliers

When examining the histograms, we can see that there are some outliers in certain columns. We will analyze those outliers further and decide what to do about them.

Blood pressure: By looking at the data, we can see that there are zero values for blood pressure. These readings are evidently wrong, because a living person cannot have a diastolic blood pressure of zero. There are 35 records where the value is 0.

print("Total : ", diabetes[diabetes.BloodPressure == 0].shape[0])
Total :  35
print(diabetes[diabetes.BloodPressure == 0].groupby('Outcome')['Age'].count())
Outcome
0    19
1    16
Name: Age, dtype: int64

Plasma glucose levels: Even after fasting, the glucose level would not be as low as zero. Therefore, zero is an invalid reading. There are 5 records where the value is 0.

print("Total : ", diabetes[diabetes.Glucose == 0].shape[0])
Total :  5
print(diabetes[diabetes.Glucose == 0].groupby('Outcome')['Age'].count())
Outcome
0    3
1    2
Name: Age, dtype: int64

Skin fold thickness: For normal people, skin fold thickness can’t be less than 10 mm, let alone zero. There are 227 records where the value is 0.

print("Total : ", diabetes[diabetes.SkinThickness == 0].shape[0])
Total :  227
print(diabetes[diabetes.SkinThickness == 0].groupby('Outcome')['Age'].count())
Outcome
0    139
1     88
Name: Age, dtype: int64

BMI: It should not be 0 or close to zero unless the person is severely underweight, which could be life-threatening. There are 11 records where the value is 0.

print("Total : ", diabetes[diabetes.BMI == 0].shape[0])
Total :  11
print(diabetes[diabetes.BMI == 0].groupby('Outcome')['Age'].count())
Outcome
0    9
1    2
Name: Age, dtype: int64

Insulin: In rare situations a person can have zero insulin, but the data shows a total of 374 records where the value is 0.

print("Total : ", diabetes[diabetes.Insulin == 0].shape[0])
Total :  374
print(diabetes[diabetes.Insulin == 0].groupby('Outcome')['Age'].count())
Outcome
0    236
1    138
Name: Age, dtype: int64

Here are several ways to handle invalid data values:

  1. Ignore/remove these cases: This is not possible in most cases, because it would mean losing valuable information. Here, the “skin thickness” and “insulin” columns have a lot of invalid points. However, it might work for the “BMI”, “glucose” and “blood pressure” data points.

  2. Put average/mean values: This might work for some data sets, but in our case putting a mean value in the blood pressure column would send a wrong signal to the model.

  3. Avoid using features: It is possible to leave the features with a lot of invalid values out of the model. This may work for “skin thickness”, but it is hard to predict that.
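The three options above can be sketched as follows. This is an illustration on an invented frame `toy`, not the treatment the article ends up applying; the choice of which column to drop or impute is an assumption made only for the example.

```python
import numpy as np
import pandas as pd

# Toy frame where 0 stands for an invalid reading (values are illustrative only).
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "Insulin": [0, 94, 0, 168],
    "Outcome": [1, 0, 1, 0],
})

# Mark zeros as missing so that pandas treats them as invalid values.
cols = ["Glucose", "Insulin"]
marked = toy.copy()
marked[cols] = marked[cols].replace(0, np.nan)

# Option 1: remove the rows with an invalid Glucose value.
dropped = marked.dropna(subset=["Glucose"])

# Option 2: fill invalid Insulin values with the mean of the valid ones.
imputed = marked.fillna({"Insulin": marked["Insulin"].mean()})

# Option 3: leave the heavily affected feature out altogether.
reduced = toy.drop(columns=["Insulin"])

print(len(dropped), imputed["Insulin"].tolist(), list(reduced.columns))
```

Converting the zeros to NaN first matters for option 2: otherwise the zeros would be included in the mean and pull it down.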

By the end of the data cleaning process, we have come to the conclusion that the given data set is incomplete. Since this is a demonstration of machine learning, we will proceed with the given data with some minor adjustments.

We will remove the rows in which “BloodPressure”, “BMI” and “Glucose” are zero.

diabetes_mod = diabetes[(diabetes.BloodPressure != 0) & (diabetes.BMI != 0) & (diabetes.Glucose != 0)]
print(diabetes_mod.shape)
(724, 9)

Step 3 — Feature Engineering

Feature engineering is the process of transforming the gathered data into features that better represent the problem we are trying to solve to the model, in order to improve its performance and accuracy.

Feature engineering creates more input features from the existing features and also combines several features to produce more intuitive features to feed to the model.

“Feature engineering enables us to highlight the important features and facilitates bringing domain expertise on the problem to the table. It also allows us to avoid overfitting the model despite providing many input features”.

The domain of the problem we are trying to tackle requires lots of related features. Since the data set is already provided, and by examining the data, we can’t create or dismiss any further data at this point. In the data set, we have the following features: Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, and Age.

By a crude observation, we can say that ‘Skin Thickness’ is not an indicator of diabetes. Still, we cannot conclude at this point that it is unusable.

Therefore, we will use all the available features. We separate the data set into features and the response that we are going to predict. We will assign the features to the X variable and the response to the y variable.

feature_names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
X = diabetes_mod[feature_names]
y = diabetes_mod.Outcome

Generally, feature engineering is performed before selecting the model. However, for this tutorial we follow a different approach. Initially, we will use all the features provided in the data set for the model; we will revisit feature engineering to discuss the feature importance of the selected model.

Step 4 — Model Selection

Model selection, or algorithm selection, is the most exciting phase and the heart of machine learning. It is the phase where we select the model that performs best on the data set at hand.

First, we will calculate the “Classification Accuracy (Testing Accuracy)” of a given set of classification models with their default parameters, to determine which model performs better on the diabetes data set.

We will import the necessary libraries into the notebook. We import seven classifiers, namely K-Nearest Neighbors, Support Vector Classifier, Logistic Regression, Decision Tree, Gaussian Naive Bayes, Random Forest, and Gradient Boosting, as contenders for the best classifier.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

The classifiers can be initialized with their default parameters and added to a model list.

models = []
models.append(('KNN', KNeighborsClassifier()))
models.append(('SVC', SVC()))
models.append(('LR', LogisticRegression()))
models.append(('DT', DecisionTreeClassifier()))
models.append(('GNB', GaussianNB()))
models.append(('RF', RandomForestClassifier()))
models.append(('GB', GradientBoostingClassifier()))

Evaluation Methods

It is general practice to avoid training and testing on the same data. The reasons are that the goal of the model is to predict out-of-sample data, and that the model could be overly complex, leading to overfitting. To avoid these problems, there are two precautions.

  1. Train/Test Split
  2. K-Fold Cross-Validation

For the train/test split, we will import “train_test_split”, and for K-Fold Cross-Validation, we will import “cross_val_score”. “accuracy_score” is used to evaluate the accuracy of the model in the train/test split method.

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

The above-mentioned methods will be used to find the best performing base models.
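As a rough sketch of the train/test split method, the snippet below fits one classifier on a training portion and scores it on the held-out portion. It uses synthetic data from make_classification as a stand-in for X and y; the 25% test size, the stratify argument, the max_iter setting, and all variable names are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the features X and the response y (not the diabetes data).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Hold out 25% of the rows for testing; stratify keeps the class ratio similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# Fit on the training part only, then score on the unseen test part.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
test_accuracy = accuracy_score(y_test, model.predict(X_test))
print(test_accuracy)
```

The resulting accuracy depends on which rows happen to land in the test set, which is the main weakness that K-Fold cross-validation addresses.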

K-Fold Cross-Validation

This method splits the data set into K equal partitions (“folds”), then uses 1 fold as the testing set and the union of the other folds as the training set. The model is then tested for accuracy. The process repeats these steps K times, using a different fold as the testing set each time. The average testing accuracy across the K runs is the testing accuracy.

Pros: A more accurate estimate of out-of-sample accuracy. More “efficient” use of data (every observation is used for both training and testing).

Cons: Much slower than Train/Test split.
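To make the fold mechanics concrete, the sketch below splits ten sample indices into five folds with scikit-learn’s KFold and prints which indices are used for training and testing in each round. The array `data`, the fold count, and the random seed are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten sample indices standing in for ten rows of a data set.
data = np.arange(10)

# Five folds of two samples each; shuffling assigns samples to folds at random.
kfold_demo = KFold(n_splits=5, shuffle=True, random_state=10)

test_folds = []
for train_idx, test_idx in kfold_demo.split(data):
    # In every round one fold is held out for testing and the rest is used for training.
    test_folds.append(test_idx)
    print("train:", train_idx, "test:", test_idx)
```

Across the five rounds, every sample appears in the testing set exactly once and in the training set four times.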

K-Fold Cross Validation with Scikit Learn:

We will move forward with K-Fold cross-validation as it is more accurate and uses the data efficiently. We will train the models using 10-fold cross-validation and calculate the mean accuracy of the models. “cross_val_score” provides its own training and accuracy calculation interface.

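from sklearn.model_selection import KFold
names = []
scores = []
# shuffle=True: recent scikit-learn versions reject random_state without shuffling
kfold = KFold(n_splits=10, shuffle=True, random_state=10)
for name, model in models:
    # mean accuracy over the 10 folds
    score = cross_val_score(model, X, y, cv=kfold, scoring='accuracy').mean()
    names.append(name)
    scores.append(score)
kf_cross_val = pd.DataFrame({'Name': names, 'Score': scores})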

Accuracy scores can be plotted using seaborn:

axis = sns.barplot(x = 'Name', y = 'Score', data = kf_cross_val)
axis.set(xlabel='Classifier', ylabel='Accuracy')
for p in axis.patches:
    height = p.get_height()
    axis.text(p.get_x() + p.get_width()/2, height + 0.005, '{:1.4f}'.format(height), ha="center")

We can see that Gaussian Naive Bayes, Logistic Regression, Gradient Boosting, and Random Forest have performed better than the rest. At the base level, Logistic Regression performs better than the other algorithms.


In this article, we discussed the basic machine learning workflow steps, namely data exploration, data cleaning, feature engineering basics, and model selection, using the Scikit Learn library.