Skip to the content.

Visualization and Statistics - Diabetes Analysis


Context


Diabetes is one of the most frequent diseases worldwide and the number of diabetic patients are growing over the years. The main cause of diabetes remains unknown, yet scientists believe that both genetic factors and environmental lifestyle play a major role in diabetes.

A few years ago research was done on a tribe in America which is called the Pima tribe (also known as the Pima Indians). In this tribe, it was found that the ladies are prone to diabetes very early. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients were females at least 21 years old of Pima Indian heritage.


Objective


Here, we are analyzing different aspects of Diabetes in the Pima Indians tribe by doing Exploratory Data Analysis.


Data Dictionary


The dataset has the following information:

Q 1: Import the necessary libraries and briefly explain the use of each library (3 Marks)

# remove _____ & write the appropriate library name

import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

Write your Answer here:

Ans 1: Numpy is a library used for mathematical operations with arrays and matrices. Pandas is a library used to extract, manipulate and clean dataframes (datasets). Matplotlib is MATLAB-based library used to visualize data. Seaborn is also a library used for visualization with more elaborated tools compare to Matplotlib.

Q 2: Read the given dataset (1 Mark)

# remove _____ & write the appropriate function name

pima = pd.read_csv("diabetes.csv")

Q3. Show the last 10 records of the dataset. How many columns are there? (1 Mark)

# remove ______ and write the appropriate number in the function

pima.tail(10)
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
758 1 106 76 20 79 37.5 0.197 26 0
759 6 190 92 20 79 35.5 0.278 66 1
760 2 88 58 26 16 28.4 0.766 22 0
761 9 170 74 31 79 44.0 0.403 43 1
762 9 89 62 20 79 22.5 0.142 33 0
763 10 101 76 48 180 32.9 0.171 63 0
764 2 122 70 27 79 36.8 0.340 27 0
765 5 121 72 23 112 26.2 0.245 30 0
766 1 126 60 20 79 30.1 0.349 47 1
767 1 93 70 31 79 30.4 0.315 23 0

Write your Answer here:

Ans 3: 9 columns

Q4. Show the first 10 records of the dataset (1 Mark)

# remove _____ & write the appropriate function name and the number of rows to get in the output

pima.head(10)
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
0 6 148 72 35 79 33.600000 0.627 50 1
1 1 85 66 29 79 26.600000 0.351 31 0
2 8 183 64 20 79 23.300000 0.672 32 1
3 1 89 66 23 94 28.100000 0.167 21 0
4 0 137 40 35 168 43.100000 2.288 33 1
5 5 116 74 20 79 25.600000 0.201 30 0
6 3 78 50 32 88 31.000000 0.248 26 1
7 10 115 69 20 79 35.300000 0.134 29 0
8 2 197 70 45 543 30.500000 0.158 53 1
9 8 125 96 20 79 31.992578 0.232 54 1

Q5. What do you understand by the dimension of the dataset? Find the dimension of the pima dataframe. (1 Mark)

# remove _____ & write the appropriate function name

pima.shape
(768, 9)

Write your Answer here:

Ans 5: The dataframe has 768 rows and 9 columns.

Q6. What do you understand by the size of the dataset? Find the size of the pima dataframe. (1 Mark)

# remove _____ & write the appropriate function name

pima.size
6912

Write your Answer here:

Ans 6: 768 rows x 9 columns = 6912 dataframe size.

Q7. What are the data types of all the variables in the data set? (2 Marks)

Hint: Use the info() function to get all the information about the dataset.

# remove _____ & write the appropriate function name

pima.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB

Write your Answer here:

Ans 7: All variables are numerical. Two are type float (BMI and DiabetesPedigreeFunction), the others are integer type.

Q8. What do we mean by missing values? Are there any missing values in the pima dataframe? (2 Marks)

# remove _____ & write the appropriate function name

pima.isnull().values.any()
False

Write your Answer here:

Ans 8: By missing values we mean that in the dataframe there are cells that do not contain any piece of information. With the above command we check whether pima dataframe has missing values. isnull() is a method that for each cell returns True when the cell is empty. The attribute values returns an array with all the Boolean values. With the method any() we are able to show that there are not missing values in pima because otherwise it would have returned True is any item was True.

Q9. What do the summary statistics of the data represent? Find the summary statistics for all variables except ‘Outcome’ in the pima data. Take one column/variable from the output table and explain all its statistical measures. (3 Marks)

# remove _____ & write the appropriate function name

pima.iloc[:,0:8].describe()
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age
count 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000
mean 3.845052 121.675781 72.250000 26.447917 118.270833 32.450805 0.471876 33.240885
std 3.369578 30.436252 12.117203 9.733872 93.243829 6.875374 0.331329 11.760232
min 0.000000 44.000000 24.000000 7.000000 14.000000 18.200000 0.078000 21.000000
25% 1.000000 99.750000 64.000000 20.000000 79.000000 27.500000 0.243750 24.000000
50% 3.000000 117.000000 72.000000 23.000000 79.000000 32.000000 0.372500 29.000000
75% 6.000000 140.250000 80.000000 32.000000 127.250000 36.600000 0.626250 41.000000
max 17.000000 199.000000 122.000000 99.000000 846.000000 67.100000 2.420000 81.000000

Write your Answer here:

Ans 9: Summary statistics of the data represent descriptive statistics on how the data is distributed across the sample: the mean, standard deviation, minimum, maximum, and quartiles (including median). For example, for the variable Age we have 768 values. The mean is around 33 years old and the standard deviation can be approximated by 12. The minimum age is 21 years old while the maximum age is 81. 25% statistic stands for first quartile and it describes how 25% of age values are between 21 years old and 24 years old. 50% statistic is the median while 75% statistic is the third quartile. That means that IQR (interquartile range) that includes 50% of the age values is between 41 years old and 24 years old (17 years of range). Because the mean is greater than the median, the data distibution is right skewed.

Q 10. Plot the distribution plot for the variable ‘BloodPressure’. Write detailed observations from the plot. (2 Marks)

# remove _____ & write the appropriate library name

sns.displot(pima['BloodPressure'], kind='kde')
plt.show()

png

Write your Answer here:

Ans 10: The Blood Pressure distribution is almost symmetrical around the value 70. It tends to a bell-shape form, so the curve is almost a Gaussian curve. Probably the mode is very similar to the mean and median.

Q 11. What is the ‘BMI’ of the person having the highest ‘Glucose’? (1 Mark)

# remove _____ & write the appropriate function name

pima[pima['Glucose']==pima['Glucose'].max()]['BMI']
661    42.9
Name: BMI, dtype: float64

Write your Answer here:

Ans 11: The person with the highest glucose has a Body Mass Index (BMI) of 42.9. According to cdc.gov that person is obese.

Q12.

12.1 What is the mean of the variable ‘BMI’?

12.2 What is the median of the variable ‘BMI’?

12.3 What is the mode of the variable ‘BMI’?

12.4 Are the three measures of central tendency equal?

(3 Marks)

# remove _____ & write the appropriate function name

m1 = pima['BMI'].mean()  # mean
print(m1)
m2 = pima['BMI'].median()  # median
print(m2)
m3 = pima['BMI'].mode()[0]  # mode
print(m3)
32.45080515543617
32.0
32.0

Write your Answer here:

Ans 12: The BMI mean is around 32. The median and the mode are also 32. That means that the three measures of central tendency are equal suggesting that the BMI distribution is Gaussian.

Q13. How many women’s ‘Glucose’ levels are above the mean level of ‘Glucose’? (1 Mark)

# remove _____ & write the appropriate function name

pima[pima['Glucose']>pima['Glucose'].mean()].shape[0]
343

Write your Answer here:

Ans 13: In the dataframe 343 Pima women have a glucose level above the mean.

Q14. How many women have their ‘BloodPressure’ equal to the median of ‘BloodPressure’ and their ‘BMI’ less than the median of ‘BMI’? (2 Marks)

# remove _____ & write the appropriate column name

pima[(pima['BloodPressure']==pima['BloodPressure'].median()) & (pima['BMI']<pima['BMI'].median())].shape[0]
22

Write your Answer here:

Ans 14: In the dataframe 22 Pima women have a blood pressure equal to the median value and the BMI smaller than the BMI median.

Q15. Create a pairplot for the variables ‘Glucose’, ‘SkinThickness’, and ‘DiabetesPedigreeFunction’. Write your observations from the plot. (4 Marks)

# remove _____ & write the appropriate function name

sns.pairplot(data=pima,vars=['Glucose', 'SkinThickness', 'DiabetesPedigreeFunction'], hue='Outcome')
plt.show()

png

Write your Answer here:

Ans 15: There are several observations to make. First of all, the glucose distribution of diabetic Pima women can be distinguished from the non-diabetic women distribution. It is harder to distinguish the outcome variable from the skin thickness and diabetes pedigree function distributions. At first sight, the glucose does not seem to have a high correlation with the skin thickness. There is a high value on the top right of the graph but it could be an outlier. The glucose level has a more positive correlation with the diabetes pedigree function but probably not extremely high. Anyway, both the glucose vs skin thickness and glucose vs pedigree function shows distinguishable sub-groups based on the outcome variable. Last, the skinthickness does not seem to have a great correlation with the diabetes pedigree function and the graph does not differentiate very well the outcome variable.

Q16. Plot the scatterplot between ‘Glucose’ and ‘Insulin’. Write your observations from the plot. (2 Marks)

# remove _____ & write the appropriate function name

sns.scatterplot(x='Glucose',y='Insulin',data=pima)
plt.show()

png

Write your Answer here:

Ans 16: From this scatterplot we can see how the glucose is positively correlated with the insulin level. However, there is an odd situation when the insulin value is around 100 where it looks like many glucose values correspond to that insuline value. This should be checked more carefully before to end up with conclusions.

Q 17. Plot the boxplot for the ‘Age’ variable. Are there outliers? (2 Marks)

# remove _____ & write the appropriate function and column name 

plt.boxplot(pima['Age'])

plt.title('Boxplot of Age')
plt.ylabel('Age')
plt.show()

png

Write your Answer here:

Ans 17: Yes, there are outliers and they are all above the value that represents 1.5 times the interquartile range.

Q18. Plot histograms for the ‘Age’ variable to understand the number of women in different age groups given whether they have diabetes or not. Explain both histograms and compare them. (3 Marks)

# remove _____ & write the appropriate function and column name

plt.hist(pima[pima['Outcome']==1]['Age'], bins = 5)
plt.title('Distribution of Age for Women who has Diabetes')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

png

# remove _____ & write the appropriate function and column name

plt.hist(pima[pima['Outcome']==0]['Age'], bins = 5)
plt.title('Distribution of Age for Women who do not have Diabetes')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

png

Write your Answer here:

Ans 18: Both histograms seem to suggest that the age population of non-diabetic Pima women and diabetic Pima women follow an exponential distribution. However, the non-diabetic women sample is more distributed across younger ages compared to the diabetic histogram. In fact, the most common age for non-diabetic women lies in the younger range (around 20-30 years old) while the diabetic women age is more spread out but it keeps a pick for young adulthood. This confirms that Pima women tend to be diagnosed with diabetes early. Also, the non-diabetic young women might have diabetes when they are older as the first histogram describes but it remains more common to be diagnosed at younger age than older age.

Q 19. What is the Interquartile Range of all the variables? Why is this used? Which plot visualizes the same? (2 Marks)

# remove _____ & write the appropriate variable name

Q1 = pima.quantile(0.25)
Q3 = pima.quantile(0.75)
IQR = Q3 - Q1
print(IQR)
Pregnancies                  5.0000
Glucose                     40.5000
BloodPressure               16.0000
SkinThickness               12.0000
Insulin                     48.2500
BMI                          9.1000
DiabetesPedigreeFunction     0.3825
Age                         17.0000
Outcome                      1.0000
dtype: float64

Write your Answer here:

Ans 19: The Interquartile Range (IQR) is a descriptive statistic for variability. It represents the interval where 50% of the values are and it includes the median. In other words, from the first quartile to the third quartile. It can be used to detect outliers: a common practice is to check whether the values are within 1.5 times the Interquartile range, otherwise they can be detected as outliers. The plot to visualize IQR is the boxplot: the actual rectangular figure represents IQR.

Q 20. Find and visualize the correlation matrix. Write your observations from the plot. (3 Marks)

# remove _____ & write the appropriate function name and run the code

corr_matrix = pima.iloc[:,0:8].corr()

corr_matrix
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age
Pregnancies 1.000000 0.128022 0.208987 0.009393 -0.018780 0.021546 -0.033523 0.544341
Glucose 0.128022 1.000000 0.219765 0.158060 0.396137 0.231464 0.137158 0.266673
BloodPressure 0.208987 0.219765 1.000000 0.130403 0.010492 0.281222 0.000471 0.326791
SkinThickness 0.009393 0.158060 0.130403 1.000000 0.245410 0.532552 0.157196 0.020582
Insulin -0.018780 0.396137 0.010492 0.245410 1.000000 0.189919 0.158243 0.037676
BMI 0.021546 0.231464 0.281222 0.532552 0.189919 1.000000 0.153508 0.025748
DiabetesPedigreeFunction -0.033523 0.137158 0.000471 0.157196 0.158243 0.153508 1.000000 0.033561
Age 0.544341 0.266673 0.326791 0.020582 0.037676 0.025748 0.033561 1.000000
# remove _____ & write the appropriate function name

plt.figure(figsize=(8,8))
sns.heatmap(corr_matrix, annot = True)

# display the plot
plt.show()

png

Write your Answer here:

Ans 20: In this heatmap it is evident how the skin thickness is enough highly correlated with the BMI. With the same order of correlation, pregnancies are positively correlated with age. This is obvious because a woman can have more babies if she gets older. The blood pressure is also correlated positively with age and BMI. As seen in the answer 15, glucose has very low positive correlation with diabetes pedigree function and skin thickness, but it has a good level of positive correlation with insuline. I think this value could have been even more positive if it wasn’t for the odd phenomenon I describe in answer 16. It is also evident the positive correlation of glucose with age and BMI. In my opinion all the other correlations are not large enough to be significant, or maybe that insignificance is still a piece of information to be verified with further analysis and future studies.