Visualization and Statistics - Diabetes Analysis

Context

Diabetes is one of the most frequent diseases worldwide and the number of diabetic patients are growing over the years. The main cause of diabetes remains unknown, yet scientists believe that both genetic factors and environmental lifestyle play a major role in diabetes.

A few years ago research was done on a tribe in America which is called the Pima tribe (also known as the Pima Indians). In this tribe, it was found that the ladies are prone to diabetes very early. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients were females at least 21 years old of Pima Indian heritage.

Objective

Here, we are analyzing different aspects of Diabetes in the Pima Indians tribe by doing Exploratory Data Analysis.

Data Dictionary

The dataset has the following information:

Pregnancies: Number of times pregnant
Glucose: Plasma glucose concentration over 2 hours in an oral glucose tolerance test
BloodPressure: Diastolic blood pressure (mm Hg)
SkinThickness: Triceps skin fold thickness (mm)
Insulin: 2-Hour serum insulin (mu U/ml)
BMI: Body mass index (weight in kg/(height in m)^2)
DiabetesPedigreeFunction: A function that scores the likelihood of diabetes based on family history.
Age: Age in years
Outcome: Class variable (0: a person is not diabetic or 1: a person is diabetic)

Q 1: Import the necessary libraries and briefly explain the use of each library (3 Marks)

# remove _____ & write the appropriate library name

import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

Write your Answer here:

Ans 1: Numpy is a library used for mathematical operations with arrays and matrices. Pandas is a library used to extract, manipulate and clean dataframes (datasets). Matplotlib is MATLAB-based library used to visualize data. Seaborn is also a library used for visualization with more elaborated tools compare to Matplotlib.

Q 2: Read the given dataset (1 Mark)

# remove _____ & write the appropriate function name

pima = pd.read_csv("diabetes.csv")

Q3. Show the last 10 records of the dataset. How many columns are there? (1 Mark)

# remove ______ and write the appropriate number in the function

pima.tail(10)

	Pregnancies	Glucose	BloodPressure	SkinThickness	Insulin	BMI	DiabetesPedigreeFunction	Age	Outcome
758	1	106	76	20	79	37.5	0.197	26	0
759	6	190	92	20	79	35.5	0.278	66	1
760	2	88	58	26	16	28.4	0.766	22	0
761	9	170	74	31	79	44.0	0.403	43	1
762	9	89	62	20	79	22.5	0.142	33	0
763	10	101	76	48	180	32.9	0.171	63	0
764	2	122	70	27	79	36.8	0.340	27	0
765	5	121	72	23	112	26.2	0.245	30	0
766	1	126	60	20	79	30.1	0.349	47	1
767	1	93	70	31	79	30.4	0.315	23	0

Write your Answer here:

Ans 3: 9 columns

Q4. Show the first 10 records of the dataset (1 Mark)

# remove _____ & write the appropriate function name and the number of rows to get in the output

pima.head(10)

	Pregnancies	Glucose	BloodPressure	SkinThickness	Insulin	BMI	DiabetesPedigreeFunction	Age	Outcome
0	6	148	72	35	79	33.600000	0.627	50	1
1	1	85	66	29	79	26.600000	0.351	31	0
2	8	183	64	20	79	23.300000	0.672	32	1
3	1	89	66	23	94	28.100000	0.167	21	0
4	0	137	40	35	168	43.100000	2.288	33	1
5	5	116	74	20	79	25.600000	0.201	30	0
6	3	78	50	32	88	31.000000	0.248	26	1
7	10	115	69	20	79	35.300000	0.134	29	0
8	2	197	70	45	543	30.500000	0.158	53	1
9	8	125	96	20	79	31.992578	0.232	54	1

Q5. What do you understand by the dimension of the dataset? Find the dimension of the `pima` dataframe. (1 Mark)

# remove _____ & write the appropriate function name

pima.shape

(768, 9)

Write your Answer here:

Ans 5: The dataframe has 768 rows and 9 columns.

Q6. What do you understand by the size of the dataset? Find the size of the `pima` dataframe. (1 Mark)

# remove _____ & write the appropriate function name

pima.size

Write your Answer here:

Ans 6: 768 rows x 9 columns = 6912 dataframe size.

Q7. What are the data types of all the variables in the data set? (2 Marks)

Hint: Use the info() function to get all the information about the dataset.

# remove _____ & write the appropriate function name

pima.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB

Write your Answer here:

Ans 7: All variables are numerical. Two are type float (BMI and DiabetesPedigreeFunction), the others are integer type.

Q8. What do we mean by missing values? Are there any missing values in the `pima` dataframe? (2 Marks)

# remove _____ & write the appropriate function name

pima.isnull().values.any()

False

Write your Answer here:

Ans 8: By missing values we mean that in the dataframe there are cells that do not contain any piece of information. With the above command we check whether pima dataframe has missing values. isnull() is a method that for each cell returns True when the cell is empty. The attribute values returns an array with all the Boolean values. With the method any() we are able to show that there are not missing values in pima because otherwise it would have returned True is any item was True.

Q9. What do the summary statistics of the data represent? Find the summary statistics for all variables except ‘Outcome’ in the `pima` data. Take one column/variable from the output table and explain all its statistical measures. (3 Marks)

# remove _____ & write the appropriate function name

pima.iloc[:,0:8].describe()

	Pregnancies	Glucose	BloodPressure	SkinThickness	Insulin	BMI	DiabetesPedigreeFunction	Age
count	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000
mean	3.845052	121.675781	72.250000	26.447917	118.270833	32.450805	0.471876	33.240885
std	3.369578	30.436252	12.117203	9.733872	93.243829	6.875374	0.331329	11.760232
min	0.000000	44.000000	24.000000	7.000000	14.000000	18.200000	0.078000	21.000000
25%	1.000000	99.750000	64.000000	20.000000	79.000000	27.500000	0.243750	24.000000
50%	3.000000	117.000000	72.000000	23.000000	79.000000	32.000000	0.372500	29.000000
75%	6.000000	140.250000	80.000000	32.000000	127.250000	36.600000	0.626250	41.000000
max	17.000000	199.000000	122.000000	99.000000	846.000000	67.100000	2.420000	81.000000

Write your Answer here:

Ans 9: Summary statistics of the data represent descriptive statistics on how the data is distributed across the sample: the mean, standard deviation, minimum, maximum, and quartiles (including median). For example, for the variable Age we have 768 values. The mean is around 33 years old and the standard deviation can be approximated by 12. The minimum age is 21 years old while the maximum age is 81. 25% statistic stands for first quartile and it describes how 25% of age values are between 21 years old and 24 years old. 50% statistic is the median while 75% statistic is the third quartile. That means that IQR (interquartile range) that includes 50% of the age values is between 41 years old and 24 years old (17 years of range). Because the mean is greater than the median, the data distibution is right skewed.

Q 10. Plot the distribution plot for the variable ‘BloodPressure’. Write detailed observations from the plot. (2 Marks)

# remove _____ & write the appropriate library name

sns.displot(pima['BloodPressure'], kind='kde')
plt.show()

png

Write your Answer here:

Ans 10: The Blood Pressure distribution is almost symmetrical around the value 70. It tends to a bell-shape form, so the curve is almost a Gaussian curve. Probably the mode is very similar to the mean and median.

Q 11. What is the ‘BMI’ of the person having the highest ‘Glucose’? (1 Mark)

# remove _____ & write the appropriate function name

pima[pima['Glucose']==pima['Glucose'].max()]['BMI']

661    42.9
Name: BMI, dtype: float64

Write your Answer here:

Ans 11: The person with the highest glucose has a Body Mass Index (BMI) of 42.9. According to cdc.gov that person is obese.

Q12.

12.1 What is the mean of the variable ‘BMI’?

12.2 What is the median of the variable ‘BMI’?

12.3 What is the mode of the variable ‘BMI’?

12.4 Are the three measures of central tendency equal?

(3 Marks)

# remove _____ & write the appropriate function name

m1 = pima['BMI'].mean()  # mean
print(m1)
m2 = pima['BMI'].median()  # median
print(m2)
m3 = pima['BMI'].mode()[0]  # mode
print(m3)

45080515543617
0
0

Write your Answer here:

Ans 12: The BMI mean is around 32. The median and the mode are also 32. That means that the three measures of central tendency are equal suggesting that the BMI distribution is Gaussian.

Q13. How many women’s ‘Glucose’ levels are above the mean level of ‘Glucose’? (1 Mark)

# remove _____ & write the appropriate function name

pima[pima['Glucose']>pima['Glucose'].mean()].shape[0]

Write your Answer here:

Ans 13: In the dataframe 343 Pima women have a glucose level above the mean.

Q14. How many women have their ‘BloodPressure’ equal to the median of ‘BloodPressure’ and their ‘BMI’ less than the median of ‘BMI’? (2 Marks)

# remove _____ & write the appropriate column name

pima[(pima['BloodPressure']==pima['BloodPressure'].median()) & (pima['BMI']<pima['BMI'].median())].shape[0]

Write your Answer here:

Ans 14: In the dataframe 22 Pima women have a blood pressure equal to the median value and the BMI smaller than the BMI median.

Q15. Create a pairplot for the variables ‘Glucose’, ‘SkinThickness’, and ‘DiabetesPedigreeFunction’. Write your observations from the plot. (4 Marks)

# remove _____ & write the appropriate function name

sns.pairplot(data=pima,vars=['Glucose', 'SkinThickness', 'DiabetesPedigreeFunction'], hue='Outcome')
plt.show()

png

Write your Answer here:

Ans 15: There are several observations to make. First of all, the glucose distribution of diabetic Pima women can be distinguished from the non-diabetic women distribution. It is harder to distinguish the outcome variable from the skin thickness and diabetes pedigree function distributions. At first sight, the glucose does not seem to have a high correlation with the skin thickness. There is a high value on the top right of the graph but it could be an outlier. The glucose level has a more positive correlation with the diabetes pedigree function but probably not extremely high. Anyway, both the glucose vs skin thickness and glucose vs pedigree function shows distinguishable sub-groups based on the outcome variable. Last, the skinthickness does not seem to have a great correlation with the diabetes pedigree function and the graph does not differentiate very well the outcome variable.

Q16. Plot the scatterplot between ‘Glucose’ and ‘Insulin’. Write your observations from the plot. (2 Marks)

# remove _____ & write the appropriate function name

sns.scatterplot(x='Glucose',y='Insulin',data=pima)
plt.show()

png

Write your Answer here:

Ans 16: From this scatterplot we can see how the glucose is positively correlated with the insulin level. However, there is an odd situation when the insulin value is around 100 where it looks like many glucose values correspond to that insuline value. This should be checked more carefully before to end up with conclusions.

Q 17. Plot the boxplot for the ‘Age’ variable. Are there outliers? (2 Marks)

# remove _____ & write the appropriate function and column name 

plt.boxplot(pima['Age'])

plt.title('Boxplot of Age')
plt.ylabel('Age')
plt.show()

png

Write your Answer here:

Ans 17: Yes, there are outliers and they are all above the value that represents 1.5 times the interquartile range.

Q18. Plot histograms for the ‘Age’ variable to understand the number of women in different age groups given whether they have diabetes or not. Explain both histograms and compare them. (3 Marks)

# remove _____ & write the appropriate function and column name

plt.hist(pima[pima['Outcome']==1]['Age'], bins = 5)
plt.title('Distribution of Age for Women who has Diabetes')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

png

# remove _____ & write the appropriate function and column name

plt.hist(pima[pima['Outcome']==0]['Age'], bins = 5)
plt.title('Distribution of Age for Women who do not have Diabetes')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

png

Write your Answer here:

Ans 18: Both histograms seem to suggest that the age population of non-diabetic Pima women and diabetic Pima women follow an exponential distribution. However, the non-diabetic women sample is more distributed across younger ages compared to the diabetic histogram. In fact, the most common age for non-diabetic women lies in the younger range (around 20-30 years old) while the diabetic women age is more spread out but it keeps a pick for young adulthood. This confirms that Pima women tend to be diagnosed with diabetes early. Also, the non-diabetic young women might have diabetes when they are older as the first histogram describes but it remains more common to be diagnosed at younger age than older age.

Q 19. What is the Interquartile Range of all the variables? Why is this used? Which plot visualizes the same? (2 Marks)

# remove _____ & write the appropriate variable name

Q1 = pima.quantile(0.25)
Q3 = pima.quantile(0.75)
IQR = Q3 - Q1
print(IQR)

Pregnancies                  5.0000
Glucose                     40.5000
BloodPressure               16.0000
SkinThickness               12.0000
Insulin                     48.2500
BMI                          9.1000
DiabetesPedigreeFunction     0.3825
Age                         17.0000
Outcome                      1.0000
dtype: float64

Write your Answer here:

Ans 19: The Interquartile Range (IQR) is a descriptive statistic for variability. It represents the interval where 50% of the values are and it includes the median. In other words, from the first quartile to the third quartile. It can be used to detect outliers: a common practice is to check whether the values are within 1.5 times the Interquartile range, otherwise they can be detected as outliers. The plot to visualize IQR is the boxplot: the actual rectangular figure represents IQR.

Q 20. Find and visualize the correlation matrix. Write your observations from the plot. (3 Marks)

# remove _____ & write the appropriate function name and run the code

corr_matrix = pima.iloc[:,0:8].corr()

corr_matrix

	Pregnancies	Glucose	BloodPressure	SkinThickness	Insulin	BMI	DiabetesPedigreeFunction	Age
Pregnancies	1.000000	0.128022	0.208987	0.009393	-0.018780	0.021546	-0.033523	0.544341
Glucose	0.128022	1.000000	0.219765	0.158060	0.396137	0.231464	0.137158	0.266673
BloodPressure	0.208987	0.219765	1.000000	0.130403	0.010492	0.281222	0.000471	0.326791
SkinThickness	0.009393	0.158060	0.130403	1.000000	0.245410	0.532552	0.157196	0.020582
Insulin	-0.018780	0.396137	0.010492	0.245410	1.000000	0.189919	0.158243	0.037676
BMI	0.021546	0.231464	0.281222	0.532552	0.189919	1.000000	0.153508	0.025748
DiabetesPedigreeFunction	-0.033523	0.137158	0.000471	0.157196	0.158243	0.153508	1.000000	0.033561
Age	0.544341	0.266673	0.326791	0.020582	0.037676	0.025748	0.033561	1.000000

# remove _____ & write the appropriate function name

plt.figure(figsize=(8,8))
sns.heatmap(corr_matrix, annot = True)

# display the plot
plt.show()

png