Heart Disease Prediction in Python

The goal is to predict whether a person has ‘Heart Disease’ or ‘No Heart Disease’. This is an example of Supervised Machine Learning, because the outcome is already known for every record in the training data.

It is a Classification Problem, because we have to classify the outcome into 2 classes:

1 (One): having Heart Disease, and

0 (Zero): not having Heart Disease.

Where to get the Dataset

The Heart Disease data set is available in the UCI repository and can also be downloaded from Kaggle. The link is provided below:

https://www.kaggle.com/priyanka841/heart-disease-prediction-uci

DataSet Description

There are 14 features (columns), including the target. The data set includes features like:

slope_of_peak_exercise_st_segment (type: int): the slope of the peak exercise ST segment, an electrocardiography read out indicating quality of blood flow to the heart
thal (type: categorical): results of thallium stress test measuring blood flow to the heart, with possible values normal, fixed_defect, reversible_defect
resting_blood_pressure (type: int): resting blood pressure
chest_pain_type (type: int): chest pain type (4 values)
num_major_vessels (type: int): number of major vessels (0-3) colored by fluoroscopy
fasting_blood_sugar_gt_120_mg_per_dl (type: binary): fasting blood sugar > 120 mg/dl
resting_ekg_results (type: int): resting electrocardiographic results (values 0,1,2)
serum_cholesterol_mg_per_dl (type: int): serum cholesterol in mg/dl
oldpeak_eq_st_depression (type: float): oldpeak = ST depression induced by exercise relative to rest, a measure of abnormality in electrocardiograms
sex (type: binary): 0: female, 1: male
age (type: int): age in years
max_heart_rate_achieved (type: int): maximum heart rate achieved (beats per minute)
exercise_induced_angina (type: binary): exercise-induced chest pain (0: False, 1: True)

Let’s Now Start the Data Loading

We use Pandas for reading files.
df.head() shows the top 5 rows of the data set by default.
Try df.tail() to see the last 5 rows of the data set.
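Here is a minimal sketch of the loading step. With the downloaded file you would call pd.read_csv on it directly; the file name "heart.csv" and the column name target are my assumptions, and the rows below are made-up placeholders (not real patient data) so that the snippet runs on its own.

```python
import io

import pandas as pd

# ASSUMPTION: with the real file you would write pd.read_csv("heart.csv");
# that file name is hypothetical. The rows below are made-up placeholders,
# not real patient data, and "target" is an assumed name for the label column.
csv_text = """age,sex,resting_blood_pressure,serum_cholesterol_mg_per_dl,max_heart_rate_achieved,target
45,1,128,308,170,0
54,0,110,214,158,0
77,1,125,304,162,1
40,1,152,223,181,1
59,1,178,270,145,0
42,1,130,180,150,0
60,0,150,258,157,1
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.head())   # first 5 rows (the default)
print(df.tail())   # last 5 rows
print(df.shape)    # (number of rows, number of columns)
```

Both head() and tail() accept a number, so df.head(10) would show the first 10 rows instead.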

Next we check for null values and find that there are none in the dataset.

To get the complete information about the null values and the data types of the features, we can print a summary of the DataFrame.

We find that we do not have null values, so we do not need to fill in any values. The data type is numeric for all the features, which means that we do not have to convert the data into dummy variables or apply one-hot encoding before applying any algorithm.
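The checks described above can be sketched roughly as follows. The small DataFrame is a made-up stand-in for the real data, and the variable names null_counts and all_numeric are only illustrative.

```python
import pandas as pd

# Made-up placeholder rows standing in for the real data set.
df = pd.DataFrame({
    "age": [45, 54, 77, 40],
    "serum_cholesterol_mg_per_dl": [308, 214, 304, 223],
    "oldpeak_eq_st_depression": [0.0, 1.6, 0.0, 0.0],
    "target": [0, 0, 1, 1],
})

# Number of missing values in each column
null_counts = df.isnull().sum()
print(null_counts)

# Row count, non-null count and data type of every column
df.info()

# True when every column already holds numbers, so no encoding is needed
all_numeric = all(pd.api.types.is_numeric_dtype(t) for t in df.dtypes)
print(all_numeric)
```

If null_counts showed a non-zero value for some column, that column would need to be filled or dropped before training.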

You can access the Jupyter Notebook here:

file:///C:/Users/HP/Downloads/Heart%20Disease.html

Well, let’s move on to Data Visualization and Analysis

To visualize the relationship between different features and spot any linear relation between them, we use PAIRPLOTS. I will give the link to the code and also the video explaining the data set at the end.
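As a numeric companion to the pair plot, the correlation matrix puts a number on each linear relation. This is only a sketch on synthetic data: I built the heart rate column to fall with age on purpose, so the values say nothing about the real data set.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in data: max heart rate is constructed to fall with age,
# while cholesterol is generated independently of both.
age = rng.uniform(30, 75, size=200)
max_hr = 200 - 0.8 * age + rng.normal(0, 5, size=200)
chol = rng.normal(245, 50, size=200)

df = pd.DataFrame({
    "age": age,
    "max_heart_rate_achieved": max_hr,
    "serum_cholesterol_mg_per_dl": chol,
})

# Pearson correlation between every pair of columns (values from -1 to 1)
corr = df.corr()
print(corr.round(2))

# To draw the pair plot itself (needs seaborn and matplotlib installed):
# import seaborn as sns; sns.pairplot(df)
```

A value close to 1 or -1 points to a strong linear relation, which would show up in the pair plot as points lying along a line.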

With Histograms we can see the shape of each feature’s distribution; they show the number of observations in each bin.
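To see what a histogram actually counts, here is a small sketch that computes the bins and their counts with NumPy. The ages are synthetic, and 10 bins is just an assumed choice.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic ages, only for illustration
ages = rng.normal(54, 9, size=300)

# counts[i] is the number of observations between edges[i] and edges[i + 1]
counts, edges = np.histogram(ages, bins=10)

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f}: {n}")
```

On a pandas DataFrame, df.hist() draws one such histogram per numeric column (it needs matplotlib).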

Box and Whisker plots are useful for finding outliers in our data. If we have many outliers, we will have to remove or fix them; otherwise they act as noise in the training data.
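A box plot usually marks points as outliers when they lie more than 1.5 times the interquartile range (IQR) beyond the box. The same rule can be sketched in a few lines; the cholesterol values are made up, and dropping the flagged rows is just one possible way to handle them.

```python
import pandas as pd

# Made-up cholesterol values with one extreme reading
chol = pd.Series([200, 210, 220, 230, 240, 250, 260, 270, 564])

q1 = chol.quantile(0.25)          # lower edge of the box
q3 = chol.quantile(0.75)          # upper edge of the box
iqr = q3 - q1                     # height of the box

# Anything beyond these limits is drawn as an outlier point
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = chol[(chol < lower) | (chol > upper)]
cleaned = chol[(chol >= lower) & (chol <= upper)]

print(outliers.tolist())
print(len(cleaned))
```

Calling df.boxplot() on a DataFrame draws the plots themselves (it needs matplotlib).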

Let’s Visualize the features and their relation with the target (Heart Disease or No Heart Disease)

We do scaling to bring all the values to the same magnitude. Scaling, or Standardization, brings the mean of each feature to zero and its standard deviation to one. It works best when the features are roughly Gaussian. We perform Scaling so that features with large numeric ranges do not dominate the model and bias the predictions.
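A minimal sketch of standardization, assuming scikit-learn's StandardScaler is used (the values below are made-up placeholders):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up placeholder values on very different scales
df = pd.DataFrame({
    "age": [45, 54, 77, 40, 59],
    "serum_cholesterol_mg_per_dl": [308, 214, 304, 223, 270],
})

# For each column: subtract its mean, then divide by its standard deviation
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

print(df)                # unscaled
print(scaled.round(2))   # scaled

# Each scaled column now has mean 0 and (population) standard deviation 1
print(scaled.mean().round(6))
print(scaled.std(ddof=0).round(6))
```

Note that StandardScaler divides by the population standard deviation, which is why ddof=0 is passed when checking the result with pandas.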

Comparison Between Unscaled and Scaled DataFrame

After scaling, the data now looks like this

Compare the above two: the first one is without scaling and the second one is after scaling.

Next let us prepare our data for Training

We apply the Logistic Regression Algorithm and find the accuracy, precision and recall of the model.
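The split-and-train step might look like the sketch below. Everything specific here is an assumption on my part: the data is synthetic (generated with make_classification to have 13 features like the real data set), and the 80/20 split, the random_state and max_iter=1000 are illustrative choices, so the scores will not match the ones reported in this post.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 13 features (X) and the 0/1 target (y)
X, y = make_classification(
    n_samples=300, n_features=13, n_informative=5, random_state=42
)

# Hold out 20% of the rows for testing, keeping the class balance similar
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The scaler is fitted on the training rows only, then applied to the test rows
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(accuracy)
print(classification_report(y_test, y_pred))   # precision and recall per class
```

Putting the scaler inside the pipeline keeps information from the test rows out of the scaling step.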

Precision is the fraction of cases predicted as heart disease that actually were heart disease.

Recall, on the other hand, measures the fraction of true cases of Heart Disease that were detected. It takes into account the cases that were incorrectly rejected by the algorithm (false negatives). We observe that the recall for ‘1’, i.e. having heart disease, is higher, which means that the algorithm is incorrectly rejecting only a few cases.

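The formulas are given by:

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

where TP, FP and FN are the numbers of true positives, false positives and false negatives.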
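To make precision and recall concrete, here is a small worked sketch that computes both by hand from a confusion matrix and checks them against scikit-learn. The labels are made up for illustration; precision is true positives over all predicted positives, and recall is true positives over all actual positives.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: 1 = heart disease, 0 = no heart disease
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 1, 1, 0, 1, 0]

# For binary labels, ravel() returns the counts in this order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # of the predicted positives, how many were right
recall = tp / (tp + fn)      # of the actual positives, how many were found

print(tn, fp, fn, tp)
print(round(precision, 3), round(recall, 3))

# The library functions give the same values
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred))
```

Here one real case is missed (a false negative) and two healthy people are flagged (false positives), so recall comes out higher than precision.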

I have performed a similar test with various algorithms and found that Logistic Regression, followed by Naive Bayes, gives good accuracy: 92% and 90% respectively.
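A comparison like this can be sketched as a simple loop over models. This runs on synthetic data, and I am assuming GaussianNB as the Naive Bayes variant, so the numbers it prints will not reproduce the 92% and 90% above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the real features and target
X, y = make_classification(
    n_samples=300, n_features=13, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# ASSUMPTION: GaussianNB is used here as the Naive Bayes variant
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)   # accuracy on the test rows

for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

Using the same train/test split for every model keeps the comparison fair.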
