Predicting Heart Disease with Machine Learning - Part 1
About 20% of deaths in the U.S. are caused by some form of heart disease. Early detection and intervention can significantly improve these outcomes. Late in 2022, my father had an emergency quadruple bypass that certainly saved his life. He was extremely fortunate that he decided on his own to get a stress test. That is not an option for many Americans, so I decided to use machine learning to build and test a variety of models that help identify the people most at risk based on some straightforward information.
The data used in this initial attempt comes from the CDC's 2015 Behavioral Risk Factor Surveillance System (BRFSS) survey. I used a version of this data from Kaggle that was cleaned by Alex Teboul. The dataset can be found here.
Project Goals:
Perform exploratory data analysis (EDA) on the data sourced from Kaggle.
Create various machine learning models and check their performance.
Create some nice data visualizations along the way.
Explain my process in detail.
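As a rough illustration of the "check their performance" goal, here is a minimal sketch of computing precision and recall from true and predicted labels. The helper name precision_recall and the toy labels are made up for this example and are not taken from the notebook. For a screening task like this one, recall on the positive class (people with heart disease who are correctly flagged) is often more informative than accuracy alone.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the positive class.

    Hypothetical helper for illustration; a library routine could be
    used instead.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy labels, made up for the example: 1 = heart disease or attack reported.
y_true = [1, 0, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 0, 1, 1, 0]
precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Reporting both numbers makes it harder for a model that simply predicts "no heart disease" for everyone to look good.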
The Data:
The data was posted to Kaggle by Alex Teboul and listed as '253,680 survey responses from cleaned BRFSS 2015 - binary classification'.
While this data is already cleaned, I am interested in pulling raw data from the CDC for more recent years, both to clean it myself and to have additional data for training and testing the models.
Because this is a voluntary survey, we will need to be particularly aware of biases in the data. The data has been cleaned of null values, so it will be important to check the distributions of the values.
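A minimal sketch of that distribution check with pandas, assuming the data has already been loaded into a DataFrame. The helper name value_shares and the tiny frame below are stand-ins made up for illustration; only the column names HeartDiseaseorAttack and Sex come from the dataset description.

```python
import pandas as pd


def value_shares(df, column):
    """Return the share of rows taking each value in `column`, sorted by value."""
    return df[column].value_counts(normalize=True).sort_index()


# Made-up rows for illustration only; the real data would come from the Kaggle CSV.
toy = pd.DataFrame(
    {
        "HeartDiseaseorAttack": [0, 0, 0, 1, 0, 0, 0, 1, 0, 0],
        "Sex": [0, 1, 0, 1, 1, 0, 0, 1, 0, 0],
    }
)

target_shares = value_shares(toy, "HeartDiseaseorAttack")
print(target_shares)

# Confirm the cleaned data really has no nulls before trusting the shares.
null_count = int(toy.isna().sum().sum())
print("null values:", null_count)
```

Running the same helper on demographic columns such as Sex, Age, or Income and comparing the shares against published population figures is one way to spot where the survey over- or under-represents a group.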
Description of indicators/features from the survey:
Response Variable / Dependent Variable:
Respondents that have ever reported having coronary heart disease (CHD) or myocardial infarction (MI) --> HeartDiseaseorAttack
Independent Variables:
High Blood Pressure
Adults who have been told they have high blood pressure by a doctor, nurse, or other health professional --> HighBP
High Cholesterol
Have you EVER been told by a doctor, nurse or other health professional that your blood cholesterol is high? --> HighChol
Cholesterol check within past five years --> CholCheck
BMI
Body Mass Index (BMI) --> BMI
Smoking
Have you smoked at least 100 cigarettes in your entire life? [Note: 5 packs = 100 cigarettes] --> Smoker
Other Chronic Health Conditions
(Ever told) you had a stroke. --> Stroke
(Ever told) you have diabetes (If "Yes" and respondent is female, ask "Was this only when you were pregnant?". If Respondent says pre-diabetes or borderline diabetes, use response code 4.) --> Diabetes
Physical Activity
Adults who reported doing physical activity or exercise during the past 30 days other than their regular job --> PhysActivity
Diet
Consume Fruit 1 or more times per day --> Fruits
Consume Vegetables 1 or more times per day --> Veggies
Alcohol Consumption
Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week) --> HvyAlcoholConsump
Health Care
Do you have any kind of health care coverage, including health insurance, prepaid plans such as HMOs, or government plans such as Medicare, or Indian Health Service? --> AnyHealthcare
Was there a time in the past 12 months when you needed to see a doctor but could not because of cost? --> NoDocbcCost
Health General and Mental Health
Would you say that in general your health is: --> GenHlth
Now thinking about your mental health, which includes stress, depression, and problems with emotions, for how many days during the past 30 days was your mental health not good? --> MentHlth
Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was your physical health not good? --> PhysHlth
Do you have serious difficulty walking or climbing stairs? --> DiffWalk
Demographics
Indicate sex of respondent. --> Sex
Fourteen-level age category --> Age
What is the highest grade or year of school you completed? --> Education
Is your annual household income from all sources: (If respondent refuses at any income level, code "Refused.") --> Income
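With the response variable and the independent variables listed above, a natural first modeling step is to separate the two and hold out a test set. The sketch below is one way this could look, not necessarily what the notebook does: the function name stratified_split, the 25% test fraction, and the toy rows are my own assumptions. It samples test rows within each class of HeartDiseaseorAttack so that both splits keep the same share of positive cases, and it assumes the DataFrame index is unique.

```python
import pandas as pd

TARGET = "HeartDiseaseorAttack"


def stratified_split(df, target=TARGET, test_frac=0.25, seed=0):
    """Split df into (train, test), sampling test rows within each target class.

    Assumes df has a unique index so sampled rows can be dropped from train.
    """
    test = df.groupby(target).sample(frac=test_frac, random_state=seed)
    train = df.drop(test.index)
    return train, test


# Made-up rows for illustration only: 8 negatives and 4 positives.
toy = pd.DataFrame(
    {
        TARGET: [0] * 8 + [1] * 4,
        "HighBP": [0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1],
        "BMI": [22, 31, 25, 27, 35, 24, 29, 21, 33, 30, 26, 38],
    }
)

train, test = stratified_split(toy)

# Separate the independent variables from the response variable.
X_train, y_train = train.drop(columns=[TARGET]), train[TARGET]
X_test, y_test = test.drop(columns=[TARGET]), test[TARGET]
print(len(train), len(test), y_test.mean())
```

scikit-learn's train_test_split with its stratify argument is a common alternative that does the same job in one call.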
Below I have included a render of the Jupyter Notebook where I did my initial training. The full repo is available on GitHub.