Predicting Heart Disease with Machine Learning - Part 1

About 20% of deaths in the U.S. are caused by some form of heart disease. Early detection and intervention can significantly improve these outcomes. Late in 2022, my father had an emergency quadruple bypass that certainly saved his life. He was extremely fortunate that he decided on his own to get a stress test. That is not an option for many Americans. With that in mind, I decided to use machine learning to build and test a variety of models that help identify the people most at risk, based on some straightforward information.

The data used in this initial attempt was collected by the CDC in a 2015 survey. I used a version of the data from Kaggle that was cleaned by Alex Teboul. The dataset can be found here.

Project Goals:

  • Perform EDA on the data sourced from Kaggle.

  • Create various machine learning models and check their performance.

  • Create some nice data visualizations along the way.

  • Explain my process in detail.

The Data:

  • Data was posted to Kaggle by Alex Teboul and listed as '253,680 survey responses from cleaned BRFSS 2015 - binary classification'

  • While this data is already cleaned, I am interested in pulling raw data from the CDC for more recent years, both to clean it myself and to have additional data to train and test the models on.

  • Because this is a voluntary survey, we will need to be particularly aware of biases in the data. The data has been cleaned of null values, so it will be important to check the distributions of the values that remain.
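
The distribution check mentioned above can be sketched with the standard library alone. This is only an illustration, not the code from the notebook: the helper name `value_shares` and the tiny inline sample are assumptions made up for this example, and the sample is not real survey data. With the real file, the rows would come from `csv.DictReader` opened on the downloaded CSV instead.

```python
import csv
import io
from collections import Counter


def value_shares(rows, column):
    """Return {value: share of rows} for one column of an iterable of dict rows."""
    counts = Counter(row[column] for row in rows)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {value: count / total for value, count in counts.items()}


# Tiny made-up sample shaped like the Kaggle CSV (NOT real survey data).
SAMPLE_CSV = """HeartDiseaseorAttack,HighBP,Sex
0.0,1.0,0.0
0.0,0.0,1.0
1.0,1.0,1.0
0.0,0.0,0.0
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
target_shares = value_shares(rows, "HeartDiseaseorAttack")
for value, share in sorted(target_shares.items()):
    print(f"HeartDiseaseorAttack={value}: {share:.0%} of rows")
```

Running the same helper on the response variable and on demographic columns such as `Sex` or `Age` gives a quick first look at class imbalance and at which groups are over- or under-represented.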

2a. Description of indicators/features from the survey:

  • Response Variable / Dependent Variable:

    • Respondents that have ever reported having coronary heart disease (CHD) or myocardial infarction (MI) --> HeartDiseaseorAttack

  • Independent Variables:

    • High Blood Pressure

      • Adults who have been told they have high blood pressure by a doctor, nurse, or other health professional --> HighBP

    • High Cholesterol

      • Have you EVER been told by a doctor, nurse or other health professional that your blood cholesterol is high? --> HighChol

      • Cholesterol check within past five years --> CholCheck

    • BMI

      • Body Mass Index (BMI) --> BMI

    • Smoking

      • Have you smoked at least 100 cigarettes in your entire life? [Note: 5 packs = 100 cigarettes] --> Smoker

    • Other Chronic Health Conditions

      • (Ever told) you had a stroke. --> Stroke

      • (Ever told) you have diabetes (If "Yes" and respondent is female, ask "Was this only when you were pregnant?". If Respondent says pre-diabetes or borderline diabetes, use response code 4.) --> Diabetes

    • Physical Activity

      • Adults who reported doing physical activity or exercise during the past 30 days other than their regular job --> PhysActivity

    • Diet

      • Consume Fruit 1 or more times per day --> Fruits

      • Consume Vegetables 1 or more times per day --> Veggies

    • Alcohol Consumption

      • Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week) --> HvyAlcoholConsump

    • Health Care

      • Do you have any kind of health care coverage, including health insurance, prepaid plans such as HMOs, or government plans such as Medicare, or Indian Health Service? --> AnyHealthcare

      • Was there a time in the past 12 months when you needed to see a doctor but could not because of cost? --> NoDocbcCost

    • Health General and Mental Health

      • Would you say that in general your health is: --> GenHlth

      • Now thinking about your mental health, which includes stress, depression, and problems with emotions, for how many days during the past 30 days was your mental health not good? --> MentHlth

      • Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was your physical health not good? --> PhysHlth

      • Do you have serious difficulty walking or climbing stairs? --> DiffWalk

    • Demographics

      • Indicate sex of respondent. --> Sex

      • Fourteen-level age category --> Age

      • What is the highest grade or year of school you completed? --> Education

      • Is your annual household income from all sources: (If respondent refuses at any income level, code "Refused.") --> Income
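
The question wording above suggests simple sanity checks for some of the columns. The sketch below makes two assumptions that are not stated in the survey text: yes/no answers are stored as 0/1, and day counts fall between 0 and 30. The function name `find_out_of_range`, the column groupings, and the sample rows are all hypothetical and only cover a few of the listed columns.

```python
# Assumed encodings (not confirmed by the survey text):
# yes/no answers stored as 0/1, "days in the past 30 days" stored as 0-30.
BINARY_COLUMNS = ["HeartDiseaseorAttack", "HighBP", "HighChol", "Smoker", "Stroke", "DiffWalk"]
DAY_COUNT_COLUMNS = ["MentHlth"]


def find_out_of_range(rows):
    """Return (row_index, column, raw_value) for every value outside the assumed ranges."""
    problems = []
    for index, row in enumerate(rows):
        for column in BINARY_COLUMNS:
            if float(row[column]) not in (0.0, 1.0):
                problems.append((index, column, row[column]))
        for column in DAY_COUNT_COLUMNS:
            if not 0 <= float(row[column]) <= 30:
                problems.append((index, column, row[column]))
    return problems


# Made-up rows for illustration only; the second row is deliberately invalid.
sample_rows = [
    {"HeartDiseaseorAttack": "0.0", "HighBP": "1.0", "HighChol": "1.0",
     "Smoker": "0.0", "Stroke": "0.0", "DiffWalk": "0.0", "MentHlth": "3.0"},
    {"HeartDiseaseorAttack": "1.0", "HighBP": "0.0", "HighChol": "1.0",
     "Smoker": "2.0", "Stroke": "0.0", "DiffWalk": "1.0", "MentHlth": "45.0"},
]

problems = find_out_of_range(sample_rows)
for index, column, value in problems:
    print(f"row {index}: unexpected {column} value {value}")
```

An empty result on the real data would be consistent with the assumed encodings; any hits would be worth inspecting before training a model.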

Below I have included a render of the Jupyter Notebook where I did my initial training. The full repo is available on GitHub.
