1 DATA.
Link: https://archive.ics.uci.edu/ml/datasets/Heart+Disease
This is a database of patients with heart disease. The data were collected at University Hospital, Zurich and University Hospital, Basel (Switzerland), the V.A. Medical Center, Long Beach, the Cleveland Clinic Foundation, and the Hungarian Institute of Cardiology, Budapest. Each database has a different number of samples: Cleveland: 303, Hungarian: 294, Switzerland: 123, and Long Beach VA: 200. All attributes are numeric, and every database uses the same instance format. The databases have 76 features, but all published experiments use a subset of 14 of them (age, sex, cp, trestbps, chol, fbs, restecg, thalach, exang, oldpeak, slope, ca, thal, diagnosis).
The output, diagnosis, is the presence of heart disease in the patient, coded from 0 to 4 (5 outputs). This report labels the five values as follows:
Value 0: absence of heart disease.
Value 1: heart rhythm disturbances (arrhythmias), which affect the electrical conduction.
Value 2: cardiomyopathy, which affects how the heart muscle squeezes.
Value 3: valvular heart disease, which affects how the valves function to regulate blood flow in and out of the heart.
Value 4: coronary artery (atherosclerotic) heart disease, which affects the arteries to the heart.
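As a rough illustration of the 14-feature subset and the sample counts quoted above, the sketch below names the columns and splits one record into fields. It assumes the records are comma-separated with ? marking a missing value; the helper parse_record and the example record are hypothetical and are not taken from the dataset.

```python
# The 14 attributes used by published experiments, in the order listed above.
COLUMNS = [
    "age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
    "thalach", "exang", "oldpeak", "slope", "ca", "thal", "diagnosis",
]

# Sample counts per database, as quoted in the text.
SITE_SAMPLES = {
    "Cleveland": 303,
    "Hungarian": 294,
    "Switzerland": 123,
    "Long Beach VA": 200,
}


def parse_record(line):
    """Split one comma-separated record into a {column: raw string} dict.

    Assumption: one record per line, fields separated by commas.
    """
    fields = [field.strip() for field in line.split(",")]
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    return dict(zip(COLUMNS, fields))


# Total number of samples across the four databases.
total_samples = sum(SITE_SAMPLES.values())

# Illustrative record (made up for this sketch, not copied from the data).
example = parse_record("54,1,4,130,250,0,0,150,0,1.2,2,?,3,0")
```

Values are kept as raw strings here so that the ? marker survives until the preprocessing stage decides how to handle it.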
2 DESCRIBING FEATURES
3 PREPROCESSING
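The three steps of this section (replacing ? with NaN, removing rows that contain NaN, and normalization) could be sketched in plain Python as follows. The report does not say which normalization was used, so min-max scaling to [0, 1] is an assumption here; the function names and the tiny input table are hypothetical.

```python
import math


def replace_missing(rows, marker="?"):
    """Step 1: turn every marker cell into NaN and every other cell into a float."""
    return [
        [math.nan if cell.strip() == marker else float(cell) for cell in row]
        for row in rows
    ]


def drop_missing(rows):
    """Step 2: keep only the rows that contain no NaN."""
    return [row for row in rows if not any(math.isnan(v) for v in row)]


def min_max_normalize(rows):
    """Step 3 (assumed min-max): rescale each column to [0, 1].

    A constant column is mapped to 0.0 to avoid dividing by zero.
    """
    if not rows:
        return []
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


# Made-up feature rows (three columns only), with one missing value.
raw = [["63", "1", "233"], ["67", "?", "286"], ["41", "0", "204"]]
clean = min_max_normalize(drop_missing(replace_missing(raw)))
```

Only feature columns should be passed to the normalization step; the diagnosis column is a label and is normally left unscaled.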
Replace ? by NaN value
Remove NaN values
Normalization
RESULT
A curve closer to the perfect ROC curve indicates a better performance level.
Random forest: its curve is close to the perfect ROC curve, so it performs better than the others.
Random forest with limited features (using feature selection): its curve is close to the perfect ROC curve, so it performs better than the others.
K-nearest neighbors with limited features (using feature extraction): its curve is close to the perfect ROC curve, so it performs better than the others.
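To make "close to the perfect ROC curve" concrete, the sketch below builds ROC points from classifier scores and measures the area under the curve (AUC); a perfect curve has an AUC of 1. It assumes the diagnosis has been collapsed to a binary label (1 for presence, 0 for absence), which the report does not state, and the scores are made up for illustration rather than taken from the experiments.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) points.

    The threshold sweeps from the highest score to the lowest; tied scores
    are handled together so they produce a single point.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Made-up labels and scores for two hypothetical classifiers.
labels = [1, 1, 0, 1, 0, 0]
scores_good = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]     # one positive ranked below a negative
scores_perfect = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1]  # every positive ranked above every negative

auc_good = auc(roc_points(labels, scores_good))
auc_perfect = auc(roc_points(labels, scores_perfect))
```

Comparing AUC values gives a single number per model, which is one way to back up a visual reading of which curve lies closest to the top-left corner.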
CONCLUSION
Random forests produced the best performance, both in test error rate and in the numbers of true positives and true negatives.
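The two criteria named in the conclusion can be computed as in the sketch below. It again assumes binary labels (1 for presence of heart disease, 0 for absence); the function names and the example predictions are hypothetical and do not reproduce the reported results.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            counts["tp"] += 1
        elif truth == 0 and pred == 0:
            counts["tn"] += 1
        elif truth == 0 and pred == 1:
            counts["fp"] += 1
        else:
            counts["fn"] += 1
    return counts


def error_rate(y_true, y_pred):
    """Fraction of test samples that were misclassified."""
    if not y_true:
        raise ValueError("empty test set")
    wrong = sum(1 for truth, pred in zip(y_true, y_pred) if truth != pred)
    return wrong / len(y_true)


# Made-up test labels and predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

counts = confusion_counts(y_true, y_pred)
test_error = error_rate(y_true, y_pred)
```

A model with a lower test error and higher tp and tn counts on the same test set would rank better under these criteria.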