Date of Award

Spring 5-2022

Author's School

McKelvey School of Engineering

Author's Department

Computer Science & Engineering

Degree Name

Master of Science (MS)

Degree Type



Applying machine learning and statistical analysis to traditional informatics problems is a growing area of research that can help clinicians better predict disease outcomes and deliver more personalized care. In this study, several machine learning models are used to model the likelihood of metastasis in breast cancer patients using a mix of data from the electronic health record (EHR) and socioeconomic information derived from the Area Deprivation Index (ADI). Metastasis is a late-stage progression of cancer in which a tumor spreads from its site of origin to another part of the body. In breast cancer, the most commonly diagnosed cancer in the United States, more research is needed to assess which patient characteristics may lead to metastasis. The EHR has emerged as a vast source of information for researchers, even though its primary purpose is billing. While demographic and clinical information is commonly logged in the EHR, socioeconomic information is generally unavailable. Information from social deprivation indices can, however, be mapped to patients using geographical information such as zip codes. Social determinants of health (SDoH) are the characteristics of the environment and population in which people live, and studies have shown that living in areas with greater social disadvantage results in more adverse health outcomes. Hence, the focus of this research is twofold. The first aim is to model metastasis prediction using a variety of machine learning models and assess which types perform best on the engineered data. The second is to assess, given evidence that socioeconomic indicators contribute to predicting health outcomes, how predictive such values are in models that also contain clinical information, which has traditionally been the main predictor of health outcomes.
In this study, tree-based algorithms such as Random Forest and XGBoost had the greatest predictive performance, but within those models, scores that measure health based on comorbidities, along with other clinical variables, overshadowed the contribution of the Area Deprivation Index scores engineered at the 5-digit zip code level. What follows is a discussion of model performance and evaluation metrics, as well as an analysis of each variable's contribution using importance calculations provided by scikit-learn and the SHapley Additive exPlanations (SHAP) method. Another key discussion that emerges from this research concerns how social deprivation indices can best be used in studies that model disease and what possibilities exist for using indices at different levels of geographic summary.
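
As a minimal sketch of the zip-code mapping described in the abstract, the Python snippet below attaches a zip-level deprivation score to patient rows. It is an illustration under stated assumptions, not the thesis's actual pipeline: the field names (`patient_id`, `zip5`, `comorbidity_score`, `adi`), the helper `attach_adi`, and every value shown are hypothetical.

```python
# Hypothetical patient rows; field names and values are made up for illustration.
patients = [
    {"patient_id": 1, "zip5": "00001", "comorbidity_score": 2},
    {"patient_id": 2, "zip5": "00002", "comorbidity_score": 0},
    {"patient_id": 3, "zip5": "00003", "comorbidity_score": 1},  # zip with no ADI entry
]

# Hypothetical lookup of an ADI score summarized at the 5-digit zip level.
adi_by_zip = {"00001": 71.0, "00002": 24.5}


def attach_adi(rows, adi_lookup):
    """Return copies of the rows with an 'adi' field added.

    Rows whose zip code is missing from the lookup get adi=None, so the
    gap stays visible instead of being silently filled in.
    """
    return [{**row, "adi": adi_lookup.get(row["zip5"])} for row in rows]


features = attach_adi(patients, adi_by_zip)
```

Leaving unmapped zip codes as `None` is a design choice in this sketch; how missing deprivation scores should be handled in a real model is a separate decision.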


English (en)


Dr. Philip R.O. Payne, PhD

Committee Members

Dr. Graham Colditz, PhD; Dr. Chenyang Lu, PhD; Dr. Alvitta Ottley, PhD

Included in

Engineering Commons