Machine Learning and Scalable Informatics Methods to Predict Disease Status from Multimodal Biomedical Data
Date of Award
Doctor of Philosophy (PhD)
Biological understanding of complex diseases such as stroke and obesity is critical for the advancement of medicine. Further knowledge discovery can provide effective biomarkers to improve disease diagnosis and prognosis, identify driver mutations, predict individual genetic susceptibility for early prevention and effective disease management, and facilitate development of personalized drugs. Stroke is the second leading cause of death and long-term disability in the world, and its management is a time-sensitive emergency. The initial hours after stroke onset map the trajectory of subsequent neurologic complications. Cerebral edema develops hours to days after acute ischemic stroke and may result in midline shift and cerebral herniation, but only a small proportion of all stroke patients will develop this life-threatening complication. For most patients, deterioration is delayed by a few days after stroke, allowing a window of opportunity for early detection and intervention. Decompressive hemicraniectomy, if performed within 48 h and prior to deterioration, dramatically reduces mortality and improves chances of functional recovery. Accurately predicting which hemispheric stroke patients will develop malignant edema is therefore of vital importance in acute stroke care. Similarly, obesity is a risk factor in stroke severity and post-stroke recovery, and a significant risk factor for increased morbidity and mortality, most importantly from cardiovascular disease, diabetes, cancer, and other chronic diseases such as liver and kidney disease, osteoarthritis, and depression. Complex diseases overlap genetically, and there has been a surge of interest in identifying disease risk early in life from genetic data and in discovering genetic overlap so that multiple conditions can be targeted simultaneously. Thus, extracting disease signatures and predicting associated biomarkers from genetic data is critical in precision medicine.
To this end, big data analyses provide an opportunity to discover effective biomarkers that capture the kinetics of complications after stroke and to identify disease drivers for individualized stroke therapies. Collecting multi-center biomedical data may advance our understanding of the clinical and biologic factors contributing to polygenic complications after stroke. This has led to a surge of interest and effort to collaborate on stroke research by combining imaging, clinical, and omics datasets to better understand the biology of stroke and the trajectories of stroke complications. This international effort for knowledge discovery requires a centralized informatics platform that enables multi-center collaborators to jointly share biomedical data from hospitals and international research centers. Moreover, an advanced informatics infrastructure can facilitate developing and launching containerized data processing pipelines at scale to aggregate information from multimodal data (medical imaging data, clinical data, and biological data) and to validate new knowledge on multi-center heterogeneous datasets. A centralized repository can also provide standard datasets for evaluating the efficacy of pipelines on heterogeneous data before the software is adopted in clinical settings as a medical device. Integration of Artificial Intelligence (AI) with computational pipelines is emerging as a game-changer for the study of complex disease from multimodal biomedical data. AI-based pipelines can automate medical image analysis at scale for complex phenotype discovery. Machine learning-based pipelines can also facilitate diagnosis and prognosis of neurological disorders by learning predictive patterns from multimodal heterogeneous data. Furthermore, deep learning-based methods can identify patients with genetic risk factors for polygenic disorders by extracting complex patterns in genomics data.
This dissertation explores the development of a scalable informatics platform and big multimodal data mining methods for phenotype discovery and disease status prediction. It is composed of four main sections: (1) launching a scalable informatics platform for stroke research, (2) determining the incremental value of incorporating imaging-derived features from serial CTs to enhance prediction of malignant edema, (3) evaluating whether neural networks incorporating data extracted from routine computed tomography (CT) imaging and clinical databases could enhance prediction of edema in a large, diverse stroke cohort, and (4) developing a genome-wide deep learning pipeline to predict phenotype and associated pointwise uncertainty from genetic variants. To begin, the Stroke Neuro-Imaging Phenotype Repository (SNIPR) was developed as a multi-center centralized imaging repository of clinical CT and MRI scans from stroke patients worldwide. Its purpose is to manage and facilitate sharing of high-value stroke imaging datasets, implement containerized automated computational methods to extract image characteristics and disease-specific features from contributed images, and facilitate integration of imaging and clinical data to perform large-scale analyses of complications after stroke. Additional data type plugins were also developed to manage associated clinical and derived imaging phenotypes in SNIPR. Next, a series of linear classifiers were developed to evaluate the value of adding longitudinal imaging and clinical features to the prediction of fatal edema. We extracted quantitative imaging features from baseline and follow-up CTs, including cerebrospinal fluid (CSF) volume, intracranial reserve, midline shift (MLS), and infarct-related hypodensity volume. Potentially lethal malignant edema was defined as requiring decompressive hemicraniectomy (DHC) or dying with MLS over 5 mm. Machine-learning models were built using logistic regression, first with baseline data and then adding 24-h data, including reduction in CSF volume.
Model performance was evaluated with cross-validation using recall, precision, and the areas under the receiver-operating-characteristic and precision-recall curves. Subsequently, a long short-term memory (LSTM) model was developed to capture the nonlinear dynamic nature of fatal edema and predict mortality and the need for surgery. Fully connected and LSTM neural networks were trained using serial clinical and imaging data to predict which patients would require hemicraniectomy or die with large midline shift. The performance of these models was compared with that of regression models and the Enhanced Detection of Edema in Malignant Anterior Circulation Stroke (EDEMA) score, using cross-validation to construct precision-recall curves. Interpretability methods were also applied to this LSTM model to find the inputs most significant for fatal edema prediction. The most significant biomarkers can be used to identify druggable targets in future studies. Finally, a genome-wide deep learning pipeline was developed to predict phenotype from common genetic variants, model nonlinear epistatic interaction between genetic variants, and predict pointwise uncertainty from non-genetic factors. Body Mass Index (BMI) was selected as the initial phenotype of interest for developing the methods because BMI is readily available in large public datasets and obesity is a risk factor for stroke. Genetic variants were selected based on a prior meta-analysis of genome-wide association studies and on linkage distance, and the human reference genome was used to map variants to dosage values in UK Biobank. The significance level was relaxed to feed more variants into linear and nonlinear models and so evaluate the degree of association across the genome, and both linear and nonlinear models were developed to evaluate epistatic interaction between variants.
A two-head neural network was then developed to predict phenotype from millions of variants in linkage equilibrium, and both data uncertainty and model uncertainty were estimated to give insight into prediction confidence and model improvement.
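One common way to realize a two-head network with pointwise data uncertainty is to have one head output a predicted mean and the other a predicted log-variance, trained with a Gaussian negative log-likelihood. The abstract does not state which formulation the dissertation uses, so the following is only a minimal sketch under that assumption. The names `init_params`, `two_head_forward`, and `gaussian_nll`, the layer sizes, and the toy dosage vector are all hypothetical.

```python
import math
import random


def init_params(n_inputs, n_hidden, seed=0):
    """Small random weights for a shared layer plus two linear output heads."""
    rng = random.Random(seed)

    def vec(n):
        return [rng.gauss(0.0, 0.1) for _ in range(n)]

    return {
        "W": [vec(n_inputs) for _ in range(n_hidden)],  # shared hidden layer
        "b": [0.0] * n_hidden,
        "w_mean": vec(n_hidden), "b_mean": 0.0,         # head 1: phenotype mean
        "w_logvar": vec(n_hidden), "b_logvar": 0.0,     # head 2: log-variance
    }


def two_head_forward(x, params):
    """Return (mean, log_var) for one input vector of variant dosages."""
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(params["W"], params["b"])
    ]
    mean = sum(w * h for w, h in zip(params["w_mean"], hidden)) + params["b_mean"]
    log_var = (
        sum(w * h for w, h in zip(params["w_logvar"], hidden)) + params["b_logvar"]
    )
    return mean, log_var


def gaussian_nll(y, mean, log_var):
    """Per-sample Gaussian negative log-likelihood, constant term dropped.

    Assumed loss, not confirmed by the source: a larger predicted variance
    down-weights the squared error but is itself penalized by the log term.
    """
    return 0.5 * (log_var + (y - mean) ** 2 / math.exp(log_var))


# Toy usage: six variants coded as dosages 0, 1, or 2 (illustrative values only).
dosages = [0, 1, 2, 1, 0, 2]
params = init_params(n_inputs=len(dosages), n_hidden=4)
mean, log_var = two_head_forward(dosages, params)
loss = gaussian_nll(y=0.5, mean=mean, log_var=log_var)
```

In this sketch the log-variance head stands in for data uncertainty only. Model uncertainty would need a separate mechanism, such as repeated stochastic forward passes or an ensemble, which is not shown here and which the abstract does not specify.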
Daniel S. Marcus, Joseph O'Sullivan
Ulugbek Kamilov, Aristeidis Sotiras, Rajat Dhar,
Available for download on Monday, August 26, 2024
Analytical, Diagnostic and Therapeutic Techniques and Equipment Commons, Artificial Intelligence and Robotics Commons, Bioimaging and Biomedical Optics Commons, Biology Commons, Radiology Commons