Learning on the Graph: Link Prediction, Multi-label Learning, and Applications to Integrative Complex Disease Studies
Date of Award
Doctor of Philosophy (PhD)
Most common human diseases, such as cancer, diabetes, and Alzheimer's disease, result from abnormalities in multiple cellular components and from perturbations of their intricate interactions. Emerging network-based approaches offer a unique framework for understanding the molecular mechanisms underlying human diseases, thanks to their ability to characterize the complex associations between cellular components and disease. However, real-world applications of these approaches still face many pressing challenges. The data that represent the known associations and interactions of biological entities remain extremely sparse, and how to learn on networks with incomplete data has not been well studied. Moreover, new types of data continue to emerge, and the ongoing growth of large-scale data complicates the representation and integration of heterogeneous data. In this thesis, we focus on developing network-based approaches for effectively integrating biological data characterized by inherent heterogeneity and incompleteness.
We first investigate the link prediction problem, which underlies many important applications in the study of biological networks. We propose a novel link prediction algorithm, the Marginalized Denoising Model (MDM), which explicitly acknowledges data incompleteness. MDM casts link prediction as a matrix "denoising" problem by modeling a function that can reconstruct the target "complete" graph from the observed "incomplete" one. We then propose the low-rank Marginalized Denoising for Link Prediction (lrMDLP) method, which follows the same idea as MDM but scales to large networks. Experiments on a variety of real-world networks demonstrate that lrMDLP achieves comparable or superior performance on relatively small datasets and can be readily applied to networks with millions of nodes.
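The abstract does not state MDM's objective, so the following is only an illustrative sketch of the general "denoising" view of link prediction, not the thesis's algorithm. It assumes a linear reconstruction map trained under random edge dropout, with the dropout averaged out in closed form in the style of marginalized denoising autoencoders. The function name `marginalized_denoising_scores` and the parameters `p` (dropout probability) and `reg` (ridge term) are hypothetical.

```python
import numpy as np

def marginalized_denoising_scores(A, p=0.5, reg=1e-5):
    """Score node pairs with a linear denoiser (illustrative assumption,
    not the thesis's MDM). Each column of A is treated as one sample and
    each entry is assumed to be dropped independently with probability p."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    q = np.full(n, 1.0 - p)              # survival probability per feature
    S = A @ A.T                          # scatter matrix of the observed graph

    # Expected scatter of the corrupted input, E[X~ X~^T]:
    # off-diagonal entries scale by q_i * q_j, diagonal entries by q_i.
    Q = S * np.outer(q, q)
    np.fill_diagonal(Q, np.diag(S) * q)

    # Expected cross-scatter between clean and corrupted input, E[X X~^T].
    P = S * q[np.newaxis, :]

    # W = P Q^{-1}; Q is symmetric, so solve Q W^T = P^T instead of inverting.
    W = np.linalg.solve(Q + reg * np.eye(n), P.T).T

    scores = W @ A                       # reconstructed ("denoised") graph
    scores = (scores + scores.T) / 2.0   # undirected graph: symmetrize
    np.fill_diagonal(scores, 0.0)        # ignore self-links
    return scores

# Toy graph: a 4-node clique missing edge (0, 1), plus a separate triangle.
A_demo = np.zeros((7, 7))
for i, j in [(0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (4, 5), (5, 6), (4, 6)]:
    A_demo[i, j] = A_demo[j, i] = 1.0
scores_demo = marginalized_denoising_scores(A_demo, p=0.3)
```

In this sketch, unobserved pairs would be ranked by their entries in `scores_demo`. A dense n-by-n solve like this one does not scale to millions of nodes, which is the kind of limitation a low-rank variant such as lrMDLP is described as addressing.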
Motivated by biological problems such as protein function prediction and disease gene prediction, we then investigate the multi-label learning problem on graphs. Recognizing its close relationship with the link prediction problem, we propose the Marginalized Label and Link Prediction (MLLP) algorithm, which jointly learns both tasks under a unified framework. We show that such joint optimization can help reduce the impact of data incompleteness and improve the performance of both tasks.
To demonstrate the effectiveness of network-based approaches in integrative disease studies, we introduce module-guided Random Forests (mgRF), which integrates genotypic and gene expression data from an obesity study with the help of the community structure of the co-expression network. We then apply lrMDLP to a large-scale heterogeneous network that incorporates the associations among genes, diseases, drugs, and pathways, where it helps mine hidden associations between pairs of different biological entities. The results of a case study on Alzheimer's disease indicate that our approach can not only identify new disease-related genes and discover novel disease-disease associations, but may also offer insights into the molecular basis of human diseases.
Yixin Chen, Sanmay Das, Nan Lin, Kilian Q Weinberger