Date of Award
8-6-2024
Degree Name
Doctor of Philosophy (PhD)
Degree Type
Dissertation
Abstract
Traditionally, machine learning (ML) is used to train a model to predict scores for instances that the model did not see during training. In this conventional use of ML, the model learns general patterns that relate features to labels, so it can predict accurate scores for unseen data. Here, we use ML unconventionally, to integrate different types of noisy data, specifically for biological investigations in which there is no means of measuring a ground truth. We propose training an ML model to predict scores on the same instances that were used in its training. In this setup, features are derived from one noisy data type, and labels are derived from another noisy data type. Each data type presents a different facet of the biological signal. Neither data type is the ground truth; yet each is a noisy proxy of it. During training, the model learns to memorize a consensus of patterns from features and labels, and hence a consensus of patterns from the first and second data types. So, when it is fed the same training instances, it predicts integrated scores that are influenced by both data types. We showed that these predicted, integrated scores are less noisy than the features and labels that were used to train the model. We applied this new integration concept in several biological and genetic applications. In Chapters 2 and 3, we applied it to infer transcription factor (TF) network maps for Saccharomyces cerevisiae and Cryptococcus neoformans. In Chapter 4, we applied it to predict gene expression levels, which we then used to improve gene-trait associations.
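The integration concept described above can be sketched with simulated data. This is a minimal illustration, not the dissertation's actual data or model: it assumes a latent "truth" signal, two independently noisy proxies of it (one used as features, the other as labels), and a random forest regressor that, when asked to predict on its own training instances, blends the memorized labels with feature-based neighborhood averages. The predicted, integrated scores then track the latent signal more closely than either noisy input alone.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Simulated setting (hypothetical; in real applications the truth is unobservable):
rng = np.random.default_rng(0)
n = 2000
truth = rng.normal(size=n)                        # latent biological signal
features = truth + rng.normal(scale=0.7, size=n)  # noisy data type 1
labels = truth + rng.normal(scale=0.7, size=n)    # noisy data type 2

# Train on one noisy proxy against the other ...
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(features.reshape(-1, 1), labels)

# ... then predict on the SAME training instances to get integrated scores.
integrated = model.predict(features.reshape(-1, 1))

def corr(a, b):
    """Pearson correlation between two score vectors."""
    return float(np.corrcoef(a, b)[0, 1])

print(f"features vs truth:   {corr(features, truth):.3f}")
print(f"labels vs truth:     {corr(labels, truth):.3f}")
print(f"integrated vs truth: {corr(integrated, truth):.3f}")
```

Because each training instance appears in roughly 63% of the bootstrap samples, its own label is memorized by those trees, while the out-of-bag trees contribute a feature-based estimate; the resulting blend averages out part of both noise sources.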
Language
English (en)
Chair
Michael Brent