Date of Award

Summer 8-15-2019

Author's School

Graduate School of Arts and Sciences

Author's Department

Biology & Biomedical Sciences (Molecular Cell Biology)

Degree Name

Doctor of Philosophy (PhD)

Degree Type



PubMed hosts nearly 30 million abstracts covering both clinical and molecular research in a wide variety of contexts. Increased access to -omic technologies accelerates scientific discovery and leads to more publications each year. One problem with this growth has been the unequal distribution of resources across discoveries. For example, a large fraction of the publications and grants on PubMed discuss only a subset of human genes, leaving the majority of the genome in the dark. Moreover, focused attention on understudied human genes was lacking before the work in this dissertation. We therefore provide a framework that levels the playing field, giving understudied human genes a chance of being studied and published.

We first applied text mining to extract information about human genes from PubMed abstracts. We found that as many as 50% of human genes can be considered understudied and are therefore orphan genes. In contrast, a very small subset of already well-studied (e.g., top-cited) genes accounts for the majority of the increase in publications in recent years. We also observed that recently de-orphanized genes share a common set of features that explain their pattern of discovery, such as an association with top-cited genes and occurrence in Mendelian or GWAS studies. Based on this set of features, we developed Molecular ORphan PHEnOtype MatchEr (MORPHEOME), a computational and molecular framework that helps researchers map orphan genes to existing genes and biology.

To generate the genome-wide data for MORPHEOME, we carried out multiple CRISPRi and CRISPRa screens against bisphosphonates, SSRIs, biguanides, and other commonly prescribed blockbuster drugs. We also combined existing CRISPR knockout datasets, protein-protein interaction datasets, and TCGA survival/mouse knockout phenotypes to map orphan genes to their most relevant top-cited genes. We found that already published gene pairs share a high overlap in their protein-protein interaction and genetic fitness relationships. Similarly, we validated a set of novel protein-protein interactions between orphan genes and top-cited genes, selected on the strength of their interaction predictions. Notably, we also followed up on two recently identified orphan genes, ATRAID and SLC37A3, which were novel targets of bisphosphonates. MORPHEOME predicted a novel link between ATRAID/SLC37A3 and mitochondria-localized genes at the intersection with the mevalonate pathway. Our follow-up work has begun to clarify the role of these two genes in connecting the mevalonate pathway with mitochondria. These findings show how an unbiased approach combining PubMed abstracts and genetic screens can reveal previously unknown biology and serve as a guide for other orphan genes.


English (en)

Chair and Committee

Timothy R. Peterson

Committee Members

Harrison Gabel, Ira Hall, Jason Held, Ting Wang


Permanent URL:

Available for download on Saturday, August 29, 2026