Author's School

Graduate School of Arts & Sciences

Author's Department/Program

Biology and Biomedical Sciences: Computational and Systems Biology


Language

English (en)

Date of Award


Degree Type


Degree Name

Doctor of Philosophy (PhD)

Chair and Committee

Justin C Fay


With recent advances in sequencing technology, there has been growing interest in developing new methods to predict disease-causing alleles in a personal genome by integrating functional evidence from sequence conservation, genome-wide association studies, and the transcriptional regulatory network. However, even in protein-coding regions, it is not well understood how often, and by what mechanism, deleterious alleles that disrupt strongly conserved sequence can rise to common frequencies in a population and affect complex traits in humans. Moreover, in non-coding regions, even for known disease-causing genes, it is not clear how sequence conservation can be combined with functional genomic data to predict the underlying disease-causing variants.

To address the first question, I developed a new likelihood ratio test for sequence conservation to predict deleterious missense alleles in the human genome. Applying the new test to three personal genomes, I find that the presence of only 10% of common deleterious SNPs can be explained by false positives due to multiple hypothesis testing, violation of evolutionary model assumptions, recent gene duplication, and relaxation of selective constraints on biological processes. Next, applying the likelihood ratio test to a general human population, I find that both computationally predicted deleterious SNPs and known disease-associated alleles are enriched within genomic regions that have been influenced by positive selection in the recent past. The observed pattern agrees with the prediction that deleterious alleles can be dragged along to higher-than-expected allele frequencies through genetic linkage with beneficial alleles, a process known as the hitchhiking effect.

To address the second question, I developed an integrative strategy to predict disease-causing non-coding variants, using the FSH receptor, a gene known to be associated with preterm birth, as a proof of principle. I sequenced protein-coding and conserved non-coding regions in preterm and term mothers, and conducted fine-mapping and transcription factor binding site analysis to narrow down the causal non-coding variants. I find that, in non-coding regions, causal variants can be resolved better by accounting for the expected effects of binding site mutations on the transcriptional regulatory network in addition to sequence conservation.

These results indicate that comparative genomics will provide a new opportunity to explore deleterious and disease-causing genetic variation at unprecedentedly high resolution, both across the genome and within a population, especially if it can be integrated with functional genomics.


This work is not available online per the author’s request. For access information, please contact or visit

Permanent URL: