Date of Award

Winter 12-15-2016

Author's School

Graduate School of Arts and Sciences

Author's Department

Biology & Biomedical Sciences (Human & Statistical Genetics)

Degree Name

Doctor of Philosophy (PhD)

Degree Type



The diagnosis of rare, idiopathic diseases is emerging as a primary application of medical genome sequencing. However, the application of standard tools from genetic epidemiology to many of these cases is frustrated by a combination of small sample sizes, genetic heterogeneity and the large number of singleton variants found by genome sequencing. In response, we have developed a statistical inference framework that is optimized for identifying unusual functional variation from a single genome, which we refer to as the "n-of-one" problem. Our statistical framework addresses the n-of-one problem in two steps: first by scoring single nucleotide variants (SNVs) according to their predicted pathogenicity, and then by calculating a test statistic based on these predictions and evaluating its significance using null distributions parameterized with population genetic data from healthy individuals. To implement this framework, we first evaluated the performance of several pathogenicity prediction methods, including a logistic regression-based model developed in-house, before selecting the Combined Annotation Dependent Depletion (CADD) score as our metric for scoring pathogenicity. Next, population genetic data from over 60,000 individuals from the Exome Aggregation Consortium (ExAC) was used to parameterize population genetic models and generate gene-specific null models of the healthy population for three distinct disease inheritance models. Using this approach, we assessed our ability to identify the causal genotypes in over 5 million simulated cases of Mendelian disease, finding that 39% of disease genotypes are the most damaging unit in a typical exome background. We applied our approach to several rare disease cohorts, including 129 n-of-one families from the Undiagnosed Diseases Program (UDP), nominating 60% of the 30 genotypes or genotype pairs determined to be diagnostic by a standard clinical workup as the most likely candidate in that family.
We show that our approach provides a powerful way to include population databases in an integrative analysis by combining population sampling probabilities with gene expression data to improve the detection of UDP diagnostic genes. Currently, our method can produce well-calibrated p-values when applied to single genomes and, with further work, could become a widely used epidemiological method, like linkage analysis or GWAS. We further demonstrate the utility of population genetic data by leveraging paired genotype and gene expression data available through the Genotype-Tissue Expression (GTEx) consortium to develop a framework for identifying copy number variants (CNVs) that can modulate gene expression. Studies have found that, because of its size, an individual CNV is capable of disrupting the expression of multiple genes. Our framework leverages this finding to increase our power to detect expression-modulating CNVs that are observed in only a single donor. We then characterize the expression-modulating CNVs identified with our framework, discuss ways in which this model could be used to extend our n-of-one framework to CNVs, and describe datasets that would be appropriate for analysis with this proposed framework.
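
The two-step idea summarized in the abstract (score variants, then compare the score with a gene-specific null built from population data) can be illustrated with a minimal sketch. This is not the dissertation's implementation: the abstract does not specify the test statistic or the null models, so everything here is an assumption made for illustration. The function names `simulate_null_scores` and `empirical_p_value`, the dominant-style carrier model, the use of the top score per individual as the statistic, and the allele frequencies and scores are all hypothetical.

```python
import random


def simulate_null_scores(variants, n_individuals, rng):
    """Simulate a hypothetical gene-specific null distribution.

    variants: list of (allele_frequency, pathogenicity_score) pairs for one gene.
    Each simulated healthy individual carries a variant with probability
    1 - (1 - f)^2 (at least one copy, assuming Hardy-Weinberg proportions).
    The statistic is the highest score among carried variants (0.0 if none).
    """
    null_scores = []
    for _ in range(n_individuals):
        top = 0.0
        for allele_freq, score in variants:
            carrier_prob = 1.0 - (1.0 - allele_freq) ** 2
            if rng.random() < carrier_prob:
                top = max(top, score)
        null_scores.append(top)
    return null_scores


def empirical_p_value(observed, null_scores):
    """Share of null draws at least as large as the observed score.

    The add-one correction keeps the estimate strictly above zero.
    """
    at_least = sum(1 for s in null_scores if s >= observed)
    return (at_least + 1) / (len(null_scores) + 1)


# Illustrative inputs only: made-up allele frequencies and scores for one gene.
rng = random.Random(0)
gene_variants = [(0.01, 12.0), (0.001, 25.0), (0.0001, 35.0)]
null = simulate_null_scores(gene_variants, 10_000, rng)

# Significance of a patient genotype whose top score in this gene is 35.0.
p = empirical_p_value(35.0, null)
print(p)
```

A small p-value here would mean that few simulated healthy individuals carry anything as damaging in this gene as the patient does. Under the same assumptions, other inheritance models would change how carrier genotypes are sampled rather than how the p-value is computed.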


English (en)

Chair and Committee

Donald F. Conrad

Committee Members

Justin C. Fay, Christina A. Gurnett, Jeff Gill, Ira M. Hall


Permanent URL: https://doi.org/10.7936/K7KS6Q0D

Available for download on Tuesday, December 15, 2116

Included in

Biology Commons