Date of Award

Winter 12-15-2017

Author's School

Graduate School of Arts and Sciences

Author's Department

Biology & Biomedical Sciences (Human & Statistical Genetics)

Degree Name

Doctor of Philosophy (PhD)

Degree Type



For cancer genomics to fully expand its utility from research discovery to clinical adoption, somatic variant detection pipelines must be optimized and standardized to ensure identification of clinically relevant mutations and to reduce laborious and error-prone post-processing steps. To address the need for improved catalogues of clinically and biologically important somatic mutations, we developed DoCM, a Database of Curated Mutations in Cancer (, as described in Chapter 2. DoCM is an open source, openly licensed resource to enable the cancer research community to aggregate, store and track biologically and clinically important cancer variants. DoCM is currently comprised of 1,364 variants in 132 genes across 122 cancer subtypes, based on the curation of 876 publications. To demonstrate the utility of this resource, the mutations in DoCM were used to identify variants of established significance in cancer that were missed by standard variant discovery pipelines (Chapter 3). Sequencing data from 1,833 cases across four TCGA projects were reanalyzed and 1,228 putative variants that were missed in the original TCGA reports were identified. Validation sequencing data were produced from 93 of these cases to confirm the putative variant we detected with DoCM. Here, we demonstrated that at least one functionally important variant in DoCM was recovered in 41% of cases studied. A major bottleneck in the DoCM analysis in Chapter 3 was the filtering and manual review of somatic variants. Several steps in this post-processing phase of somatic variant calling have already been automated. However, false positive filtering and manual review of variant candidates remains as a major challenge, especially in high-throughput discovery projects or in clinical cancer diagnostics. In Chapter 4, an approach that systematized and standardized the post-processing of somatic variant calls using machine learning algorithms, trained on 41,000 manually reviewed variants from 20 cancer genome projects, is outlined. The approach accurately reproduced the manual review process on hold out test samples, and accurately predicted which variants would be confirmed by orthogonal validation sequencing data. When compared to traditional manual review, this approach increased identification of clinically actionable variants by 6.2%. These chapters outline studies that result in substantial improvements in the identification and interpretation of somatic variants, the use of which can standardize and streamline cancer genomics, enabling its use at high throughput as well as clinically.


English (en)

Chair and Committee

Obi Elaine L. Griffith Mardis

Committee Members

Timothy J. Ley, Todd E. Druley, Ron Bose, Christopher A. Maher,


Permanent URL: