Date of Award

Winter 12-15-2018

Author's School

Graduate School of Arts and Sciences

Author's Department

Biology & Biomedical Sciences (Human & Statistical Genetics)

Degree Name

Doctor of Philosophy (PhD)

Degree Type



The use of next-generation genomic sequencing has exploded over the past dozen years. Large consortia, such as The Cancer Genome Atlas (TCGA), the International Cancer Genome Consortium (ICGC), and the Pediatric Cancer Genome Project (PCGP), have made great strides in democratizing big data for the scientific community. These data sets provide a rich resource for building tools for somatic variant discovery and exploratory analysis. Public repositories hold the key to many novel biological and clinical discoveries, e.g., complex indels, splice-creating mutations, and alternative super-enhancer binding sites, as well as machine learning models to predict mutation impact and methods for cancer subtype classification and identification.

By the end of 2014, seven additional cancer types and 11 pediatric tumor cohorts had become publicly available since the Ding lab’s first PanCancer effort [Kandoth et al., 2013]. Motivated by the possibility of novel cancer driver gene discovery, we launched a new PanCan2 effort. We assembled sequence data from 8,018 cancer cases representing a combined 30 pediatric and adult cancer types from 8 organ systems. Analysis of the resulting data corpus identified 270 cancer-associated genes, 107 of which had not been previously reported in PanCancer studies. Pediatric-enriched mutant genes (e.g., IL7R, PAX5, and H3F3A) were found in tumors from the hematopoietic and central nervous systems, consistent with their roles in early development. Distinctive mutational architectures were identified for each of the 8 organ systems, reflecting the tissue of origin and likely exposure to similar environmental factors. TP53 mutant and TP53 wild-type tumors had largely distinct patterns of co-occurring mutations, suggesting a pivotal role of TP53 in shaping the mutational network. Cis-activation of receptor tyrosine kinases at the mutational, expression, and phosphorylation levels, as well as trans-activation of hormone-related transcription factors, was identified through the integration of multiple data types. In the end, this effort did not result in a publication because we did not perform uniform variant calling across all samples and relied primarily on publicly available data sets.

Armed with the knowledge that reviewers would require a complete reboot of the TCGA variant calls before another PanCancer paper would be considered, Dr. Li Ding submitted a proposal to acquire the funding necessary for recalling all TCGA exome sequencing BAMs using many different callers. This effort is referred to as the Multi-center Mutation Calling in Multiple Cancers (MC3). The TCGA cancer genomics data set includes over 10,000 tumor-normal exome pairs across 33 different cancer types; in total, >400 TB of raw data files required re-analysis. A comprehensive encyclopedia of somatic mutation calls for the TCGA data was created to enable robust cross-tumor-type analyses. Our approach accounts for variance and batch effects introduced by the rapid advancement of DNA extraction, hybridization-capture, sequencing, and analysis methods over time. We present best practices for applying an ensemble of seven mutation-calling algorithms with scoring and artifact filtering. The data set created by this analysis includes 3.5 million somatic variants and forms the basis for the PanCancer Atlas papers. The results have been made available to the research community along with the methods used to generate them. This project is the result of collaboration among a number of institutes and demonstrates how team science drives large genomics projects.
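The ensemble idea described above can be illustrated with a minimal sketch. It is not the MC3 pipeline: the caller names, the variant key `(chrom, pos, ref, alt)`, and the `min_callers=2` threshold are assumptions chosen only to show how support from several callers might be tallied before scoring and filtering.

```python
from collections import defaultdict


def consensus_calls(calls_by_caller, min_callers=2):
    """Return variants reported by at least `min_callers` callers.

    `calls_by_caller` maps a caller name to an iterable of variant keys
    (chrom, pos, ref, alt). The threshold is illustrative, not the rule
    used by any specific project.
    """
    support = defaultdict(set)
    for caller, variants in calls_by_caller.items():
        for variant in variants:
            support[variant].add(caller)
    # Keep only variants with enough independent support.
    return {
        variant: sorted(callers)
        for variant, callers in support.items()
        if len(callers) >= min_callers
    }


# Hypothetical toy input: three callers, one variant seen by all of them.
calls = {
    "caller_a": [("1", 100, "A", "T"), ("2", 200, "G", "C")],
    "caller_b": [("1", 100, "A", "T")],
    "caller_c": [("1", 100, "A", "T"), ("3", 300, "C", "G")],
}
consensus = consensus_calls(calls)
```

In this toy input, only the variant reported by all three callers survives; the two single-caller variants are dropped.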

With a complete overhaul of all somatic mutations in TCGA in hand, we sought to use these data for a complete TCGA PanCancer analysis. Rather than relying wholly on in-house algorithms, we performed a PanSoftware analysis spanning 26 computational tools from multiple institutions to catalog driver genes and mutations. Across 9,423 tumor exomes (comprising all 33 TCGA projects), we identified 299 driver genes with implications regarding their anatomical sites and cancer/cell types. Sequence- and structure-based analyses identified >3,400 putative missense driver mutations supported by multiple lines of evidence. Experimental validation confirmed 60%-85% of predicted mutations as likely drivers. We found that >300 MSI tumors are associated with high PD-1/PD-L1, and 57% of tumors analyzed harbor putative clinically actionable events. Our study represents the most comprehensive discovery of cancer genes and mutations to date and will serve as a blueprint for future biological and clinical endeavors.

One of many new waves in the genomics era will be the cohesive integration of multi-omics data. At present, our understanding of molecular processes in oncogenesis is governed by known knowns. This is clearly illustrated in our marker paper, which offers insights into cancer through the synthesis of findings from the TCGA PanCancer Atlas [Ding et al., 2018]. In closing the final chapters of TCGA, we addressed three facets of oncogenesis: (1) somatic driver mutations, germline pathogenic variants, and their interactions in the tumor; (2) the influence of the tumor genome and epigenome on the transcriptome and proteome; and (3) the relationship between the tumor and the microenvironment, including implications for drugs targeting driver events and for immunotherapies. These results will anchor future characterization of rare and common tumor types, primary and relapsed tumors, and cancers across ancestry groups, and will guide the deployment of clinical genomic sequencing.

In quick succession, The Cancer Genome Atlas and the International Cancer Genome Consortium provided the cancer research community with consensus somatic mutation calls: exome-capture calls created by the Multi-center Mutation Calling in Multiple Cancers (MC3) effort and whole genome calls provided by the Pan-Cancer Analysis of Whole Genomes (PCAWG) project. A total of 746 samples were sequenced under both MC3 and PCAWG. We found that ∼80% of possible mutations in covered exomic regions matched between the two technologies. Using a statistical model, we estimated that 15-30% of the unique mutations are attributable to noise caused by variant allele fraction and clonal heterogeneity. We also observed that ∼30% of the mutations uniquely identified by PCAWG could be traced to calls made by a single caller in MC3, which are not reported in the publicly available MC3 data set. Because of the numerous modes of comparison, we built MAFit, an online tool to facilitate engagement with these data. Finally, we highlight the advantages of whole genome technologies in regions of high and low GC content and perform significantly mutated gene analysis, increasing the targeted/captured exomic space by ∼50% to discover additional genes that could only be found using a whole genome sequencing approach.
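A comparison of this kind can be sketched as a set operation over variant keys. The snippet below is only an illustration under stated assumptions: variants are keyed as `(chrom, pos, ref, alt)`, both call sets are already restricted to the same covered regions and samples, and concordance is defined here as shared calls divided by the union of calls, which may differ from the definition used in the dissertation.

```python
def compare_call_sets(exome_calls, genome_calls):
    """Summarize overlap between two collections of variant keys.

    Concordance is computed as |shared| / |union|; this definition is an
    assumption made for the example.
    """
    exome, genome = set(exome_calls), set(genome_calls)
    shared = exome & genome
    union = exome | genome
    return {
        "shared": len(shared),
        "exome_only": len(exome - genome),
        "genome_only": len(genome - exome),
        "concordance": len(shared) / len(union) if union else 0.0,
    }


# Hypothetical toy call sets: three shared variants, one unique to each side.
exome_calls = [("1", 10, "A", "T"), ("1", 20, "C", "G"),
               ("2", 30, "G", "A"), ("3", 40, "T", "C")]
genome_calls = [("1", 10, "A", "T"), ("1", 20, "C", "G"),
                ("2", 30, "G", "A"), ("4", 50, "A", "G")]
summary = compare_call_sets(exome_calls, genome_calls)
```

The `exome_only` and `genome_only` counts are the starting point for asking why a variant was seen by one technology and not the other.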


English (en)

Chair and Committee

Li Ding

Committee Members

Ron Bose, Michael Province, Govindan Ramaswamy, John Rice,