Date of Award

Summer 8-15-2017

Author's School

Graduate School of Arts and Sciences

Author's Department

Biology & Biomedical Sciences (Immunology)

Degree Name

Doctor of Philosophy (PhD)

Degree Type



Modern metagenomic methods have rapidly accelerated the rate of viral discovery. Currently, to discover a novel virus, deep sequencing reads must align to a known reference virus. While alignment is effective at identifying closely related viruses, highly divergent viruses often share no discernible sequence similarity with known viruses. Therefore, the accurate classification of viral dark matter (metagenomic sequences that originate from viruses but do not align to any reference virus sequences) is one of the major obstacles not only to discovering novel viruses but also, by extension, to comprehensively defining the virome. Because viral dark matter results fundamentally from a failure to align sequence reads, two major contributors to it are 1) the lack of diversity represented in reference databases for specific viral families and 2) the reliance on alignment as a metric to define viral taxonomy. In this dissertation, I address each of these issues. These projects resulted in a massive expansion in our understanding of microbial virus diversity, which led me to further interrogate the biology of microbial viruses. Specifically, I attempted to identify novel antiviral mechanisms against RNA bacteriophages and to identify a possible novel family of RNA bacteriophages.

First, I addressed the underrepresentation of viral sequences in databases by identifying a specific underrepresented class of virus, bacteriophages with RNA genomes, and systematically discovering highly divergent novel RNA bacteriophages in previously sequenced data. I identified 161 partial genome sequences from at least 122 RNA bacteriophage phylotypes that are highly divergent from each other and from previously described RNA bacteriophages. These partial genome sequences displayed multiple genome organizations previously unknown for RNA bacteriophages and, in aggregate, encoded 91 open reading frames (ORFs) that did not align to any known protein; in the absence of this systematic discovery effort, sequences related to these ORFs would be classified as viral dark matter.

This new level of RNA bacteriophage diversity suggested that RNA bacteriophages might be major predators of bacteria in the environment. In turn, this would suggest that bacteria might possess active resistance mechanisms that specifically antagonize RNA bacteriophages; as of now, however, no such mechanisms are known. Therefore, one goal was to identify bacterial genes that can restrict RNA bacteriophage infection. I performed a functional metagenomic screen to identify RNA phage resistance genes. From this screen, I identified four genes that conferred resistance to the RNA phages Qβ and MS2 but not to the RNA phage C1.

Additionally, this expansion of RNA bacteriophage diversity suggests that there might be new families of RNA bacteriophages that are unrelated to the previously discovered RNA bacteriophages. One candidate viral family that is currently considered eukaryotic but might in fact consist of RNA bacteriophages is Picobirnaviridae. Picobirnaviruses are bisegmented RNA viruses that are highly prevalent in stool. By analyzing previously sequenced datasets, I discovered multiple new picobirnavirus segments. By analyzing the regions upstream of the ORFs on these segments, I found that almost all of the ORFs are preceded by a bacterial ribosomal binding sequence. This conservation of bacterial ribosomal binding sequences suggests that these viruses might infect bacteria. I then attempted, unsuccessfully, to show that human picobirnavirus can replicate in bacterial cells.
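The upstream-region analysis described above can be illustrated with a minimal sketch. It is not the method used in the dissertation, which the abstract does not specify. The sketch assumes that a "bacterial ribosomal binding sequence" is approximated by a contiguous piece of the canonical Shine-Dalgarno consensus AGGAGG; the function name, the 20-nucleotide window, and the minimum match length of 4 are all hypothetical choices made for illustration.

```python
def has_sd_like_motif(genome, orf_start, window=20, motif="AGGAGG", min_match=4):
    """Check the region upstream of an ORF for a Shine-Dalgarno-like motif.

    orf_start is the 0-based index of the first base of the start codon.
    The window size, motif, and min_match are illustrative assumptions.
    """
    seq = genome.upper().replace("U", "T")  # accept RNA or DNA notation
    upstream = seq[max(0, orf_start - window):orf_start]
    # Accept any contiguous piece of the motif that is at least min_match long,
    # trying the longest pieces first.
    for length in range(len(motif), min_match - 1, -1):
        for offset in range(len(motif) - length + 1):
            if motif[offset:offset + length] in upstream:
                return True
    return False


# Hypothetical segment with AGGAGG a few bases upstream of the ATG start codon.
segment = "CCTTAAGGAGGTTACCATGAAACGT"
print(has_sd_like_motif(segment, segment.find("ATG")))
```

A realistic analysis would likely also score the spacing between the motif and the start codon and compare the motif frequency against a background expectation; both are omitted here for brevity.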

Second, I addressed the reliance on alignment-based algorithms by developing a novel alignment-independent algorithm to identify viral sequences. This algorithm, DiscoVir, is a support vector machine (SVM) model that relies on nucleotide k-mer frequencies to discriminate sequences of novel, highly divergent eukaryotic viruses from prokaryotic and fungal sequences. I validated in silico that DiscoVir can identify viruses from novel viral taxa and that it outperforms BLASTx for almost all viral families. When applied to an authentic metagenomic dataset, DiscoVir identified two additional contigs that corresponded to two previously undetected segments of a novel bunya-like virus. By selectively culturing fungi from this serum sample, I identified an isolate of Penicillium atramentosum that contained all three viral RNA segments, suggesting that this fungal isolate was in fact the host of this novel virus. I sequenced the whole genome of this novel virus and demonstrated that the terminal nucleotide sequences were conserved among the three segments and were consistent with the termini of bunyaviruses in the genera Phlebovirus and Tenuivirus. Thus, application of DiscoVir played a critical role in the identification of the first segmented negative-stranded RNA virus infection of a fungus.
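To make the idea of alignment-independent features concrete, the following minimal sketch computes normalized nucleotide k-mer frequencies, the kind of input an SVM such as DiscoVir could be trained on. It is an illustration only and not DiscoVir's implementation: the function name, the default k of 3, and the handling of ambiguous bases are assumptions, and the classifier itself is omitted.

```python
from itertools import product


def kmer_frequencies(sequence, k=3):
    """Return normalized k-mer frequencies in a fixed (lexicographic) order."""
    alphabet = "ACGT"
    # Enumerate every possible k-mer so that all sequences map to vectors
    # of the same length (4**k), regardless of their content.
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = sequence.upper().replace("U", "T")  # treat RNA as DNA
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows with ambiguous bases such as N
            counts[window] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]


# Hypothetical contig; each contig becomes one fixed-length feature vector.
features = kmer_frequencies("ACGTACGTNNACGT", k=2)
print(len(features))  # 16 possible dinucleotides
```

Because every sequence maps to a vector of the same length, contigs with no alignable similarity to any reference can still be compared in feature space, which is what allows a trained classifier to separate viral from non-viral sequences without alignment.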

Taken together, these projects contribute to the systematic reduction of viral dark matter through two different approaches, both of which enable future researchers to identify a much more diverse repertoire of viruses than previously possible. This increased ability to identify highly divergent viruses will better enable the metagenomics community to accurately define the role of viruses in larger biological processes, including, but not limited to, human disease.


Language

English (en)

Chair and Committee

David Wang

Committee Members

Daved Fremont, Michael Diamond, Josh S. Swamidass, Gautam Dantas


Permanent URL: