Date of Award

Summer 8-15-2017

Author's School

Graduate School of Arts and Sciences

Author's Department

Biology & Biomedical Sciences (Computational & Systems Biology)

Degree Name

Doctor of Philosophy (PhD)

Degree Type



Biological data, such as molecular abundance measurements and protein
sequences, harbor complex hidden structure that reflects the underlying
biological mechanisms. For example, high-throughput abundance measurements
provide a snapshot of the global state of a living cell, while homologous
protein sequences encode the residue-level logic of the proteins' function
and provide a snapshot of the evolutionary trajectory of the protein family.
In this work I describe algorithmic approaches and analysis software I
developed for uncovering hidden structure in both kinds of data.

Clustering is an unsupervised machine learning technique commonly used
to map the structure of data collected in high-throughput experiments,
such as quantification of gene expression by DNA microarrays or
short-read sequencing. Clustering algorithms always yield a partitioning
of the data, but relying on a single partitioning solution can lead to
spurious conclusions. In particular, noise in the data can cause objects
to fall into the same cluster by chance rather than due to meaningful
association. In the first part of this thesis I demonstrate approaches to
clustering data robustly in the presence of noise and apply robust clustering
to analyze the transcriptional response to injury in a neuron cell.
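
The risk of trusting a single partitioning can be illustrated with a small sketch. It is not the method developed in the thesis; it assumes a generic ensemble idea in which plain k-means is rerun many times on noise-perturbed copies of the data, and the fraction of runs in which two objects share a cluster is recorded. The function names, the 1-D toy data, and the parameters (`n_runs`, `noise_sd`, `seed`) are all hypothetical choices made for illustration.

```python
import random


def kmeans_1d(values, k, rng, max_iter=100):
    """Plain Lloyd's k-means on 1-D data; returns one cluster label per value."""
    centers = rng.sample(values, k)  # initialize centers from random data points
    labels = None
    for _ in range(max_iter):
        # Assign every value to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: abs(v - centers[c])) for v in values
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the run has converged
        labels = new_labels
        # Move each center to the mean of its members (keep it if it has none).
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centers[c] = sum(members) / len(members)
    return labels


def co_clustering_frequency(values, k=2, n_runs=200, noise_sd=0.2, seed=0):
    """Fraction of perturbed k-means runs in which each pair shares a cluster."""
    rng = random.Random(seed)
    n = len(values)
    counts = [[0] * n for _ in range(n)]
    for _ in range(n_runs):
        # Perturb the data with Gaussian noise (an assumed noise model).
        noisy = [v + rng.gauss(0.0, noise_sd) for v in values]
        labels = kmeans_1d(noisy, k, rng)
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / n_runs for c in row] for row in counts]


# Toy data: two tight groups and one object (index 3) halfway between them.
data = [0.0, 0.5, 1.0, 5.5, 10.0, 10.5, 11.0]
freq = co_clustering_frequency(data)
print("same group      :", freq[0][1])
print("different groups:", freq[0][4])
print("ambiguous object:", freq[0][3])
```

In this toy setting the two tight groups are separated in every run, while the middle object joins one group or the other depending on initialization and noise. Any single run would nonetheless report it as a confident member of one cluster, which is the kind of chance association the paragraph above warns about.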

In the second part of this thesis I describe an approach for identifying
hidden specificity-determining residues (SDPs) from alignments of protein
sequences descended through gene duplication from a common ancestor
(paralogs) and apply the approach to identify numerous putative SDPs in
bacterial transcription factors in the LacI family. Finally, I describe and
demonstrate a new algorithm for reconstructing the history of duplications
by which paralogs descended from their common ancestor. This algorithm
addresses the complexity of such reconstruction due to indeterminate or
erroneous homology assignments made by sequence alignment algorithms and to
the far greater prevalence of divergence through speciation than of
divergence through gene duplication in protein evolution.
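
As a rough illustration of what an SDP looks like in an alignment, the sketch below flags columns that are perfectly conserved within each paralog group but differ between groups. This is a deliberately naive criterion assumed for illustration, not the approach described in the thesis; the function name, the group names, and the toy sequences are hypothetical and are not LacI family data.

```python
def candidate_sdp_columns(groups):
    """Return 0-based alignment columns conserved within groups but not across them.

    groups maps a paralog group name to a list of aligned sequences,
    all of the same length, with '-' marking gaps.
    """
    lengths = {len(seq) for seqs in groups.values() for seq in seqs}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    n_cols = lengths.pop()

    hits = []
    for col in range(n_cols):
        # Set of residues observed in this column, one set per group.
        per_group = [{seq[col] for seq in seqs} for seqs in groups.values()]
        # Naive criterion, part 1: a single ungapped residue within every group.
        conserved_within = all(
            len(residues) == 1 and "-" not in residues for residues in per_group
        )
        # Naive criterion, part 2: the groups do not all share the same residue.
        distinct_across = len(set().union(*per_group)) > 1
        if conserved_within and distinct_across:
            hits.append(col)
    return hits


# Hypothetical toy alignment with two paralog groups.
toy = {
    "paralog_A": ["MKTAL", "MKSAL", "MKTAL"],
    "paralog_B": ["MRTAV", "MRSAV", "MRTAV"],
}
print(candidate_sdp_columns(toy))  # [1, 4]
```

Columns 1 and 4 are reported because each group is fixed for a different residue there. Column 2 varies within both groups and columns 0 and 3 are identical everywhere, so none of them qualifies. Real alignments contain alignment errors and sampling noise that a strict rule like this one does not handle.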


English (en)

Chair and Committee

Kristen M. Naegle

Committee Members

Barak A. Cohen, Justin C. Fay, James J. Havranek, Gary D. Stormo,


Permanent URL: