Abstract
Biological data, such as molecular abundance measurements and protein sequences, harbor complex hidden structure that reflects their underlying biological mechanisms. For example, high-throughput abundance measurements provide a snapshot of the global state of a living cell, while homologous protein sequences encode the residue-level logic of the proteins' function and provide a snapshot of the evolutionary trajectory of the protein family. In this work I describe algorithmic approaches and analysis software I developed for uncovering hidden structure in both kinds of data.

Clustering is an unsupervised machine learning technique commonly used to map the structure of data collected in high-throughput experiments, such as quantification of gene expression by DNA microarrays or short-read sequencing. Clustering algorithms always yield a partitioning of the data, but relying on a single partitioning solution can lead to spurious conclusions. In particular, noise in the data can cause objects to fall into the same cluster by chance rather than due to meaningful association. In the first part of this thesis I demonstrate approaches to clustering data robustly in the presence of noise and apply robust clustering to analyze the transcriptional response to injury in a neuron cell.

In the second part of this thesis I describe an approach for identifying hidden specificity-determining residues (SDPs) from alignments of protein sequences descended through gene duplication from a common ancestor (paralogs), and I apply the approach to identify numerous putative SDPs in bacterial transcription factors in the LacI family. Finally, I describe and demonstrate a new algorithm for reconstructing the history of duplications by which paralogs descended from their common ancestor. This algorithm addresses two sources of complexity in such reconstruction: indeterminate or erroneous homology assignments made by sequence alignment algorithms, and the far greater prevalence of divergence through speciation than of divergence through gene duplication in protein evolution.
Committee Chair
Kristen M. Naegle
Committee Members
Barak A. Cohen, Justin C. Fay, James J. Havranek, Gary D. Stormo
Degree
Doctor of Philosophy (PhD)
Author's Department
Biology & Biomedical Sciences (Computational & Systems Biology)
Document Type
Dissertation
Date of Award
Summer 8-15-2017
Language
English (en)
DOI
https://doi.org/10.7936/K7NC60M8
Recommended Citation
Sloutsky, Roman, "Robust Algorithms for Detecting Hidden Structure in Biological Data" (2017). Arts & Sciences Theses and Dissertations. 1215.
The definitive version is available at https://doi.org/10.7936/K7NC60M8
Comments
Permanent URL: https://doi.org/10.7936/K7NC60M8