Abstract

Biological data, such as molecular abundance measurements and protein sequences, harbor complex hidden structure that reflects the underlying biological mechanisms. For example, high-throughput abundance measurements provide a snapshot of the global state of a living cell, while homologous protein sequences encode the residue-level logic of the proteins' function and provide a snapshot of the evolutionary trajectory of the protein family. In this work I describe algorithmic approaches and analysis software I developed for uncovering hidden structure in both kinds of data.

Clustering is an unsupervised machine learning technique commonly used to map the structure of data collected in high-throughput experiments, such as quantification of gene expression by DNA microarrays or short-read sequencing. Clustering algorithms always yield a partitioning of the data, but relying on a single partitioning solution can lead to spurious conclusions. In particular, noise in the data can cause objects to fall into the same cluster by chance rather than due to meaningful association. In the first part of this thesis I demonstrate approaches to clustering data robustly in the presence of noise and apply robust clustering to analyze the transcriptional response to injury in a neuronal cell.

In the second part of this thesis I describe an approach for identifying hidden specificity-determining residues (SDPs) from alignments of protein sequences descended through gene duplication from a common ancestor (paralogs) and apply the approach to identify numerous putative SDPs in bacterial transcription factors in the LacI family. Finally, I describe and demonstrate a new algorithm for reconstructing the history of duplications by which paralogs descended from their common ancestor. This algorithm addresses the complexity of such reconstruction due to indeterminate or erroneous homology assignments made by sequence alignment algorithms and to the vast prevalence of divergence through speciation over divergence through gene duplication in protein evolution.

Committee Chair

Kristen M. Naegle

Committee Members

Barak A. Cohen, Justin C. Fay, James J. Havranek, Gary D. Stormo

Comments

Permanent URL: https://doi.org/10.7936/K7NC60M8

Degree

Doctor of Philosophy (PhD)

Author's Department

Biology & Biomedical Sciences (Computational & Systems Biology)

Author's School

Graduate School of Arts and Sciences

Document Type

Dissertation

Date of Award

Summer 8-15-2017

Language

English (en)

DOI

https://doi.org/10.7936/K7NC60M8

Recommended Citation

Sloutsky, Roman, "Robust Algorithms for Detecting Hidden Structure in Biological Data" (2017). Arts & Sciences Graduate Student Theses and Dissertations. 1215.

The definitive version is available at https://doi.org/10.7936/K7NC60M8

Included in

Biology Commons
