Author's School

Graduate School of Arts & Sciences

Author's Department/Program

Biology and Biomedical Sciences: Computational and Systems Biology


English (en)

Date of Award


Degree Type


Degree Name

Doctor of Philosophy (PhD)

Chair and Committee

Gary Stormo


Computational techniques for microbial genomic sequence analysis are becoming increasingly important. With next-generation sequencing technology and the Human Microbiome Project underway, current sequencing capacity far outpaces the speed at which organisms of interest can be experimentally probed. We have developed a method that relies primarily on available sequence data to determine prokaryotic transcription factor binding specificities. The prototypical prokaryotic transcription factor (TF) contains a helix-turn-helix (HTH) fold and binds DNA as a homodimer, which gives rise to a palindromic motif specificity. The connection between a TF and its promoter is based on the autoregulation phenomenon observed in E. coli. Approximately 55% of the TFs analyzed were estimated to be autoregulated. Our preliminary analysis using RegulonDB indicates that this value increases to 79% if one also considers the neighboring operons. Given a TF family of interest, the first step is to find the relevant TF proteins and their associated genomes. Because of the scale-free network topology of prokaryotic systems, many transcriptional regulators regulate only one or a few operons, so a single genome does not contain enough sequence-based signal to determine the binding site using standard computational methods. Multiple bacterial genomes are therefore used to overcome this lack of signal. We use a distance-based criterion to define the operon boundaries and their respective promoters. Several TF–DNA crystal structures are then used to determine the residues that interact with the DNA. These key residues are the basis for the TF comparison metric, the assumption being that similar residues should impart similar DNA binding specificities. After the TF clusters are defined using this metric, their respective promoters are used as input to a motif-finding procedure.
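As a rough illustration of what a distance-based operon criterion might look like, the sketch below groups adjacent same-strand genes whose intergenic gap falls under a threshold, then takes a fixed window upstream of the first transcribed gene as the promoter. The names `Gene`, `group_into_operons`, and `promoter_region`, along with the 50 bp gap and 300 bp window, are assumptions chosen for the example; the abstract does not state the thresholds actually used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    name: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def group_into_operons(genes, max_gap=50):
    """Group adjacent same-strand genes separated by at most max_gap bp.

    max_gap is an illustrative value, not one taken from the thesis.
    """
    operons = []
    for gene in sorted(genes, key=lambda g: g.start):
        if operons:
            prev = operons[-1][-1]
            gap = gene.start - prev.end - 1  # negative when genes overlap
            if gene.strand == prev.strand and gap <= max_gap:
                operons[-1].append(gene)
                continue
        operons.append([gene])
    return operons


def promoter_region(operon, upstream=300):
    """Return (start, end) of the window upstream of the first transcribed gene.

    upstream is an assumed window size. Coordinates stay on the forward
    strand; for "-" operons the window lies past the rightmost gene.
    """
    if operon[0].strand == "+":
        first = operon[0]
        return (max(1, first.start - upstream), first.start - 1)
    last = operon[-1]
    return (last.end + 1, last.end + upstream)


genes = [
    Gene("a", 100, 400, "+"),
    Gene("b", 420, 900, "+"),
    Gene("c", 1200, 1500, "+"),
    Gene("d", 1520, 1800, "-"),
]
operons = group_into_operons(genes)
operon_names = [[g.name for g in op] for op in operons]
```

Here genes `a` and `b` fall into one operon, while `c` (large gap) and `d` (opposite strand) each start a new one. A real pipeline would also need to handle circular chromosomes and genome ends, which this sketch ignores.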
The method has so far been tested on the LacI and TetR TF families with successful results. On external validation sets, the specificity of prediction is ∼80%. These results are important for developing methods to define the DNA binding preferences of the TF protein residues, known as the “recognition code”. Such a recognition code would allow computational design and prediction of novel DNA-binding specificities, enabling protein engineering and synthetic biology applications.
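The idea of comparing TFs by their DNA-contacting residues can be sketched as follows. This is a minimal example under stated assumptions: the TF sequences are taken to be already aligned, the contact positions are supplied as hypothetical 0-based alignment columns, and plain fractional identity with a greedy grouping stands in for the comparison metric and clustering procedure, neither of which the abstract specifies. The function names and toy sequences are invented for illustration.

```python
def key_residues(aligned_seq, positions):
    """Extract the residues at the given 0-based alignment columns."""
    return "".join(aligned_seq[i] for i in positions)


def residue_identity(a, b):
    """Fraction of positions at which two equal-length residue strings agree."""
    if len(a) != len(b) or not a:
        raise ValueError("residue strings must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_by_key_residues(tfs, positions, min_identity=1.0):
    """Greedily group TFs whose key residues are sufficiently similar.

    tfs maps a TF name to its aligned sequence. A TF joins the first
    cluster containing a member at or above min_identity; otherwise it
    starts a new cluster. Both the identity measure and the threshold
    are illustrative assumptions.
    """
    clusters = []  # each cluster is a list of (name, key_residue_string)
    for name, seq in tfs.items():
        residues = key_residues(seq, positions)
        for cluster in clusters:
            if any(residue_identity(residues, r) >= min_identity for _, r in cluster):
                cluster.append((name, residues))
                break
        else:
            clusters.append([(name, residues)])
    return [[name for name, _ in cluster] for cluster in clusters]


# Toy aligned sequences; columns 1, 3 and 5 play the role of DNA contacts.
toy_tfs = {
    "tfA": "MKAYSQ",
    "tfB": "LKGYTQ",
    "tfC": "MRAYSN",
}
contact_columns = [1, 3, 5]
clusters = cluster_by_key_residues(toy_tfs, contact_columns)
```

In this toy input, `tfA` and `tfB` share the same residues at the contact columns and end up together, while `tfC` forms its own cluster. The promoters belonging to each cluster would then be pooled as input to motif finding.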



Permanent URL: