Author's School

Graduate School of Arts & Sciences

Author's Department/Program

Biology and Biomedical Sciences: Computational and Systems Biology


English (en)

Date of Award


Degree Type


Degree Name

Doctor of Philosophy (PhD)

Chair and Committee

Gary Stormo


Organisms must control their gene expression to respond properly to developmental, stress, or other environmental cues. A key part of this process is transcriptional regulation, which is largely accomplished by a complex network of transcription factor proteins (TFs) interacting with their specific binding sites in the genome. Understanding how TFs select the correct binding sites out of the vast number of potential binding sites in the genome is a key challenge in molecular biology. Recently, an unprecedented amount of quantitative binding data has become available as a result of developments in high-throughput experimental techniques. However, interpretation of high-throughput binding data has proved to be controversial, largely due to the lack of physically principled data analysis methods. An important question in the analysis of binding data is the complexity of the specificity model needed. This has important implications both for the characterization of specificity and for the prediction of the consequences of mutations. Structurally, TF-DNA interactions are complex, with a wide variety of interactions between the protein and DNA making a simple recognition code impossible. Energetically, however, the situation may be much simpler. Detailed studies of a handful of TFs have shown that individual base pairs often contribute independently to the total binding energy. This view of simplicity has been challenged by data from high-throughput binding experiments, although the extent to which the simple model breaks down is uncertain due to the lack of rigorous analysis methods. The goal of this thesis is to assess the complexity of the model required to accurately represent TF specificity. To this end, I have developed a new statistical analysis method (BEEML: Binding Energy Estimation by Maximum Likelihood) that parameterizes models of TF specificity from high-throughput quantitative binding data, using a realistic biophysical model.
Employing the BEEML method, I show that the energetics of most TF-DNA interactions are simple, with bases in the binding site contributing approximately independently to the total binding energy. Further, I show that interactions in the binding site occur mostly between adjacent positions.
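
To make the idea of independent (additive) base contributions concrete, here is a minimal sketch in Python. It is not the BEEML implementation. The energy values, the three-position site length, the function names, and the logistic binding form with a chemical potential `mu` are all assumptions chosen for illustration, since the abstract does not specify the form of the biophysical model. Energies are taken to be in units of kT, with 0 assigned to the preferred base at each position.

```python
import math

# Hypothetical energy matrix (illustrative values only, in units of kT).
# One entry per binding-site position; 0.0 marks the preferred base.
ENERGY_MATRIX = [
    {"A": 0.0, "C": 1.5, "G": 2.0, "T": 1.0},
    {"A": 2.0, "C": 0.0, "G": 1.0, "T": 2.5},
    {"A": 1.0, "C": 2.0, "G": 0.0, "T": 1.5},
]


def binding_energy(site, matrix=ENERGY_MATRIX):
    """Total energy under the additive model: the sum of the
    independent contribution of the base at each position."""
    if len(site) != len(matrix):
        raise ValueError("site length must match the number of matrix positions")
    return sum(row[base] for row, base in zip(matrix, site))


def binding_probability(site, mu, matrix=ENERGY_MATRIX):
    """Probability that the site is bound, assuming a two-state
    (bound/unbound) model with chemical potential mu (an assumption,
    not taken from the abstract)."""
    return 1.0 / (1.0 + math.exp(binding_energy(site, matrix) - mu))


if __name__ == "__main__":
    for site in ("ACG", "TCG", "TAG"):
        print(site, binding_energy(site), binding_probability(site, mu=0.0))
```

Under this sketch, the energy cost of a double mutation equals the sum of the costs of the two single mutations. A model with interactions between adjacent positions would add terms that depend on pairs of neighboring bases, which this example deliberately omits.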



Permanent URL: