Author's School

Graduate School of Arts & Sciences

Author's Department/Program

Biology and Biomedical Sciences: Computational and Systems Biology


Language

English (en)

Date of Award


Degree Type


Degree Name

Doctor of Philosophy (PhD)

Chair and Committee

Gary Stormo


Transcription factors (TFs) play a central role in the gene regulatory network of each cell. They can stimulate or inhibit transcription of their target genes by binding to short, degenerate DNA sequence motifs. The goal of this research is to build improved models of TF binding site recognition. Such models can facilitate the determination of regulatory networks and also allow binding site motifs to be predicted from the TF protein sequence alone. Recent technological advances have rapidly expanded the amount of quantitative TF binding data available. Protein binding microarrays (PBMs) have recently been implemented in a format that allows all 10-mers to be assayed in parallel, and PBM data are now available for hundreds of transcription factors. Another fairly recent technique for determining the binding preference of a TF is the in vivo bacterial one-hybrid (B1H) assay. In this approach, a TF is expressed in E. coli, where it is used to select strong binding sites from a library of randomized sites located upstream of a weak promoter that drives expression of a selectable gene. When coupled with high-throughput sequencing and a newly developed analysis method, this assay yields quantitative binding data. In the last few years, the binding specificities of hundreds of TFs have been determined using B1H. The two largest eukaryotic transcription factor families are the zf-C2H2 and homeodomain families. Newly available PBM and B1H specificity models were used to develop recognition models for these two families, with the goal of predicting the binding specificity of a TF from its protein sequence. We developed a feature selection method based on adjusted mutual information that automatically recovers nearly all of the known key residues for the homeodomain and zf-C2H2 families. Using those features, we find that, for both families, random forest (RF) and support vector machine (SVM) based recognition models outperform the nearest neighbor method, which has previously been considered the best method.
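
To make the feature selection idea concrete, the following is a minimal sketch of ranking alignment positions by adjusted mutual information (AMI) with a binding-specificity label. It is not the dissertation's implementation: the abstract does not give the exact definition used, so the sketch assumes the common chance-corrected form, (MI - E[MI]) / (mean entropy - E[MI]), with the expectation taken under random permutation of the labels. The function names, the toy aligned sequences, and the class labels are all hypothetical.

```python
from collections import Counter
from math import exp, lgamma, log


def entropy(labels):
    """Shannon entropy (in nats) of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(xs, ys):
    """Mutual information (in nats) between two equally long label sequences."""
    n = len(xs)
    cx, cy, cxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((nij / n) * log(n * nij / (cx[x] * cy[y]))
               for (x, y), nij in cxy.items())


def expected_mutual_information(xs, ys):
    """Expected MI when the pairing of xs and ys is random (hypergeometric model)."""
    n = len(xs)
    total = 0.0
    for a in Counter(xs).values():
        for b in Counter(ys).values():
            for nij in range(max(1, a + b - n), min(a, b) + 1):
                # Log-probability of observing nij co-occurrences by chance.
                log_p = (lgamma(a + 1) + lgamma(b + 1)
                         + lgamma(n - a + 1) + lgamma(n - b + 1)
                         - lgamma(n + 1) - lgamma(nij + 1)
                         - lgamma(a - nij + 1) - lgamma(b - nij + 1)
                         - lgamma(n - a - b + nij + 1))
                total += (nij / n) * log(n * nij / (a * b)) * exp(log_p)
    return total


def adjusted_mutual_information(xs, ys):
    """Chance-corrected MI; 1.0 for a perfect match, about 0 for chance level."""
    mi = mutual_information(xs, ys)
    emi = expected_mutual_information(xs, ys)
    denom = 0.5 * (entropy(xs) + entropy(ys)) - emi
    return (mi - emi) / denom if abs(denom) > 1e-12 else 0.0


def rank_positions(aligned_seqs, labels):
    """Rank alignment columns (0-based) by AMI with the labels, best first."""
    scores = [(pos, adjusted_mutual_information(list(col), labels))
              for pos, col in enumerate(zip(*aligned_seqs))]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical toy data: six aligned protein fragments and a made-up
# binding-specificity class for each one.
seqs = ["AKQR", "AKNR", "GKQR", "GKNR", "AKQR", "GKNR"]
labels = ["class1", "class2", "class1", "class2", "class1", "class2"]

ranked = rank_positions(seqs, labels)
for position, score in ranked:
    print(position, round(score, 3))
```

In this toy alignment the third column (position 2) tracks the class label exactly, so it ranks first, while the invariant columns score zero. The adjustment matters because raw mutual information tends to be inflated for positions with many distinct residues and few sequences; subtracting the chance expectation keeps such positions from being selected spuriously.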


Permanent URL: