Date of Award

Spring 5-15-2023

Author's School

Graduate School of Arts and Sciences

Author's Department

Biology & Biomedical Sciences (Computational & Systems Biology)

Degree Name

Doctor of Philosophy (PhD)

Degree Type



Gene regulation allows for the quantitative control of gene expression. It is a complex process encoded by cis-regulatory sequences, short DNA sequences that contain clusters of transcription factor binding sites. Each binding site can occur millions of times in multicellular genomes, and seemingly similar collections of binding sites can have very different activities. A leading model to explain these degeneracies is that cis-regulatory sequences follow a “grammar” defined by the number, identity, strength, arrangement, and/or context of the underlying binding sites. Understanding cis-regulatory grammar requires high-throughput technology, quantitative measurements, and computational modeling. This thesis describes an iterative machine learning approach to study cis-regulatory grammar using mouse photoreceptors as a model system. First, I characterized sequence features associated with enhancer and silencer activity in sequences bound by the transcription factor CRX. I showed that both enhancers and silencers are highly occupied by CRX compared to inactive sequences, and that enhancers are uniquely enriched for a diverse but degenerate collection of eight motifs. I demonstrated that this information captures a majority of the available signal in genomic sequences and developed an information content metric that summarizes the effects of motif number and diversity. Second, I developed an active machine learning framework that iteratively samples informative perturbations to address the limitations of training quantitative models on genomic sequences alone. I showed that this approach, when complemented with human decision-making, effectively guides machine learning models towards a biologically relevant representation of cis-regulatory grammar. I also showed that perturbations selected with active learning are more informative than other perturbations generated by the same procedure.
The final machine learning model can capture global and local context-dependencies of transcription factor binding motifs. Using this model, I found that the same motifs can produce the same activity in multiple arrangements. Thus, active machine learning is an effective way to sample perturbations that improve quantitative models of cis-regulatory grammar. Collectively, these results provide an iterative framework to design and sample perturbations that reveal the complexities of cis-regulatory grammar underlying gene regulation.
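
The idea of iteratively sampling informative perturbations can be illustrated with a generic active learning loop. The sketch below is not the thesis's implementation: it assumes a query-by-committee strategy (disagreement among models trained on bootstrap resamples), a toy 1-nearest-neighbor predictor, and a hypothetical `toy_oracle` standing in for an experimental activity measurement. The feature tuples (pairs of motif counts) are likewise invented for illustration, and the human decision-making step described in the abstract is not modeled.

```python
import random
import statistics


def predict_1nn(train, x):
    """Predict the activity of x as the label of its nearest training point."""
    nearest = min(
        train,
        key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], x)),
    )
    return nearest[1]


def committee_disagreement(labeled, x, n_models, rng):
    """Variance of predictions from models fit to bootstrap resamples."""
    preds = []
    for _ in range(n_models):
        boot = [rng.choice(labeled) for _ in labeled]  # resample with replacement
        preds.append(predict_1nn(boot, x))
    return statistics.pvariance(preds)


def active_learning_round(labeled, pool, oracle, batch_size, n_models, rng):
    """Measure the candidates the committee disagrees on most."""
    ranked = sorted(
        pool,
        key=lambda x: committee_disagreement(labeled, x, n_models, rng),
        reverse=True,
    )
    chosen, remaining = ranked[:batch_size], ranked[batch_size:]
    labeled = labeled + [(x, oracle(x)) for x in chosen]
    return labeled, remaining, chosen


def toy_oracle(x):
    """Hypothetical stand-in for an experimental activity measurement."""
    return x[0] - 0.5 * x[1]


rng = random.Random(0)
# Assumed toy features: counts of two motifs per candidate sequence.
candidates = [(a, b) for a in range(4) for b in range(4)]
rng.shuffle(candidates)
labeled = [(x, toy_oracle(x)) for x in candidates[:4]]  # initial training set
pool = candidates[4:]                                   # unmeasured candidates

for _ in range(3):
    labeled, pool, chosen = active_learning_round(
        labeled, pool, toy_oracle, batch_size=2, n_models=10, rng=rng
    )
```

Each round moves the two candidates with the highest committee variance from the pool into the training set, so later rounds are scored by models that have already seen the earlier selections. Other acquisition rules, such as predictive entropy or expected model change, would fit the same loop.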


Language

English (en)

Chair and Committee

Barak A. Cohen

Committee Members

Jeremy Buhler, Shiming Chen, Samantha A. Morris, Gary D. Stormo, Michael A. White