Abstract

Intrinsically disordered proteins and regions (IDRs) lack a stable three-dimensional structure under physiological conditions. Instead, IDRs are better described as conformational ensembles in which the protein rapidly interconverts between many distinct structural states. Despite lacking a well-defined reference fold, IDRs are ubiquitous across the Tree of Life and play essential roles in virtually every biological process, including gene regulation, molecular recognition, and signal transduction. The absence of a fixed fold, however, makes IDRs difficult to interpret and challenging to engineer. Traditional approaches that rely on tertiary structure or evolutionary conservation are poorly suited to handle the complexity of IDRs. My work advances two complementary paradigms for understanding and designing IDRs. In the first paradigm, sequence → ensemble, I interpret IDR sequences through a molecular biophysical lens: observables derived from the statistics of disordered conformational ensembles are used to generate mechanistic hypotheses about IDR function and to guide design. To make this possible at scale, I develop high-throughput sequence-to-ensemble predictors that allow disordered conformational landscapes to be navigated directly from sequence. These models are implemented in robust, user-friendly software, making quantitative ensemble-based analysis accessible across large protein sets, not just individual case studies. In the second paradigm, sequence → function, I develop disorder-specific deep learning models to infer functional sequence constraints directly from the amino acid sequence. Instead of relying on biophysical models, this approach leverages generative modeling and learned representations to design disordered regions.
Building on recent advances in natural language processing, I introduce a diffusion-based protein language model tailored to intrinsically disordered regions that learns IDR-specific sequence representations and can condition on adjacent folded domains when present. This allows the model to capture how local sequence context constrains disordered regions, enabling the context-aware design of disordered protein sequences. A defining feature throughout this body of work is its high-throughput, software-first implementation. I design and implement robust, scalable tools that make these models easy to deploy, integrate, and extend within diverse protein bioinformatics and protein design workflows for both computational and experimental researchers. Collectively, these methods and tools are intended to enable a broad community of researchers to systematically probe, predict, and engineer intrinsically disordered proteins and protein regions.

Committee Chair

Alex Holehouse

Committee Members

Andrea Soranno; Eric Galbert; Joshua Rackers; Michael Brent; Roman Garnett

Degree

Doctor of Philosophy (PhD)

Author's Department

Biology & Biomedical Sciences (Computational & Systems Biology)

Author's School

Graduate School of Arts and Sciences

Document Type

Dissertation

Date of Award

4-28-2026

Language

English (en)

Author's ORCID

https://orcid.org/0000-0002-5022-7006

Available for download on Thursday, April 27, 2028

Included in

Biophysics Commons
