Date of Award
4-23-2025
Degree Name
Doctor of Philosophy (PhD)
Degree Type
Dissertation
Abstract
Text has become an essential source of data in political science research. Scholars increasingly rely on textual materials—such as legislative speeches, party manifestos, and social media posts—to study the two-way flow of information between elites and the public. Despite this growing reliance, commonly used text-as-data methods often struggle to capture the semantic richness and contextual nuances of political language. Many of these approaches depend heavily on surface-level features like word frequencies and co-occurrence patterns, which constrain their ability to uncover deeper meanings or rhetorical strategies employed by political actors. Meanwhile, advances in computer science have significantly transformed the landscape of natural language processing. A key development was the introduction of pre-trained language models such as BERT, which capture contextual meaning by training on large corpora and fine-tuning for downstream tasks. These models marked a substantial improvement over earlier bag-of-words text representation approaches by enabling more accurate and contextual understanding of texts. Building on this foundation, researchers have continued to scale up model architectures and training data, leading to the emergence of large language models (LLMs) and generative artificial intelligence (AI). These models exhibit extraordinary generalization capabilities, allowing users to perform complex language tasks through plain-language instructions and zero- or few-shot prompt engineering, without requiring extensive labeled data or advanced programming skills. In response, computational social scientists have begun to integrate these new tools into political research. As the scope of available corpora and analytical tasks has expanded, so too have the methodological innovations. 
Recent work leverages deep neural architectures to extract meaning, builds domain-specific classifiers to label latent features, and applies semantic models to analyze political narratives. This dissertation contributes to this growing body of work by demonstrating how cutting-edge LLMs and AI methods can overcome the limitations of traditional approaches in measuring text similarity and conducting automated frame analysis. Both are important for addressing questions in areas such as political communication, legislative politics, and democratic representation. I demonstrate the effectiveness of the proposed methods by applying them to measure media and policy framing, rhetorical polarization, and party message discipline using diverse corpora, spanning from short, informal texts to long, formal documents. This dissertation not only highlights the methodological challenges posed by different types of corpora and analytical tasks, but also develops use cases that illustrate how LLM-based approaches can offer solutions and enable more accurate and nuanced measurement of core political concepts. In Chapter 1, I argue that the methods most commonly used in political science struggle to identify when two texts convey the same meaning because they rely too heavily on words that appear in both documents. This issue is especially salient when the underlying documents are short, an increasingly prevalent form of textual data in modern political research. To address this limitation, I introduce a class of transformer models, cross-encoders, which encode each pair of texts jointly, considering the context of both snippets to produce better estimates of semantic similarity for short texts such as news headlines and Facebook posts. I illustrate this model with three examples from American politics.
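The word-overlap problem described above can be made concrete with a toy example (the headlines and the Jaccard measure here are illustrative, not drawn from the dissertation): two headlines reporting the same ruling can share no vocabulary at all, so any overlap-based similarity scores them as entirely dissimilar, whereas a cross-encoder reads both texts jointly and can recover their shared meaning.

```python
def jaccard(a: str, b: str) -> float:
    """Word-overlap (Jaccard) similarity: |A ∩ B| / |A ∪ B|."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

# Two headlines describing the same decision with disjoint vocabularies.
h1 = "justices strike down statute"
h2 = "supreme court invalidates law"
print(jaccard(h1, h2))  # 0.0 despite near-identical meaning
```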
First, I apply an off-the-shelf pretrained cross-encoder to measure the similarity between social messages written by experimental subjects and the original Reuters article about US economic performance that they read in a "telephone game" experiment conducted by Carlson (2019), showing that cross-encoder estimates of information distortion better capture the amount of partisan bias contained in social messages. In the second application, which studies competing media framing of the US Supreme Court (SCOTUS), I train a customized cross-encoder model with manually labeled pairs of news headlines to predict the heterogeneity of media coverage of case decisions. The cross-encoder not only outperforms a wide range of word-based and sentence-embedding approaches, but also uncovers empirical patterns that would otherwise be missed: cases with published dissents receive more diverse coverage than unanimous decisions. The last example presents a more challenging task, in which I apply the cross-encoder and other models to measure the similarity of social media posts from US senators of the same and opposing parties that a topic model has already identified as addressing the same policy issue. Only the cross-encoder yields conclusions predicted by established theories, which hold that elite polarization is more intense in domestic policy than in international affairs. Chapter 2 examines the relationship between institutional power and message discipline using a text similarity approach, contributing to our understanding of message politics in the US Congress. In particular, this chapter implements keyword-assisted topic models for policy agenda coding and OpenAI's embedding model, which can accommodate the extensive context windows of long documents, to generate high-quality text representations for each speech in the Congressional Record between 1973 and 2016.
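The speech-pair similarity computation that follows from these embeddings can be sketched as below, using toy three-dimensional vectors in place of the OpenAI embeddings the chapter actually uses (the function name and vectors are illustrative assumptions): within one topic-day-party cell, average the pairwise cosine similarity between leader and non-leader speeches.

```python
from itertools import product

import numpy as np


def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def rhetorical_unity(leader_vecs, member_vecs) -> float:
    """Mean pairwise cosine similarity across leader/non-leader speech pairs."""
    sims = [cosine(u, v) for u, v in product(leader_vecs, member_vecs)]
    return float(np.mean(sims))


# Toy "embeddings" for one topic, day, and party.
leaders = [np.array([0.9, 0.1, 0.0])]
members = [np.array([0.8, 0.2, 0.1]), np.array([0.1, 0.9, 0.2])]
print(round(rhetorical_unity(leaders, members), 3))
```

Higher values indicate that rank-and-file speeches track the leadership's rhetoric more closely on that topic and day.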
This approach allows me to estimate the semantic similarity of speech pairs within the same topic, day, and party, serving as a measure of rhetorical unity between leaders and non-leaders. Parties build electorally beneficial brands by staying "on message." But when can congressional parties exercise message discipline, who contributes, and how do constituents respond? Building on theories of congressional party discipline, we test a set of competing hypotheses: that institutional power could help or hinder messaging, that the Republican coalition is ideologically homogeneous, and that safe-seat members face fewer constituency concerns. The results show that, in general, institutional power weakens message discipline. However, House Republicans leverage procedural power to offset this disadvantage. Additionally, behavioral evidence suggests that message discipline increases copartisan voters' approval of their representatives. These results contribute to the literature on message politics and have implications for legislator orientation and thermostatic backlash. Chapter 3 proposes a novel framework for computational frame analysis, a task central to studies in political communication. Existing methods struggle to identify emerging frames from evolving discourse efficiently. Supervised approaches are labor-intensive and time-consuming. Unsupervised approaches such as topic models and dictionaries rely on clusters of keywords that lack the semantic context needed to capture nuanced framings. Leveraging the extraordinary capability of LLMs in information extraction and summarization, this chapter presents a new chain-of-thought prompting method that follows three steps (quote, summarize, and synthesize) to gradually condense concrete textual information (e.g., portions of text quoted from the original documents) into abstract concepts (e.g., frame labels).
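The quote-summarize-synthesize idea can be sketched as a single prompt template; the wording below is an illustrative paraphrase of the three steps, not the dissertation's actual prompt, and the function name is hypothetical.

```python
def frame_analysis_prompt(article: str) -> str:
    """Build a chain-of-thought prompt that moves from concrete quotes
    to abstract frame labels in three steps."""
    steps = [
        "1. Quote: extract the passages that carry the article's main argument.",
        "2. Summarize: restate each quoted passage in one neutral sentence.",
        "3. Synthesize: assign a short frame label that abstracts the summaries.",
    ]
    return (
        "You are analyzing how a news article frames a policy debate.\n"
        + "\n".join(steps)
        + "\n\nArticle:\n" + article
    )


prompt = frame_analysis_prompt("State lawmakers debated a smoking ban in restaurants.")
print(prompt)
```

Each step grounds the next: the model must quote before it summarizes and summarize before it labels, which constrains the final frame label to evidence actually present in the article.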
To discover substantively meaningful frames, I develop a human-AI interaction algorithm to merge similar labels and identify categories. Specifically, I apply GPT-4o to extract emphasis frames in news articles collected by Gilardi, Shipan and Wüest (2021) concerning the adoption of smoking restrictions in public areas across US states. Crowdsourced validation shows that, compared to topic modeling, the proposed method produces semantically distinctive and coherent text features for identifying subtly different frames that better align with human interpretation. Moreover, these features reveal frame correlation patterns consistent with established theories, suggesting that, as policies diffuse, ideological debates about policy adoption gradually shift toward more technical discussions of policy implementation. Future research can continue to explore the application of LLM-based frame analysis to other policy domains and media narratives. As generative AI continues to advance, integrating such methods into social science research has the potential to significantly improve our capacity to capture the framing of key issues in rapidly changing political discourse.
Language
English (en)
Chair and Committee
Jacob Montgomery
Committee Members
Betsy Sinclair; Christopher Lucas; Ju Yeon Park; Ted Enamorado
Recommended Citation
Lin, Gechun, "Three Essays on Methodological Advancements of Large Language Models in Political Science Research" (2025). Arts & Sciences Theses and Dissertations. 3564.
https://openscholarship.wustl.edu/art_sci_etds/3564