Date of Award
12-17-2024
Degree Name
Doctor of Philosophy (PhD)
Degree Type
Dissertation
Abstract
Accurate extraction of clinical entities and phenotypes from unstructured electronic health record (EHR) text is crucial for many clinical research tasks, including cohort identification, tracking temporal patterns in disease progression, and determining treatment course. However, this task remains challenging due to the complexity and ambiguity of medical language. This dissertation explores the application of advanced large language models (LLMs), including generative pre-trained transformer (GPT) models such as GPT-4 and GPT-3.5-turbo as well as Llama-3.1, Llama-3, and Flan-T5, for clinical entity and phenotype extraction from EHRs. Building upon these findings, this dissertation also investigates a hybrid approach in which external knowledge sources, such as the Unified Medical Language System (UMLS), are integrated with LLMs to improve the quality of language model outputs. Through extensive experiments and evaluation, we find that extraction improvements are possible with LLMs and knowledge base integration. We also find that LLM hyperparameters, such as temperature, and prompt variations significantly impact consistency and accuracy, with lower temperature settings yielding more stable outputs but not necessarily higher accuracy. Additionally, variations in clinical text and model configurations reveal a trade-off between consistency and performance, suggesting that careful tuning is essential for balancing reliable results with clinical accuracy. These findings have implications for facilitating faster and more accurate cohort identification, supporting clinical decision-making, identifying temporal patterns in disease progression, and ultimately enabling more effective utilization of EHR data for clinical informatics research.
Language
English (en)
Chair
Albert Lai
Committee Members
Chenyang Lu; Fuhai Li; Philip Payne; Roger Chamberlain