Abstract
Accurate extraction of clinical entities and phenotypes from unstructured electronic health record (EHR) text is crucial for various clinical research tasks, including cohort identification, tracking temporal patterns in disease progression, and deciding on a course of treatment. However, this task remains challenging due to the complexity and ambiguity of medical language. This dissertation explores the application of advanced generative pre-trained transformer (GPT) models, such as GPT-4, GPT-3.5-turbo, Llama-3.1, Llama-3, and Flan-T5, for clinical entity and phenotype extraction from EHRs. Building on this exploration, the dissertation also investigates a hybrid approach that integrates external knowledge sources, such as the Unified Medical Language System (UMLS), with large language models (LLMs) to improve the quality of the language model outputs. Through extensive experiments and evaluation, we find that LLMs and knowledge base integration can improve extraction. We also find that LLM hyperparameters, such as temperature, and prompt variations significantly affect consistency and accuracy, with lower temperature settings yielding more stable outputs but not necessarily higher accuracy. Additionally, variations in clinical text and model configurations reveal a trade-off between consistency and performance, suggesting that careful tuning is essential for balancing reliable results with clinical accuracy. These findings have implications for facilitating faster and more accurate cohort identification, supporting clinical decision-making, identifying temporal patterns in disease progression, and ultimately enabling more effective use of EHR data for clinical informatics research.
Committee Chair
Albert Lai
Committee Members
Chenyang Lu; Fuhai Li; Philip Payne; Roger Chamberlain
Degree
Doctor of Philosophy (PhD)
Author's Department
Computer Science & Engineering
Document Type
Dissertation
Date of Award
12-17-2024
Language
English (en)
DOI
https://doi.org/10.7936/8gm0-5x80
Recommended Citation
BHATTARAI, KRITI, "Improving Clinical Information Extraction from Electronic Health Records: Leveraging Large Language Models and Evaluating Their Outputs" (2024). McKelvey School of Engineering Theses & Dissertations. 1131.
The definitive version is available at https://doi.org/10.7936/8gm0-5x80