Date of Award

Spring 5-2023

Author's School

McKelvey School of Engineering

Author's Department

Computer Science & Engineering

Degree Name

Master of Science (MS)

Degree Type



Abstract

This thesis addresses the need for a fair evaluation of language models' problem-solving abilities by presenting a unified evaluation framework for ChatGPT on 16 problem-solving datasets (e.g., NaturalQA, HellaSwag, and MMLU). We evaluate the model's performance using F1, exact-match, and quasi-exact-match metrics and find that ChatGPT is highly accurate on tasks that require commonsense and world knowledge. However, we also identify truncated-text bias and few-shot scenarios as challenges that can degrade ChatGPT's performance. Our contributions include the development of a unified evaluation framework, the identification of these performance challenges, and insights into the importance of dataset standardization for the fair evaluation of language models.
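The three metrics named in the abstract can be sketched as follows. This is a generic, assumed implementation using SQuAD/HELM-style answer normalization (lowercasing, stripping punctuation and articles); the exact normalization rules used in the thesis may differ:

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, and collapse whitespace.

    This is a common (SQuAD-style) normalization; benchmarks vary in details.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 iff the raw prediction string equals the reference string."""
    return float(prediction == reference)


def quasi_exact_match(prediction: str, reference: str) -> float:
    """1.0 iff prediction and reference match after normalization."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall over normalized tokens."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # both empty -> perfect match; exactly one empty -> no match
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Quasi-exact match is more forgiving than exact match: a prediction like "The Paris." still scores 1.0 against the reference "paris" after normalization, while token-level F1 gives partial credit for span answers that overlap the reference only in part.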


Language

English (en)


Advisor

Chenguang Wang, Computer Science & Engineering

Committee Members

Chien-Ju Ho, William Yeoh