Abstract
This thesis addresses the need for a fair evaluation of language models' problem-solving abilities by presenting a unified evaluation framework for ChatGPT on 16 problem-solving datasets (e.g., NaturalQA, HellaSwag, and MMLU). We evaluate the model's performance using F1, exact match, and quasi-exact match metrics and find that ChatGPT is highly accurate on tasks that require commonsense reasoning and knowledge. However, we also identify truncated-text bias and few-shot scenarios as challenges that can degrade ChatGPT's performance. Our findings underscore the importance of standardizing datasets and building a unified evaluation system so that language models can be compared fairly. Overall, our contributions are a unified evaluation framework, the identification of these performance challenges, and insights into dataset standardization for the fair evaluation of language models.
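For readers unfamiliar with the three metrics named above, the following minimal Python sketch shows one common way to compute them. It is illustrative only, not the thesis's implementation: the normalization steps and the substring relaxation used for quasi-exact match are assumptions chosen for clarity.

    import re
    import string
    from collections import Counter

    def normalize(text):
        """Lowercase, drop punctuation and articles, collapse whitespace."""
        text = text.lower()
        text = "".join(ch for ch in text if ch not in string.punctuation)
        text = re.sub(r"\b(a|an|the)\b", " ", text)
        return " ".join(text.split())

    def exact_match(prediction, reference):
        """1.0 if the normalized strings are identical, else 0.0."""
        return float(normalize(prediction) == normalize(reference))

    def quasi_exact_match(prediction, reference):
        """Credit the prediction if the normalized reference appears inside it,
        tolerating extra surrounding text (one common 'quasi' relaxation)."""
        return float(normalize(reference) in normalize(prediction))

    def token_f1(prediction, reference):
        """Token-level F1 overlap between prediction and reference."""
        pred_tokens = normalize(prediction).split()
        ref_tokens = normalize(reference).split()
        common = Counter(pred_tokens) & Counter(ref_tokens)
        overlap = sum(common.values())
        if overlap == 0:
            return 0.0
        precision = overlap / len(pred_tokens)
        recall = overlap / len(ref_tokens)
        return 2 * precision * recall / (precision + recall)

    print(exact_match("The Eiffel Tower", "eiffel tower"))               # 1.0
    print(quasi_exact_match("It is the Eiffel Tower.", "Eiffel Tower"))  # 1.0
    print(token_f1("Eiffel Tower in Paris", "the Eiffel Tower"))         # ~0.667

Exact match rewards only fully correct answers, quasi-exact match forgives extra surrounding text in a model's free-form response, and token-level F1 gives partial credit for overlapping content; together they cover the spectrum from strict to lenient scoring of generated answers.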
Committee Chair
Chenguang Wang, Computer Science & Engineering
Committee Members
Chien-Ju Ho, William Yeoh
Degree
Master of Science (MS)
Author's Department
Computer Science & Engineering
Document Type
Thesis
Date of Award
Spring 5-2023
Language
English (en)
DOI
https://doi.org/10.7936/7vz0-dr08
Author's ORCID
https://orcid.org/0009-0007-0625-0499
Recommended Citation
Zeng, Fankun, "Evaluating the Problem Solving Abilities of ChatGPT" (2023). McKelvey School of Engineering Theses & Dissertations. 849.
The definitive version is available at https://doi.org/10.7936/7vz0-dr08