ORCID
https://orcid.org/0009-0007-0625-0499
Date of Award
Spring 5-2023
Degree Name
Master of Science (MS)
Degree Type
Thesis
Abstract
This thesis addresses the need for fair evaluation of language models' problem-solving abilities by presenting a unified evaluation framework for ChatGPT on 16 problem-solving datasets (e.g., NaturalQA, HellaSwag, and MMLU). We evaluate the model's performance using F1, exact match, and quasi-exact match metrics and find that ChatGPT is highly accurate on tasks that require commonsense reasoning and knowledge. However, we also identify truncated-text bias and few-shot scenarios as challenges that may degrade ChatGPT's performance. Our research highlights the importance of standardizing datasets and developing a unified evaluation system so that language models can be compared fairly. Overall, our contributions include a unified evaluation framework, the identification of these performance challenges, and insights into the value of dataset standardization.
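The abstract names three answer-scoring metrics: F1, exact match, and quasi-exact match. The Python sketch below illustrates one common way these are defined for question-answering outputs, using SQuAD-style normalization (lowercasing, stripping punctuation and articles); the exact normalization rules used in the thesis may differ, so treat this as an assumption-laden illustration rather than the thesis's implementation.

```python
from collections import Counter
import re
import string

def normalize(text: str) -> str:
    """Lowercase and strip punctuation, articles, and extra whitespace
    (SQuAD-style normalization; the thesis's rules may differ)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, reference: str) -> bool:
    """Strict string equality on the raw model output and reference."""
    return prediction == reference

def quasi_exact_match(prediction: str, reference: str) -> bool:
    """Equality after light normalization, tolerating case and
    punctuation differences that exact match would penalize."""
    return normalize(prediction) == normalize(reference)

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between the normalized prediction and reference."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# A prediction can fail exact match yet pass quasi-exact match:
print(exact_match("The Eiffel Tower", "eiffel tower"))        # False
print(quasi_exact_match("The Eiffel Tower", "eiffel tower"))  # True
print(round(token_f1("the eiffel tower in paris", "eiffel tower"), 2))  # 0.67
```

Quasi-exact match is useful precisely because chat-style models often wrap a correct answer in extra wording or casing; F1 further relaxes the comparison to partial token overlap.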
Language
English (en)
Chair
Chenguang Wang, Computer Science & Engineering
Committee Members
Chien-Ju Ho, William Yeoh