Abstract

This thesis addresses the need for fair evaluation of language models' problem-solving abilities by presenting a unified evaluation framework for ChatGPT on 16 problem-solving datasets (e.g., NaturalQA, HellaSwag, and MMLU). We evaluate the model's performance using F1, exact match, and quasi-exact match metrics and find that ChatGPT is highly accurate on tasks that require commonsense and knowledge. However, we also identify truncated text bias and few-shot scenarios as challenges that may affect ChatGPT's performance. Our research highlights the importance of standardizing datasets and developing a unified evaluation system for the fair evaluation of language models. Overall, our contributions are a unified evaluation framework, the identification of these performance challenges, and insights into the importance of dataset standardization.
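For readers unfamiliar with the three metrics named above, the following is a minimal sketch of how they are commonly computed for short-answer tasks. It is not taken from the thesis: the function names are hypothetical, and the normalization (lowercasing, removing punctuation and English articles, collapsing whitespace) is an assumed SQuAD-style convention that may differ from the one the author used.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Assumed normalization: lowercase, drop punctuation and articles, collapse spaces."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 only if the strings are identical after trimming outer whitespace."""
    return float(prediction.strip() == reference.strip())


def quasi_exact_match(prediction: str, reference: str) -> float:
    """1.0 if the strings are identical after the assumed normalization."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Token-level F1 between the normalized prediction and reference."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Under these assumptions, a prediction such as "The Paris." fails exact match against the reference "paris" but passes quasi-exact match, while F1 gives partial credit to longer answers that contain the reference tokens.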

Committee Chair

Chenguang Wang, Computer Science & Engineering

Committee Members

Chien-Ju Ho, William Yeoh

Degree

Master of Science (MS)

Author's Department

Computer Science & Engineering

Author's School

McKelvey School of Engineering

Document Type

Thesis

Date of Award

Spring 5-2023

Language

English (en)

Author's ORCID

https://orcid.org/0009-0007-0625-0499
