ORCID

https://orcid.org/0009-0007-0625-0499

Date of Award

Spring 5-2023

Author's School

McKelvey School of Engineering

Author's Department

Computer Science & Engineering

Degree Name

Master of Science (MS)

Degree Type

Thesis

Abstract

This thesis addresses the need for a fair evaluation of language models' problem-solving abilities by presenting a unified evaluation framework for ChatGPT on 16 problem-solving datasets (e.g., NaturalQA, HellaSwag, and MMLU). We evaluate the model's performance using F1, exact match, and quasi-exact match metrics and find that ChatGPT is highly accurate on tasks that require commonsense and knowledge. However, we also identify truncated-text bias and few-shot scenarios as challenges that may impact ChatGPT's performance. Our research highlights the importance of standardizing datasets and developing a unified evaluation system for the fair evaluation of language models. Overall, our contributions include a unified evaluation framework, the identification of these performance challenges, and insights into the importance of dataset standardization.
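
To illustrate the metrics named in the abstract, the following is a minimal sketch of exact match, quasi-exact match, and token-level F1 for short text answers. The normalization rules and the containment-based quasi-exact match used here are illustrative assumptions, not the thesis's actual implementation.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 only if the normalized prediction equals the normalized reference."""
    return float(normalize(prediction) == normalize(reference))


def quasi_exact_match(prediction: str, reference: str) -> float:
    """Relaxed variant (assumed): credit if the normalized reference
    appears anywhere inside the normalized prediction."""
    return float(normalize(reference) in normalize(prediction))


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between prediction and reference (SQuAD-style)."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    pred = "The answer is the Mississippi River."
    ref = "Mississippi River"
    print(exact_match(pred, ref))         # 0.0
    print(quasi_exact_match(pred, ref))   # 1.0
    print(round(token_f1(pred, ref), 3))  # 0.667
```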

Language

English (en)

Chair

Chenguang Wang, Computer Science & Engineering

Committee Members

Chien-Ju Ho, William Yeoh
