Machines flunking an exam: Evaluating large language models on course-related open questions
Abstract
Large language models (LLMs) have developed rapidly in China since the rise of ChatGPT and now touch many fields. These models offer a promising way to answer students' questions during learning. However, most relevant research has addressed English-language, multiple-choice questions, mainly on general or medical topics; how LLMs perform in other languages such as Chinese, and on course-related open questions, is less clear. This study therefore evaluates how well LLMs answer open-ended, course-specific questions in Chinese (glossary, short-answer, and essay questions). Answers from six LLMs were evaluated using expert- and machine-assigned grades, and correlation analysis identified which automated metrics were most effective for assessment. We then compared the LLMs' scores to determine the three best-performing models, which were in turn compared with students. Overall, the selected LLMs' performance was unsatisfactory: they scored lower than students, especially on glossary questions. We therefore recommend several ways to refine the operation of LLMs. These suggestions can help promote the adoption of LLMs and serve as a reference for students and educators when posing course-related open questions. The implications, limitations, and future research directions arising from our study are also discussed.