Machines flunking an exam: Evaluating large language models on course-related open questions
Abstract
Large language models (LLMs) have developed rapidly in China since the rise of ChatGPT and now touch many fields. These models offer a promising way to answer students' questions during learning. However, most relevant research has addressed English-language, multiple-choice questions, mainly on general or medical topics; how LLMs perform in other languages such as Chinese, and on course-related open questions, is less clear. This study therefore evaluates how well LLMs answer open-ended, course-specific questions in Chinese (glossary, short-answer, and essay questions). Answers from six LLMs were evaluated using expert- and machine-assigned grades, and correlation analysis identified which automated metrics were most effective for assessment. We then compared the LLMs' scores to determine the three best-performing models, which were in turn compared with students. Overall, the selected LLMs' performance was unsatisfactory: they scored lower than students, especially on glossary questions. We therefore recommend several ways to refine the operation of LLMs. These suggestions can help promote the adoption of LLMs and serve as a reference for students and educators when posing course-related open questions. The implications, limitations, and future research directions arising from our study are also discussed.