Know When to Trust: Making AI Scoring More Reliable for Educational Assessment
Abstract
The rapid rise of large language models (LLMs) has created new opportunities for educational measurement. This paper introduces and evaluates three improvements to LLM-based automated scoring tools: model self-confidence, weighted probabilistic scoring, and ensemble modeling. The study uses data from over 20,000 responses to the Alternative Uses Task, the most widely used divergent thinking task, collected from more than 2,000 participants across multiple studies. Model self-confidence taps into LLMs' internal mechanisms to gauge the certainty of their probability estimates, helping identify when machine-generated outputs are trustworthy. Weighted probabilistic scoring considers a broader range of completion possibilities when deriving a final score. The final technique, ensemble modeling, combines multiple models and assesses the resulting performance gains. Tested on divergent thinking response scoring, each method yields statistically significant gains, improving correlation with human judges (from r = 0.781 to r = 0.823) and reducing error. All three techniques improve the performance and trustworthiness of automated scoring models and can serve as drop-in improvements to existing pipelines. The findings suggest that these adjustments can boost the dependability and applicability of LLMs in educational scoring, specifically for systems that derive a quantitative measure from text input.
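The abstract describes the three techniques only at a high level; the sketch below shows one plausible form each could take, assuming access to the model's probabilities over candidate score tokens (e.g., "1" through "5"). The function names, the example probabilities, and the simple unweighted ensemble are illustrative assumptions, not the paper's implementation.

```python
# A minimal sketch, assuming the scoring model exposes a probability for
# each candidate score token. All values below are made up for illustration.

def weighted_score(token_probs: dict[str, float]) -> float:
    """Weighted probabilistic scoring: the expected value over all
    candidate score tokens, rather than just the single top token."""
    total = sum(token_probs.values())
    return sum(int(tok) * p for tok, p in token_probs.items()) / total

def self_confidence(token_probs: dict[str, float]) -> float:
    """A simple self-confidence proxy: the probability mass on the most
    likely score. Low values can flag responses for human review."""
    return max(token_probs.values()) / sum(token_probs.values())

def ensemble_score(model_scores: list[float]) -> float:
    """A simple unweighted ensemble: average the scores from several models."""
    return sum(model_scores) / len(model_scores)

# Hypothetical token probabilities for one scored response:
probs = {"1": 0.05, "2": 0.10, "3": 0.55, "4": 0.25, "5": 0.05}
print(weighted_score(probs))              # 3.15 instead of a hard 3
print(self_confidence(probs))             # 0.55, a relatively uncertain score
print(ensemble_score([3.15, 3.4, 2.9]))   # combine estimates across models
```

Under these assumptions, all three adjustments compose naturally: weighted scores from several models can be averaged, and the confidence proxy decides which machine scores to trust and which to route to human judges.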