Automated Short Answer Grading from an Educational Assessment Perspective: A Systematic Review
Abstract
Manual grading of short-answer questions is labor-intensive, time-consuming, and inconsistent. As digital technologies and online learning environments have expanded, interest in Automated Short Answer Grading (ASAG) has increased and its development has accelerated. Researchers have proposed a wide range of ASAG methods and systems, and this growing body of work raises important questions about how well ASAG aligns with educational assessment principles. This study presents a systematic review of ASAG research from 2014 to 2024, focusing on educational assessment considerations. The review examines how ASAG approaches have evolved over the past decade and the extent to which studies address key assessment principles such as reliability, validity, and test design. Following PRISMA guidelines, we conducted a Google Scholar search to identify publications incorporating ASAG models. A total of 110 studies met the inclusion criteria and were systematically coded using a 24-variable scheme covering publication metadata, dataset and item characteristics, modelling approaches, scoring materials and grading scales, human rater involvement, evaluation procedures, and evidence related to reproducibility and validity. Our analysis reveals several notable trends and gaps. Most ASAG studies rely on a small set of English-language datasets, with very few covering other languages, highlighting the need for broader multilingual research. No study reported conducting pilot testing or formal test development, indicating limited attention to test quality and validity. Inconsistencies in scoring scales and evaluation metrics were also noted, hindering comparability across studies. Over the decade, techniques shifted from rule-based and feature-based algorithms to deep learning models, including transformer-based networks and large language models, yielding improved performance. However, a gap persists between advanced model development and best practices in educational assessment. We recommend aligning future ASAG research with established measurement standards to enhance the fairness, reliability, and validity of automated scoring systems in education. To promote transparency, we also share our compiled dataset and coding scheme as supplementary materials for further research and practice.