Transforming science marking: A scoping review of auto-markers

Abstract

This scoping review explored the performance of recent transformer-based auto-markers. We followed a systematic process, adhering to relevant PRISMA guidelines. Our review included recent literature (from 2017 onwards), focusing on English natural language responses to science content in an educational assessment context. A final set of 21 articles was reviewed and coded in depth to answer our research questions, which explored the types of auto-marking models being used, the datasets used to fine-tune and test them, and their performance. The most commonly used models in this context were BERT models and BERT variants, which increased in frequency in recent years, peaking in 2021. After 2021, papers using GPT models began to appear. The SciEntsBank dataset was the most commonly used to test auto-markers, although several other datasets (e.g., ASAP SAS, Beetle) also featured in our review. BERT models generally performed better than previous models on the SciEntsBank dataset. GPT models have not yet been evaluated on SciEntsBank, but one study in the review directly compared GPT-3.5 with BERT base and found that GPT-3.5 outperformed BERT base across different items and item types. The review also shows that models that utilise additional forms of data, such as textbooks and marking rubrics, appear to consistently outperform models without them, and that recent auto-markers may still present issues of low reliability, limited explainability, and bias.
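Although the abstract itself contains no code, the setup it describes, fine-tuning a BERT-style model to mark short science answers on a benchmark such as SciEntsBank, can be illustrated with a minimal sketch. The model name, two-label (correct/incorrect) scheme, and example texts below are illustrative assumptions, not details taken from the reviewed studies.

```python
# Minimal sketch of a BERT-based auto-marker (assumes Hugging Face transformers
# and PyTorch; the label scheme and example answers are hypothetical).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # e.g. correct vs. incorrect
)

# Pair a reference answer with a student response, as many BERT-based
# auto-markers do, and score the pair with the classification head.
reference = "The bulb lights because the circuit is closed."
student = "Electricity can flow because there is no gap in the circuit."

inputs = tokenizer(reference, student, truncation=True, return_tensors="pt")
label = torch.tensor([1])  # hypothetical gold label: 1 = correct

outputs = model(**inputs, labels=label)
print(outputs.loss)                    # training loss used during fine-tuning
print(outputs.logits.softmax(dim=-1))  # predicted probability of each label
```

In practice, a model of this kind would be fine-tuned over the full training split of a dataset such as SciEntsBank and then evaluated on its held-out test splits; the snippet above shows only a single scoring step.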
