How well can automated speech processing score early elementary student verbal responses on language and literacy assessments?

Abstract

Many literacy screeners have begun to use automated pronunciation scoring to score student verbal responses. However, little research has evaluated item-level accuracy or explored the factors that lead to inaccurately scored responses. The purpose of this study was to compare the accuracy of several pronunciation scoring and transcription methods against live human scoring of student responses on word reading, blending, deletion, and expressive vocabulary tasks commonly used in literacy screening. Audio responses were recorded via iPad while a child in kindergarten or first grade completed a screening assessment battery facilitated and scored by a live human tester. A subsample of 100 children was selected for each task, and scores from two human audio listeners, pronunciation scoring methods from SoapBox Labs, Azure, Language Confidence, Speechace, and SpeechSuper, and transcription-based methods using OpenAI’s Whisper model were compared against the scores provided by the live human tester. Results showed that the accuracy of automated scoring methods was far below that of human scorers, suggesting that automated methods are not yet ready to mimic human scoring. The present findings highlight both the promise and the current limitations of automated speech processing technology for scoring elementary students’ oral language and literacy responses.
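The abstract mentions transcription-based scoring with OpenAI's Whisper model. As a rough illustration of how such a method might work, the sketch below transcribes a recorded response and checks it against a target word; the file paths, target word, model size, and exact-match scoring rule are illustrative assumptions, not the study's actual pipeline.

```python
# Minimal sketch of transcription-based scoring with Whisper.
# Assumptions: audio paths, the target word, the "base" model size, and the
# exact-match scoring rule are hypothetical examples, not the study's method.
import string
import whisper  # pip install openai-whisper

model = whisper.load_model("base")

def normalize(text: str) -> str:
    """Lowercase and strip punctuation/whitespace for a rough comparison."""
    return text.lower().translate(str.maketrans("", "", string.punctuation)).strip()

def score_response(audio_path: str, target: str) -> int:
    """Return 1 if the transcribed response matches the target word, else 0."""
    transcript = model.transcribe(audio_path)["text"]
    return int(normalize(transcript) == normalize(target))

# Example: score a recorded word-reading response against the expected word.
print(score_response("responses/item_07.wav", "ship"))
```

In practice, a scoring pipeline would likely need fuzzier matching (e.g., allowing minor transcription variants) rather than strict string equality, which is one reason transcription-based accuracy can diverge from human judgment.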
