How well can automated speech processing score early elementary student verbal responses on language and literacy assessments?

Abstract

Many literacy screeners have begun to use automated pronunciation scoring to score student verbal responses. However, little research has evaluated item-level accuracy or explored the factors that lead to inaccurately scored responses. The purpose of this study was to compare the accuracy of several pronunciation scoring and transcription methods against live human scoring of student responses on word reading, blending, deletion, and expressive vocabulary tasks commonly used in literacy screening. Audio responses were recorded via iPad while a child in kindergarten or first grade completed a screening assessment battery facilitated and scored by a live human tester. A subsample of 100 children was selected for each task, and scores from two human audio listeners, pronunciation scoring methods from SoapBox Labs, Azure, Language Confidence, Speechace, and SpeechSuper, and transcription-based methods using OpenAI’s Whisper model were compared against the scores provided by the live human tester. Results showed that the accuracy of automated scoring methods was far below that of human scorers, suggesting that automated methods are not yet ready to mimic human scoring. The present findings highlight both the promise and the current limitations of automated speech processing technology for scoring elementary students’ oral language and literacy responses.
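The abstract mentions transcription-based scoring with OpenAI's Whisper model. As a rough illustration of how such a method might work, the sketch below transcribes a recorded response and checks it against a target word; the file paths, target word, model size, and exact-match scoring rule are illustrative assumptions, not the study's actual pipeline.

```python
# Minimal sketch of transcription-based scoring with Whisper.
# Assumptions: audio paths, the target word, the "base" model size, and the
# exact-match scoring rule are hypothetical examples, not the study's method.
import string
import whisper  # pip install openai-whisper

model = whisper.load_model("base")

def normalize(text: str) -> str:
    """Lowercase and strip punctuation/whitespace for a rough comparison."""
    return text.lower().translate(str.maketrans("", "", string.punctuation)).strip()

def score_response(audio_path: str, target: str) -> int:
    """Return 1 if the transcribed response matches the target word, else 0."""
    transcript = model.transcribe(audio_path)["text"]
    return int(normalize(transcript) == normalize(target))

# Example: score a recorded word-reading response against the expected word.
print(score_response("responses/item_07.wav", "ship"))
```

In practice, a scoring pipeline would likely need fuzzier matching (e.g., allowing minor transcription variants) rather than strict string equality, which is one reason transcription-based accuracy can diverge from human judgment.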
