Predicting Item Difficulties in C-Tests Using Linguistic Features and Transformer-Based Language Models

Abstract

C-Tests, a specific form of cloze tests requiring the completion of truncated word endings in short text passages, are widely used in educational measurement as indicators of reading comprehension, general language proficiency, and crystallized intelligence. In this paper, we examine the feasibility of a more rational construction of C-Tests by estimating item difficulty solely from textual features, without relying on prior empirical testing. To this end, we use item-level (e.g., word length, word category, and word frequency), sentence-level (e.g., presence of sentence negation), and text-level (e.g., readability) surface features to predict item difficulty and compare these predictions to empirical difficulty estimates. Furthermore, we evaluate the added predictive performance derived from two transformer-based language models: BERT and GPT-4o. We reanalyze data from a large-scale educational study in which 1,197 German secondary school students worked on 16 C-Tests. Using elastic net regression with nested resampling, we found that surface features explained approximately 20% of the variance in out-of-sample predictions. While BERT estimates did not yield any incremental predictive performance over linguistic features, GPT-based predictions added an increment of 6%, resulting in a total explained variance of R² = 26% in out-of-sample predictions. We discuss the strengths and limitations of using large language models in language assessment specifically and in test construction more generally, as well as the challenges that must be addressed to fully realize their potential.
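The core modeling step described in the abstract, elastic net regression tuned and evaluated with nested resampling, can be sketched as follows. This is a minimal illustration rather than the authors' code: the data shapes, feature set, hyperparameter grid, and use of scikit-learn are assumptions introduced for demonstration only.

```python
# Hypothetical sketch: predicting C-Test item difficulty from surface features
# with elastic net regression and nested resampling (all specifics are assumed).
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Placeholder data: one row per C-Test gap, with item-, sentence-, and
# text-level features (e.g., word length, log frequency, negation, readability).
X = rng.normal(size=(400, 12))          # assumed feature matrix
y = rng.uniform(0.1, 0.9, size=400)     # assumed empirical item difficulties

# Inner loop tunes the penalty strength and L1/L2 mixing parameter;
# outer loop estimates out-of-sample R².
model = make_pipeline(
    StandardScaler(),
    ElasticNet(max_iter=10_000),
)
param_grid = {
    "elasticnet__alpha": np.logspace(-3, 1, 10),
    "elasticnet__l1_ratio": [0.1, 0.5, 0.9],
}
inner_cv = KFold(n_splits=5, shuffle=True, random_state=1)
outer_cv = KFold(n_splits=10, shuffle=True, random_state=2)

search = GridSearchCV(model, param_grid, cv=inner_cv, scoring="r2")
outer_r2 = cross_val_score(search, X, y, cv=outer_cv, scoring="r2")

print(f"Mean out-of-sample R²: {outer_r2.mean():.2f}")
```

In this setup, the reported ~20% (surface features) and 26% (with GPT-based predictors) explained variance would correspond to the mean R² across the outer folds, computed on items never seen during hyperparameter tuning.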
