Predicting Item Difficulties in C-Tests Using Linguistic Features and Transformer-Based Language Models

Abstract

C-Tests, a specific form of cloze tests requiring the completion of truncated word endings in short text passages, are widely used in educational measurement as indicators of reading comprehension, general language proficiency, and crystallized intelligence. In this paper, we examine the feasibility of a more rational construction of C-Tests by estimating item difficulty solely from textual features, without relying on prior empirical testing. To this end, we use item-level (e.g., word length, word category, and word frequency), sentence-level (e.g., presence of sentence negation), and text-level (e.g., readability) surface features to predict item difficulty and compare these predictions to empirical difficulty estimates. Furthermore, we evaluate the added predictive performance derived from two transformer-based language models: BERT and GPT-4o. We reanalyze data from a large-scale educational study in which 1,197 German secondary school students worked on 16 C-Tests. Using elastic net regression with nested resampling, we found that surface features explained approximately 20% of the variance in out-of-sample predictions. While BERT estimates did not yield any incremental predictive performance over linguistic features, GPT-based predictions added an increment of 6%, resulting in a total explained variance of R² = 26% in out-of-sample predictions. We discuss the strengths and limitations of using large language models in language assessment specifically and in test construction more generally, as well as the challenges that must be addressed to fully realize their potential.
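The core modeling step described in the abstract, elastic net regression tuned and evaluated with nested resampling, can be sketched as follows. This is a minimal illustration rather than the authors' code: the data shapes, feature set, hyperparameter grid, and use of scikit-learn are assumptions introduced for demonstration only.

```python
# Hypothetical sketch: predicting C-Test item difficulty from surface features
# with elastic net regression and nested resampling (all specifics are assumed).
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Placeholder data: one row per C-Test gap, with item-, sentence-, and
# text-level features (e.g., word length, log frequency, negation, readability).
X = rng.normal(size=(400, 12))          # assumed feature matrix
y = rng.uniform(0.1, 0.9, size=400)     # assumed empirical item difficulties

# Inner loop tunes the penalty strength and L1/L2 mixing parameter;
# outer loop estimates out-of-sample R².
model = make_pipeline(
    StandardScaler(),
    ElasticNet(max_iter=10_000),
)
param_grid = {
    "elasticnet__alpha": np.logspace(-3, 1, 10),
    "elasticnet__l1_ratio": [0.1, 0.5, 0.9],
}
inner_cv = KFold(n_splits=5, shuffle=True, random_state=1)
outer_cv = KFold(n_splits=10, shuffle=True, random_state=2)

search = GridSearchCV(model, param_grid, cv=inner_cv, scoring="r2")
outer_r2 = cross_val_score(search, X, y, cv=outer_cv, scoring="r2")

print(f"Mean out-of-sample R²: {outer_r2.mean():.2f}")
```

In this setup, the reported ~20% (surface features) and 26% (with GPT-based predictors) explained variance would correspond to the mean R² across the outer folds, computed on items never seen during hyperparameter tuning.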
