Stacked Ensemble Learning for Content-Based Item Difficulty Prediction

Yuxiao Zhang
Yanyan Fu
Kyung T. Han

Abstract

Content-based prediction of item parameters has potential value for supporting item development and calibration-related tasks, particularly when operational calibration is costly or slow. The present study evaluated a three-stage stacked ensemble framework for predicting item difficulty from item content using 1,079 verbal reasoning items from a higher-education admission test. In Stage 1, a large language model coded 10 theoretically motivated content features, which were then used in a random forest as predictors. In Stage 2, a transformer encoder (DeBERTa-v3-base) was fine-tuned to predict difficulty directly from raw item text. In Stage 3, a ridge regression meta-learner combined Stage 1 and Stage 2 predictions. Performance was evaluated across five random train-test splits using Pearson correlation and root mean squared error (RMSE). The feature-based model outperformed the standalone text-based model on held-out data (r = .314 vs. .273), suggesting that structured, cognitively oriented features were more informative than encoder-only text representations in this dataset. The stacked model yielded the highest test-set correlation (r = .354, RMSE = 0.743), indicating modest improvement over either base learner alone and supporting the view that the two approaches captured partially complementary information. Feature-importance analyses indicated that reasoning steps, task type, and option complexity were the strongest unique predictors. Although the observed level of accuracy was insufficient for standalone operational use, the findings suggest that item content contains recoverable information about difficulty and that integrating interpretable feature-based and text-based representations is a promising direction for supporting calibration workflows.
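The Stage 3 step and the two evaluation metrics named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The Stage 1 and Stage 2 predictions are simulated as noisy copies of a synthetic difficulty value, because no random forest or DeBERTa model is trained here. The function names (`pearson_r`, `rmse`, `fit_ridge_meta`, `demo`), the penalty `alpha`, the noise levels, and the single 80/20 split are all assumptions made for illustration.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def rmse(pred, y):
    """Root mean squared error of predictions against targets."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, y)) / len(y))


def fit_ridge_meta(p1, p2, y, alpha=1.0):
    """Closed-form ridge regression of y on two base-learner predictions.

    Predictors and target are centered so the intercept is not penalized.
    Returns (w1, w2, intercept). alpha is a hypothetical penalty value.
    """
    n = len(y)
    m1, m2, my = sum(p1) / n, sum(p2) / n, sum(y) / n
    c1 = [a - m1 for a in p1]
    c2 = [a - m2 for a in p2]
    cy = [a - my for a in y]
    # Normal equations: (X'X + alpha * I) w = X'y for a 2-column X.
    s11 = sum(a * a for a in c1) + alpha
    s22 = sum(a * a for a in c2) + alpha
    s12 = sum(a * b for a, b in zip(c1, c2))
    t1 = sum(a * b for a, b in zip(c1, cy))
    t2 = sum(a * b for a, b in zip(c2, cy))
    det = s11 * s22 - s12 * s12
    w1 = (s22 * t1 - s12 * t2) / det
    w2 = (s11 * t2 - s12 * t1) / det
    return w1, w2, my - w1 * m1 - w2 * m2


def demo(seed=0, n_items=1000, train_frac=0.8):
    """Fit the meta-learner on a training split and score a held-out split."""
    rng = random.Random(seed)
    # Synthetic "true" difficulties; NOT data from the study.
    y = [rng.gauss(0, 1) for _ in range(n_items)]
    # Simulated base-learner outputs with independent noise (assumption).
    p1 = [t + rng.gauss(0, 2.0) for t in y]  # stands in for Stage 1
    p2 = [t + rng.gauss(0, 2.5) for t in y]  # stands in for Stage 2
    cut = int(n_items * train_frac)
    w1, w2, bias = fit_ridge_meta(p1[:cut], p2[:cut], y[:cut])
    stacked = [w1 * a + w2 * b + bias for a, b in zip(p1[cut:], p2[cut:])]
    y_test = y[cut:]
    return {
        "r_stage1": pearson_r(p1[cut:], y_test),
        "r_stage2": pearson_r(p2[cut:], y_test),
        "r_stacked": pearson_r(stacked, y_test),
        "rmse_stacked": rmse(stacked, y_test),
    }


results = demo()
for name, value in results.items():
    print(f"{name}: {value:.3f}")
```

In a real stacking setup, the base predictions fed to the meta-learner are usually out-of-fold predictions, so that the meta-learner is not fit on overfit base outputs; the abstract does not say how this was handled. The study also averaged over five random splits, whereas this sketch uses one.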

Version published to 10.35542/osf.io/7ukxd_v1 on OSF Preprints
Apr 17, 2026

Related articles

Deep Learning Cognitive Diagnosis Models for Modeling Response and Process Data under Exam Settings

This article has 2 authors:
1. Yikai Lu
2. Wenchao Ma
This article has no evaluations
Latest version Apr 18, 2026

Generative Psychometrics via AI-GENIE: Automatic Item Generation and Validation with Network-Integrated Evaluation

This article has 3 authors:
1. Lara Lee Russell-Lasalandra
2. Alexander P. Christensen
3. Hudson Golino
This article has no evaluations
Latest version Apr 20, 2026

Weighted Likelihood Estimation of Latent Ability in Sequential Item Response Theory: Properties and Comparisons

This article has 2 authors:
1. Yikai Lu
2. Ying Cheng
This article has no evaluations
Latest version Apr 18, 2026
