Multimodal test item parameter prediction from text, images, and metadata: Fusing together AI vision and language models

Hotaka Maeda
Yikai Lu

Abstract

We propose a flexible multimodal model for predicting all dichotomous and polytomous item parameters from text, images, and metadata by fusing representations from Transformer encoder vision and language models. This deep learning model accommodates heterogeneous item formats, including items with any number of components such as correct and incorrect options, stimuli, and images. Answer key indicators distinguish correct options from distractors, while an attention pooling technique weights the relative importance of these components. Based on the 2-parameter logistic model and the generalized partial credit model, we predict all item parameters jointly, using a masking strategy to ensure that only relevant parameters contribute to the loss. Item-level and item component-level metadata are also included. We evaluate the approach using 40,965 English language arts and mathematics items for grades 3-11. A single model accommodated both exam subjects, all eleven item types, and all item parameters, eliminating the need for multiple specialized models. However, results indicate that the full model was unable to leverage all input data: prediction accuracy was often unchanged when features were removed. Images were good predictors on their own (item intercept $R^2=.25$), but did not consistently contribute unique information when combined with text and metadata. Among the strongest-performing models, the most parsimonious variant achieved $R^2$ values of .67, .45, .39, .34, .66, .33, and .74 for item intercept, discrimination, difficulty, and four polytomous step threshold parameters, respectively. Findings suggest that current training methods may limit the learnability of complex multimodal deep fusion models.
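
The two mechanisms named in the abstract, attention pooling over item components and a masked loss over jointly predicted parameters, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the parameter ordering, the use of mean squared error, and the assumption that a dichotomous item has no step thresholds are all hypothetical choices made for illustration.

```python
import math

# Hypothetical parameter layout (an assumption, not taken from the preprint):
# intercept, discrimination, difficulty, and four polytomous step thresholds.
PARAM_NAMES = ["intercept", "discrimination", "difficulty",
               "step1", "step2", "step3", "step4"]


def attention_pool(component_vectors, scores):
    """Softmax-weighted average of component embeddings.

    component_vectors: list of equal-length lists, one per item component
        (for example options, stimuli, images).
    scores: one unnormalized relevance score per component; in a real model
        these would come from a learned scorer (assumed here).
    Returns the pooled vector and the softmax weights.
    """
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(component_vectors[0])
    pooled = [
        sum(w * vec[d] for w, vec in zip(weights, component_vectors))
        for d in range(dim)
    ]
    return pooled, weights


def masked_mse(predictions, targets, mask):
    """Mean squared error over only the parameters whose mask entry is 1.

    Masked-out parameters (mask entry 0) contribute nothing to the loss,
    so one output head can serve items with different parameter sets.
    """
    n_active = sum(mask)
    if n_active == 0:
        return 0.0
    squared = sum(m * (p - t) ** 2 for p, t, m in zip(predictions, targets, mask))
    return squared / n_active


# Usage: two components with equal scores receive equal weight.
pooled, weights = attention_pool([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])

# Usage: an illustrative dichotomous item, with step thresholds masked out.
loss = masked_mse(
    predictions=[1.0, 2.0, 3.0, 9.0, 9.0, 9.0, 9.0],
    targets=[1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    mask=[1, 1, 1, 0, 0, 0, 0],
)
```

In the usage example, the large errors on the four masked step thresholds do not affect the loss, which is averaged over the three active parameters only.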

Version published to 10.35542/osf.io/vr93a_v1 on OSF Preprints
Mar 30, 2026

Related articles

Stacked Ensemble Learning for Content-Based Item Difficulty Prediction

This article has 3 authors:
1. Yuxiao Zhang
2. Yanyan Fu
3. Kyung T. Han
This article has no evaluations. Latest version: Apr 17, 2026

Attention Heatmap Drift in a Contrastively Pretrained Vision–Language Model: A Controlled Matched-Learning-Rate Comparison of Full Fine-Tuning and Low-Rank Adaptation

This article has 1 author:
1. Ruize Xia
This article has no evaluations. Latest version: Apr 6, 2026

Introducing a fusion model of language content attention mechanisms and structural embeddings to achieve automatic scoring of English writing

This article has 1 author:
1. Bingling Chen
This article has no evaluations. Latest version: Apr 16, 2026
