Multimodal test item parameter prediction from text, images, and metadata: Fusing together AI vision and language models
Abstract
We propose a flexible multimodal model for predicting all dichotomous and polytomous item parameters from text, images, and metadata by fusing representations from encoder Transformer vision and language models. This deep learning model accommodates heterogeneous item formats, including items with any number of components such as correct and incorrect options, stimuli, and images. Answer key indicators distinguish correct options from distractors, while an attention pooling technique weights the relative importance of these components. Based on the 2-parameter logistic model and the generalized partial credit model, we predict all item parameters jointly using a masking strategy to ensure that only relevant parameters contribute to the loss. Item-level and item component-level metadata are also included. We evaluate the approach using 40,965 English language arts and mathematics items for grades 3-11. A single model accommodated both exam subjects, all eleven item types, and all item parameters, eliminating the need for multiple specialized models. However, results indicate that the full model was unable to leverage all input data. Often, prediction accuracy was unchanged as features were removed. Images were good predictors on their own (item intercept $R^2=.25$), but did not consistently contribute unique information when combined with text and metadata. Of the strongest-performing models, the most parsimonious variant achieved $R^2$ values of .67, .45, .39, .34, .66, .33, and .74 for item intercept, discrimination, difficulty, and four polytomous step threshold parameters, respectively. Findings suggest that current training methods may limit the learnability of complex multimodal deep fusion models.
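The two mechanisms named in the abstract, attention pooling over item components and a masked loss over jointly predicted parameters, can be illustrated with a minimal standard-library sketch. It is not the authors' implementation. The function names `attention_pool` and `masked_mse`, the use of softmax-normalized scalar scores, and the choice of mean squared error are all assumptions made for illustration; the abstract does not specify the scoring function or the loss.

```python
import math


def attention_pool(component_vectors, scores):
    """Pool a variable number of component vectors into one vector.

    Assumption: each component (option, stimulus, image) already has an
    embedding and a scalar relevance score; weights are a softmax of the scores.
    """
    max_score = max(scores)  # subtract the max for numerical stability
    exp_scores = [math.exp(s - max_score) for s in scores]
    total = sum(exp_scores)
    weights = [e / total for e in exp_scores]
    dim = len(component_vectors[0])
    return [
        sum(w * vec[d] for w, vec in zip(weights, component_vectors))
        for d in range(dim)
    ]


def masked_mse(predictions, targets, mask):
    """Mean squared error over only the parameters flagged as relevant.

    Assumption: mask[i][j] is 1 when parameter j applies to item i (for
    example, step thresholds apply only to polytomous items) and 0 otherwise.
    """
    squared_error = 0.0
    count = 0
    for pred_row, target_row, mask_row in zip(predictions, targets, mask):
        for pred, target, keep in zip(pred_row, target_row, mask_row):
            if keep:
                squared_error += (pred - target) ** 2
                count += 1
    # With no relevant parameters there is nothing to penalize.
    return squared_error / count if count else 0.0
```

In this sketch, a dichotomous item would carry zeros in the mask positions for the step thresholds, so whatever the model outputs for those positions has no effect on the loss, which is the behavior the abstract attributes to the masking strategy.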