Multimodal Human Perception of Object Dimensions: Evidence from Deep Neural Networks and Large Language Models
Abstract
Human object recognition relies on both perceptual and semantic dimensions. Here, we examined how deep neural networks (DNNs) and large language models (LLMs) capture and integrate human-derived dimensions of object similarity. We extracted layer activations from CORnet-S and obtained BERT embeddings for 1853 images from the THINGS dataset, and used support vector regression (SVR) to quantify how much variance each representation explained in the human-derived dimensions. Results showed that multimodal integration improved predictions in early visual processing but offered limited additional benefit at later stages, suggesting that deep perceptual processing already encodes meaningful object representations.
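The regression step described above can be illustrated with a minimal sketch: an SVR predicts one human-derived dimension from unimodal features and from their concatenation, with cross-validated R² as the measure of explained variance. This is not the authors' code; the random arrays stand in for precomputed CORnet-S activations and BERT embeddings, and the feature dimensionalities, kernel, and concatenation-based "multimodal" integration are assumptions.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_images = 1853                                        # number of THINGS images used here
cornet_feats = rng.standard_normal((n_images, 512))    # placeholder for CORnet-S layer activations
bert_feats = rng.standard_normal((n_images, 768))      # placeholder for BERT embeddings
dimension = rng.standard_normal(n_images)              # placeholder for one human-derived dimension

def explained_variance(features, target):
    """Cross-validated R^2 of an SVR predicting the target dimension."""
    model = make_pipeline(StandardScaler(), SVR(kernel="linear", C=1.0))
    return cross_val_score(model, features, target, cv=5, scoring="r2").mean()

r2_visual = explained_variance(cornet_feats, dimension)
r2_semantic = explained_variance(bert_feats, dimension)
# "Multimodal" here is simple feature concatenation; the paper may integrate the modalities differently.
r2_multimodal = explained_variance(np.hstack([cornet_feats, bert_feats]), dimension)

print(f"visual R^2: {r2_visual:.3f}  semantic R^2: {r2_semantic:.3f}  multimodal R^2: {r2_multimodal:.3f}")
```

In practice, the placeholder arrays would be replaced with real activations extracted per layer of CORnet-S and sentence- or word-level BERT embeddings, and the comparison repeated across layers to trace where multimodal features add predictive value.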