Semantic Saliency from Multi-Modal Large Language Model Scene Understanding Maps


Abstract

Image-computable low-level image saliency has shaped perceptual psychology and computer science, but it is limited by its inability to capture the high-level cognitive factors that influence human attention and eye movements. A recent study showed that, while free-viewing scenes, participants fixate on objects that are critical to scene understanding. Here, we propose a fully automated method (with no fitting parameters or training) to create scene understanding maps (SUMs) that visualize the quantitative contribution of each object to human comprehension of a scene. The method (AUTO-SUM) uses automatic segmentation and removal of individual objects from scenes, Multi-Modal Large Language Models (MLLMs) to describe the scenes, and semantic similarity measures computed from large language model embeddings of the scene descriptions. We show that AUTO-SUMs can approximate H-SUMs estimated using human-operated segmentation and object removal, human scene descriptions, and human sentence similarity ratings. We also show that AUTO-SUM predicts the object most fixated by human observers during free-viewing and scene-description tasks better than a low-level saliency model (GBVS) and comparably to DeepGaze. We contend that AUTO-SUM can serve as a semantic saliency model that complements lower-level saliency models.
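The scoring step described in the abstract can be sketched as follows: an object's contribution to scene understanding is estimated as the semantic distance between descriptions of the scene with and without that object. This is a minimal, hedged sketch; the actual AUTO-SUM pipeline uses MLLM-generated descriptions and large-language-model sentence embeddings, whereas here a toy bag-of-words embedding and hypothetical example descriptions stand in so the sketch runs without any model.

```python
# Sketch (not the authors' implementation): score an object's contribution
# to scene understanding as 1 minus the similarity of two descriptions.
from collections import Counter
import math

def embed(text: str) -> Counter:
    """Toy stand-in for a sentence-embedding model: bag-of-words counts."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def sum_score(full_desc: str, ablated_desc: str) -> float:
    """Contribution of the removed object: semantic distance between the
    description of the full scene and of the object-ablated scene."""
    return 1.0 - cosine(embed(full_desc), embed(ablated_desc))

# Hypothetical descriptions of one scene and two object-removed versions
full = "a chef cooking pasta in a busy restaurant kitchen"
no_chef = "an empty restaurant kitchen with pots on the stove"
no_pot = "a chef cooking pasta in a busy restaurant kitchen with pots"

# Removing a scene-critical object changes the description more,
# so it receives a higher scene-understanding contribution.
print(sum_score(full, no_chef) > sum_score(full, no_pot))
```

In the full method this score, computed per segmented object, populates the scene understanding map (SUM).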
