Semantic Saliency from Multi-Modal Large Language Model Scene Understanding Maps


Abstract

Image-computable low-level image saliency has shaped perceptual psychology and computer science, but it is limited by its inability to capture the high-level cognitive factors that influence human attention and eye movements. A recent study showed that, while free-viewing scenes, participants fixate on objects that are critical to scene understanding. Here, we propose a fully automated method (with no fitting parameters or training) to create scene understanding maps (SUMs) that visualize the quantitative contribution of each object to human comprehension of a scene. The method (AUTO-SUM) uses automatic segmentation and removal of individual objects from scenes, Multi-Modal Large Language Models (MLLMs) to describe the scenes, and semantic similarity measures computed from large language model embeddings of the scene descriptions. We show that AUTO-SUMs can approximate H-SUMs estimated using human-operated segmentation and object removal, human scene descriptions, and human sentence similarity ratings. We also show that AUTO-SUM predicts the object most fixated by human observers during free-viewing and scene-description tasks better than a low-level saliency model (GBVS) and comparably to DeepGaze. We contend that AUTO-SUM can serve as a semantic saliency model that complements lower-level saliency models.
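The scoring step described in the abstract can be sketched as follows: an object's contribution to scene understanding is estimated as the semantic distance between descriptions of the scene with and without that object. This is a minimal, hedged sketch; the actual AUTO-SUM pipeline uses MLLM-generated descriptions and large-language-model sentence embeddings, whereas here a toy bag-of-words embedding and hypothetical example descriptions stand in so the sketch runs without any model.

```python
# Sketch (not the authors' implementation): score an object's contribution
# to scene understanding as 1 minus the similarity of two descriptions.
from collections import Counter
import math

def embed(text: str) -> Counter:
    """Toy stand-in for a sentence-embedding model: bag-of-words counts."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def sum_score(full_desc: str, ablated_desc: str) -> float:
    """Contribution of the removed object: semantic distance between the
    description of the full scene and of the object-ablated scene."""
    return 1.0 - cosine(embed(full_desc), embed(ablated_desc))

# Hypothetical descriptions of one scene and two object-removed versions
full = "a chef cooking pasta in a busy restaurant kitchen"
no_chef = "an empty restaurant kitchen with pots on the stove"
no_pot = "a chef cooking pasta in a busy restaurant kitchen with pots"

# Removing a scene-critical object changes the description more,
# so it receives a higher scene-understanding contribution.
print(sum_score(full, no_chef) > sum_score(full, no_pot))
```

In the full method this score, computed per segmented object, populates the scene understanding map (SUM).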
