Towards Transparent Urban Perception: A Concept-Driven Framework with Visual Foundation Models
Abstract
Understanding urban visual perception plays a vital role in modeling how people cognitively and emotionally respond to the built environment, yet conventional survey-based methods are limited in scalability and spatial generalization. To address this, we present a transparent and interpretable framework that leverages recent advances in Visual Foundation Models (VFMs) and concept-based reasoning. Our approach, UP-CBM, constructs a task-specific concept vocabulary using GPT-4o and processes urban scene images via a multi-scale visual prompting strategy. This strategy generates CLIP-based similarity maps that supervise the learning of an interpretable bottleneck layer, enabling a transparent reasoning path from raw visual inputs to perceptual outcomes. Comprehensive experiments on Place Pulse 2.0 (+0.041 in comparison accuracy, +0.029 in R²) and VRVWPR (+0.018 in classification accuracy) show that UP-CBM improves both predictive performance and transparency. These results underscore the value of combining VFMs with structured concept pipelines for robust and scalable modeling of urban visual perception.
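The abstract summarizes the pipeline without implementation detail. As a rough illustration of the central idea (CLIP similarities against a concept vocabulary supervising an interpretable bottleneck), the minimal PyTorch sketch below uses Hugging Face's CLIP. The concept list, checkpoint, loss weighting, and the collapse of the paper's multi-scale similarity maps into a single global similarity per concept are all assumptions for illustration, not the authors' actual configuration.

```python
# Minimal sketch of a CLIP-supervised concept bottleneck.
# Assumptions: concept list, checkpoint, and loss weighting are illustrative;
# the paper's multi-scale visual prompting is reduced here to one global
# similarity score per concept.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical concepts; UP-CBM generates its vocabulary with GPT-4o.
concepts = ["greenery", "graffiti", "broken pavement", "open sky", "heavy traffic"]

with torch.no_grad():
    tok = proc(text=[f"a photo of {c}" for c in concepts],
               return_tensors="pt", padding=True).to(device)
    concept_emb = F.normalize(clip.get_text_features(**tok), dim=-1)  # (C, D)

def embed_images(images):
    """Frozen CLIP image embeddings, L2-normalized."""
    with torch.no_grad():
        px = proc(images=images, return_tensors="pt").to(device)
        return F.normalize(clip.get_image_features(**px), dim=-1)     # (B, D)

class ConceptBottleneck(nn.Module):
    """Image embedding -> concept activations -> perception score.
    Both stages are linear, so each prediction decomposes into
    per-concept contributions."""
    def __init__(self, n_concepts):
        super().__init__()
        self.to_concepts = nn.Linear(clip.config.projection_dim, n_concepts)
        self.to_score = nn.Linear(n_concepts, 1)

    def forward(self, img_emb):
        c = self.to_concepts(img_emb)          # interpretable bottleneck
        return self.to_score(c), c

def loss_fn(pred, c_hat, y, img_emb, lam=0.5):
    """Task loss on perception labels y, plus a CLIP-similarity term
    that supervises the bottleneck (lam is an assumed weighting)."""
    clip_sims = img_emb @ concept_emb.T        # (B, C) supervision target
    return F.mse_loss(pred.squeeze(-1), y) + lam * F.mse_loss(c_hat, clip_sims)
```

Because both stages are linear, a trained model's prediction for an image can be read off as a sum of per-concept contributions (head weight times concept activation), which is what makes the bottleneck transparent in the sense the abstract describes.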