Towards Transparent Urban Perception: A Concept-Driven Framework with Visual Foundation Models
Abstract
Understanding urban visual perception is crucial for modeling how individuals cognitively and emotionally interact with the built environment. However, traditional survey-based approaches are limited in scalability and often fail to generalize across diverse urban contexts. In this study, we introduce UP-CBM, a transparent framework that leverages visual foundation models (VFMs) and concept-based reasoning to address these challenges. UP-CBM automatically constructs a task-specific vocabulary of perceptual concepts using GPT-4o and processes urban scene images through a multi-scale visual prompting pipeline. This pipeline generates CLIP-based similarity maps that guide the learning of an interpretable bottleneck layer, linking visual features to human perceptual judgments. The framework achieves higher predictive accuracy while also offering enhanced interpretability, enabling transparent reasoning about urban perception. Experiments on two benchmark datasets, Place Pulse 2.0 (improvements of +0.041 in comparison accuracy and +0.029 in R²) and VRVWPR (+0.018 in classification accuracy), demonstrate the effectiveness and generalizability of our approach. These results underscore the potential of integrating VFMs with structured concept-driven pipelines for more explainable urban visual analytics.
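To make the described pipeline concrete, the sketch below shows a minimal concept-bottleneck setup over CLIP similarities. Everything beyond what the abstract states is an assumption: the concept list (the paper derives its vocabulary with GPT-4o), the use of OpenAI's `clip` package, multi-scale center crops as a stand-in for the paper's visual prompting, the example file name `street_view.jpg`, and the single linear head. It is illustrative, not the authors' implementation.

```python
# Minimal sketch of a CLIP-based concept bottleneck for urban perception.
# Assumptions (not from the paper): the concept list below, OpenAI's `clip`
# package, multi-scale center crops as a proxy for the paper's visual
# prompting pipeline, and a single linear head over concept activations.

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical perceptual concepts; the paper builds this vocabulary with GPT-4o.
concepts = ["greenery", "graffiti", "broken sidewalk", "wide street",
            "street lighting", "litter", "open sky", "dense traffic"]

with torch.no_grad():
    text_feat = model.encode_text(clip.tokenize(concepts).to(device))
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

def concept_activations(image: Image.Image, scales=(1.0, 0.5)) -> torch.Tensor:
    """Average CLIP image-concept similarities over multi-scale center crops.

    A simplified stand-in for the paper's multi-scale visual prompting,
    which produces similarity maps rather than pooled scores.
    """
    feats = []
    w, h = image.size
    for s in scales:
        cw, ch = int(w * s), int(h * s)
        left, top = (w - cw) // 2, (h - ch) // 2
        crop = image.crop((left, top, left + cw, top + ch))
        x = preprocess(crop).unsqueeze(0).to(device)
        with torch.no_grad():
            f = model.encode_image(x)
        feats.append(f / f.norm(dim=-1, keepdim=True))
    img_feat = torch.cat(feats).mean(dim=0, keepdim=True)
    return (img_feat @ text_feat.T).squeeze(0)  # one similarity per concept

# Interpretable bottleneck: a linear map from concept activations to a
# perception rating (e.g., perceived safety); its weights expose how much
# each concept contributes to every prediction.
head = torch.nn.Linear(len(concepts), 1).to(device)
score = head(concept_activations(Image.open("street_view.jpg")).float())
```

Because the prediction passes through the named-concept layer, each output score decomposes into per-concept contributions (weight times activation), which is the source of the transparency the abstract emphasizes.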