Low-data image-based urban noise estimation
Abstract
Self-supervised representation learning (SSL) methods have shown advances in learning invariant representations for computer vision tasks, but their robustness under practical deployment conditions remains underexplored. Urban noise pollution is a critical environmental challenge with significant public health implications, yet traditional noise monitoring requires expensive sensor networks, limiting scalability. We introduce a machine learning framework that predicts background noise levels from casual smartphone imagery using only 400 samples. Our pipeline combines transformer-based semantic segmentation (SegFormer) with interpretable hand-crafted features (natural/man-made proportions, movable-object metrics, and visual complexity) fed into Support Vector Regression. Critically, we introduce a perceptually grounded evaluation based on the ±3 dB just-noticeable-difference threshold established in psychoacoustic research. Perceptually adjusted evaluation (MAE = 2.92 dB, RMSE = 4.53 dB, R² = 0.61) demonstrates accuracy comparable to large-scale systems while requiring dramatically fewer resources, enabling scalable citizen-science-driven urban acoustic monitoring.
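The regression stage described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes per-image feature vectors (natural/man-made proportions, a movable-object metric, visual complexity) have already been derived from SegFormer segmentation maps, and it uses entirely synthetic placeholder data in the paper's small-data regime (400 samples). The feature names, kernel choice, and hyperparameters are assumptions for illustration only.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
n = 400  # small-data regime, as in the abstract

# Hypothetical feature columns:
# [natural_proportion, manmade_proportion, movable_object_metric, visual_complexity]
X = rng.random((n, 4))
# Synthetic noise levels in dB: louder scenes have more man-made / movable content
y = 50.0 + 15.0 * X[:, 1] + 10.0 * X[:, 2] + rng.normal(0.0, 2.0, n)

# Standardize features, then fit an RBF-kernel Support Vector Regressor
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.5))
model.fit(X[:300], y[:300])
pred = model.predict(X[300:])

mae = mean_absolute_error(y[300:], pred)
# Perceptually grounded check: fraction of predictions within the +/-3 dB
# just-noticeable-difference threshold mentioned in the abstract
within_jnd = float(np.mean(np.abs(pred - y[300:]) <= 3.0))
```

A perceptually adjusted metric like `within_jnd` treats any error below the ±3 dB JND as perceptually equivalent to a correct prediction, which is the evaluation idea the abstract emphasizes.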