Human-like monocular depth biases in deep neural networks

Abstract

Human depth perception from 2D images is systematically distorted, yet the nature of these distortions is not fully understood. To gain insights into this fundamental problem, we compare human depth judgments with those of deep neural networks (DNNs), which have shown remarkable abilities in monocular depth estimation. Using a novel human-annotated dataset of natural indoor scenes and a systematic analysis of absolute depth judgments, we investigate error patterns in both humans and DNNs. Employing exponential-affine fitting, we decompose depth estimation errors into depth compression, per-image affine transformations (including scaling, shearing, and translation), and residual errors. Our analysis reveals that human depth judgments exhibit systematic and consistent biases, including depth compression, a vertical bias (perceiving objects in the lower visual field as closer), and consistent per-image affine distortions across participants. Intriguingly, we find that DNNs with higher accuracy partially recapitulate these human biases, demonstrating greater similarity in affine parameters and residual error patterns. This suggests that these seemingly suboptimal human biases may reflect efficient, ecologically adapted strategies for depth inference from inherently ambiguous monocular images. However, while DNNs capture metric-level residual error patterns similar to humans, they fail to reproduce human-level accuracy in ordinal depth perception within the affine-invariant space. These findings underscore the importance of evaluating error patterns beyond raw accuracy, providing new insights into how humans and computational models resolve depth ambiguity. Our dataset and methodology provide a framework for evaluating the alignment between computational models and human perceptual biases, thereby advancing our understanding of visual space representation and guiding the development of models that more faithfully capture human depth perception.
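To make the exponential-affine decomposition concrete, the sketch below shows one plausible way to fit such a model per image. This is not the authors' implementation: it assumes judged depth is an affine function (scale, shear in image coordinates, and translation) of exponentially compressed true depth, with the leftover variance treated as residual error. All parameter names (s, k, hx, hy, t) and the specific compression form are illustrative.

```python
# Minimal sketch of an exponential-affine fit, assuming the model
#     judged ~ s * (1 - exp(-k * d)) / k + hx * x + hy * y + t
# where d is true depth and (x, y) are image coordinates. As k -> 0 the
# compression term approaches d (no compression); a negative hy would
# capture a vertical bias (lower visual field judged closer).
import numpy as np
from scipy.optimize import least_squares

def exp_affine_model(params, d, x, y):
    """Predict judged depth from true depth d and image coordinates (x, y)."""
    s, k, hx, hy, t = params
    compressed = (1.0 - np.exp(-k * d)) / k      # exponential depth compression
    return s * compressed + hx * x + hy * y + t  # per-image affine transform

def fit_exp_affine(judged, d, x, y):
    """Fit the model for one image; return parameters and residual errors."""
    def residuals(params):
        return exp_affine_model(params, d, x, y) - judged
    init = np.array([1.0, 0.1, 0.0, 0.0, 0.0])
    fit = least_squares(
        residuals, init,
        bounds=([0.0, 1e-6, -np.inf, -np.inf, -np.inf], np.inf),
    )
    return fit.x, residuals(fit.x)

# Example on synthetic data for a single image:
rng = np.random.default_rng(0)
d = rng.uniform(0.5, 8.0, 200)        # true depths in metres
x, y = rng.uniform(-1, 1, (2, 200))   # normalised image coordinates
true_params = np.array([6.0, 0.3, 0.2, -0.8, 0.5])
judged = exp_affine_model(true_params, d, x, y) + rng.normal(0, 0.1, 200)
params, resid = fit_exp_affine(judged, d, x, y)
print(params, np.std(resid))
```

Once the affine and compression components are fitted per image, the remaining residuals can be compared between humans and DNNs, which is the basis for the similarity analysis described above.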

Author summary

Understanding the characteristic errors in depth judgments made by humans and deep neural networks (DNNs) provides a foundation for developing functional models of the human brain and artificial models with enhanced interpretability. To address this, we constructed a human depth judgment dataset using indoor photographs and compared human depth judgments with those of DNNs. Our results show that humans systematically compress far distances and exhibit distortions resembling viewpoint shifts, which remain remarkably consistent across observers. Strikingly, the better the DNNs were at depth estimation, the more they exhibited human-like biases. This suggests that these seemingly suboptimal human biases may in fact reflect efficient strategies for inferring 3D structure from ambiguous 2D inputs. However, we also found a limit: while DNNs mimicked some human errors, they fell short of humans at judging the relative depth order of objects, especially once viewpoint distortions were accounted for. We believe that our dataset and the identification of multiple error factors will drive further comparative studies between humans and DNNs, facilitating model evaluations that go beyond simple accuracy to uncover how depth perception truly works, and how it might best be replicated in computational models.
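One way to make the ordinal comparison concrete: score the fraction of point pairs whose depth order agrees between judgments (or model predictions) and ground truth, after each image's best-fitting affine transform has been removed. The sketch below assumes that reading of "affine-invariant"; the margin parameter and all names are illustrative rather than the paper's code.

```python
# Minimal sketch of pairwise ordinal depth accuracy, assuming inputs are
# affine-corrected depths (e.g., residuals from a per-image affine fit).
import numpy as np
from itertools import combinations

def ordinal_accuracy(pred, truth, margin=0.0):
    """Fraction of point pairs whose depth order agrees between pred and truth.

    Pairs whose true depth difference is within `margin` are skipped,
    mirroring the common practice of ignoring near-ties.
    """
    correct, total = 0, 0
    for i, j in combinations(range(len(truth)), 2):
        dt = truth[i] - truth[j]
        if abs(dt) <= margin:
            continue
        correct += (dt > 0) == (pred[i] - pred[j] > 0)
        total += 1
    return correct / total if total else float("nan")

# Example: noisy but mostly order-preserving predictions score close to 1.
rng = np.random.default_rng(1)
truth = rng.uniform(0.5, 8.0, 50)
pred = truth + rng.normal(0, 0.5, 50)
print(ordinal_accuracy(pred, truth, margin=0.1))
```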
