Learning to See the Elephant in the Room: Self-Supervised Context Reasoning in Humans and AI


Abstract

A tiny object on the table might be a fork, but not an elephant. Humans rarely perceive objects in isolation; instead, they interpret scenes through relationships among co-occurring elements. But how do humans learn these contextual associations? We address this question through a series of human psychophysics experiments. First, we designed a set of context rules by replacing familiar household objects with novel ones. These rules capture different types of associations: global context (e.g., a toothbrush typically appears in bathrooms), local context (e.g., a fork often appears near a plate, even across rooms), and crowding effects (e.g., eggs tend to cluster together). Participants were exposed to short training videos showing these novel objects embedded in naturalistic scenes. We then tested their contextual reasoning using a “lift-the-flap” task, where the central object was hidden, and participants had to infer its identity based on the surrounding context. We also introduced contextual variations in the task by changing the size, resolution, and spatial arrangement of the scene context. Results show that humans can acquire contextual rules in a self-supervised manner without labels or feedback, and can robustly infer the hidden object across a range of contextual variations. To model this human capability, we introduce SeCo (Self-supervised learning for Context reasoning), a novel computational model for learning contextual associations. SeCo first identifies candidate target regions, then encodes the target and surrounding context using separate vision encoders. Inspired by semantic memory in biological brains, SeCo includes a learnable external memory module that stores latent contextual priors. Given a contextual cue, SeCo infers the identity of a hidden object by retrieving a likely object representation from this memory and regressing it toward the actual target. 
In contrast to existing self-supervised learning (SSL) methods that focus on object-centric learning from single-object images, SeCo explicitly learns contextual relationships from complex scenes. Our results show that SeCo outperforms state-of-the-art SSL methods on the lift-the-flap task. Network analysis reveals that its external memory stores meaningful contextual knowledge, enabling accurate inference. We also extend the evaluation of context reasoning in SeCo, state-of-the-art SSL methods, and humans to object priming tasks, in which each is asked to place target objects in context-appropriate locations. Among the models, SeCo's predicted object placements align most closely with human behavior. In learning to see the elephant in the room, both humans and SeCo reveal that scene understanding arises not from objects alone, but from the contextual associations that bind them together.
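
To make the memory-retrieval idea in the abstract concrete, here is a minimal sketch of one way it could work. This is not the authors' implementation. It assumes the external memory is a set of key/value slots read with soft attention, and that "regressing toward the actual target" means a mean-squared-error loss between the retrieved vector and the target embedding. The abstract specifies neither detail. All names (`retrieve`, `regression_loss`, `keys`, `values`) and the toy vectors are hypothetical, and the vision encoders are replaced by hand-written embeddings.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve(context_emb, keys, values):
    """Soft-attention read from an external memory (assumed mechanism).

    The context embedding is compared with every memory key; the result is
    the attention-weighted average of the corresponding memory values.
    """
    weights = softmax([dot(context_emb, k) for k in keys])
    dim = len(values[0])
    read = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return read, weights


def regression_loss(predicted, target_emb):
    """Mean squared error between the retrieved vector and the target embedding."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target_emb)) / len(predicted)


# Toy memory with two slots (hypothetical values, standing in for learned priors).
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

# Stand-ins for encoder outputs: a context cue and the hidden target's embedding.
context_emb = [5.0, 0.0]
target_emb = [1.0, 0.0, 0.0]

predicted, weights = retrieve(context_emb, keys, values)
loss = regression_loss(predicted, target_emb)
```

In this toy setup the context cue matches the first memory key, so most of the attention weight falls on the first slot and the loss against the target embedding is small. In a trained model, the keys, values, and encoders would all be learned rather than fixed by hand.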
