Learning to See the Elephant in the Room: Self-Supervised Context Reasoning in Humans and AI

Mengmi Zhang
Xiao Liu
Soumick Sarker
Ankur Sikarwar
Bryan Kiely
Gabriel Kreiman
Zenglin Shi

Abstract

A tiny object on the table might be a fork, but not an elephant. Humans rarely perceive objects in isolation; instead, they interpret scenes through relationships among co-occurring elements. But how do humans learn these contextual associations? We address this question through a series of human psychophysics experiments. First, we designed a set of context rules by replacing familiar household objects with novel ones. These rules capture different types of associations: global context (e.g., a toothbrush typically appears in bathrooms), local context (e.g., a fork often appears near a plate, even across rooms), and crowding effects (e.g., eggs tend to cluster together). Participants were exposed to short training videos showing these novel objects embedded in naturalistic scenes. We then tested their contextual reasoning using a “lift-the-flap” task, where the central object was hidden, and participants had to infer its identity based on the surrounding context. We also introduced contextual variations in the task by changing the size, resolution, and spatial arrangement of the scene context. Results show that humans can acquire contextual rules in a self-supervised manner without labels or feedback, and can robustly infer the hidden object across a range of contextual variations. To model this human capability, we introduce SeCo (Self-supervised learning for Context reasoning), a novel computational model for learning contextual associations. SeCo first identifies candidate target regions, then encodes the target and surrounding context using separate vision encoders. Inspired by semantic memory in biological brains, SeCo includes a learnable external memory module that stores latent contextual priors. Given a contextual cue, SeCo infers the identity of a hidden object by retrieving a likely object representation from this memory and regressing it toward the actual target. 
In contrast to existing self-supervised learning (SSL) methods that focus on object-centric learning from single-object images, SeCo explicitly learns contextual relationships from complex scenes. Our results show that SeCo outperforms state-of-the-art SSL methods on the lift-the-flap task. Network analysis reveals that its external memory stores meaningful contextual knowledge, enabling accurate inference. We also evaluate SeCo, state-of-the-art SSL methods, and humans on object priming tasks, in which they are asked to place target objects in context-appropriate locations. SeCo predicts object placements most closely aligned with human behavior. In learning to see the elephant in the room, both humans and SeCo reveal that scene understanding arises not from objects alone, but from the contextual associations that bind them together.
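
The retrieval step described in the abstract (a contextual cue queries an external memory, and the retrieved representation is regressed toward the actual target) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the memory's form, so the soft attention over memory slots, the mean-squared-error objective, all function names, and all dimensions below are assumptions chosen for illustration, and random vectors stand in for the outputs of the context and target encoders.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def retrieve_from_memory(context_emb, memory_keys, memory_values):
    """Soft lookup (assumed mechanism, not taken from the paper).

    context_emb:   (batch, d)  embedding of the visible context
    memory_keys:   (slots, d)  one key per memory slot
    memory_values: (slots, d)  one stored representation per slot
    """
    # Scaled dot-product similarity between each context and each slot.
    scores = context_emb @ memory_keys.T / np.sqrt(context_emb.shape[-1])
    weights = softmax(scores)            # (batch, slots), each row sums to 1
    predicted = weights @ memory_values  # weighted mix of stored representations
    return predicted, weights

def regression_loss(predicted, target_emb):
    # Mean squared error between retrieved and actual target embeddings.
    return float(np.mean((predicted - target_emb) ** 2))

rng = np.random.default_rng(0)
batch, d, slots = 4, 8, 16  # hypothetical sizes

# Stand-ins for encoder outputs; a real system would compute these from images.
context_emb = rng.normal(size=(batch, d))
target_emb = rng.normal(size=(batch, d))

# In a trained model these would be learnable parameters.
memory_keys = rng.normal(size=(slots, d))
memory_values = rng.normal(size=(slots, d))

predicted, weights = retrieve_from_memory(context_emb, memory_keys, memory_values)
loss = regression_loss(predicted, target_emb)
print(predicted.shape, weights.shape, loss)
```

In a trainable version of this sketch, gradients of the loss would update the memory and the context encoder, so that slots come to hold object representations that are likely given a context.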

Version published to 10.21203/rs.3.rs-8942453/v1 on Research Square
Apr 2, 2026

Related articles

Multimodal large language models converge on the human-like geometry of abstract emotion

This article has 7 authors:
1. Huiguang He
2. Changde Du
3. Yizhuo Lu
4. Zhongyu Huang
5. Yi Sun
6. Zisen Zhou
7. Shaozheng Qin
This article has no evaluations. Latest version: Apr 2, 2026

The First 1000 Days: An Agent-Based Model of Early Language Acquisition

This article has 11 authors:
1. Hadas Raviv
2. Arkadii Tsyhanov
3. Kira Gousios
4. Aja Altenhof
5. Haocheng Wang
6. Berlin Chen
7. Ofri Raviv
8. Tal Rosenwein
9. Casey Lew-Williams
10. Liat Hasenfratz
11. Uri Hasson
This article has no evaluations. Latest version: Mar 31, 2026

How we draw and recognize things that don’t exist

This article has 5 authors:
1. Emily J. A-Izzeddin
2. Filipp Schmidt
3. Christian Houborg
4. Henning Tiedemann
5. Roland W. Fleming
This article has no evaluations. Latest version: Apr 23, 2026
