Synthetic respondents and the illusion of human data: Rethinking online data collection in the age of generative AI
Abstract
Online surveys and browser-based behavior experiments rest on a simple inference: careful, coherent responding is taken as evidence of human participation. I argue that this inference is no longer safe. Autonomous AI agents—and participants who use generative AI as a writing or decision aid—can now produce responses that pass many (and sometimes nearly all) conventional quality checks, yielding data that look human-generated while embedding systematic, model-shaped distortions. Drawing on emerging evidence, I describe how even modest contamination can shift estimated public opinion, compress attitudinal extremes, and create self-reinforcing feedback loops in which AI-influenced “human” data become inputs to future models and future measurement. Because detection is an asymmetric arms race that researchers are structurally positioned to lose, I recommend a shift from bot-hunting to infrastructure redesign: contamination-aware inference and sensitivity analyses, clearer distinctions between exploratory convenience samples and high-assurance confirmatory sampling, and journal/platform incentives that treat data authentication as shared scientific infrastructure in an AI-saturated research ecosystem.
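The abstract's central quantitative claim, that even modest contamination can shift estimated opinion and compress attitudinal extremes, can be illustrated with a toy simulation. This is a hypothetical sketch, not an analysis from the paper: the sample size, the 10% contamination rate, and the response distributions (polarized "human" responses versus midpoint-clustered, slightly agreement-skewed "AI" responses) are all illustrative assumptions.

```python
import random
import statistics

random.seed(0)

# Illustrative parameters (assumptions, not figures from the paper).
N = 10_000
contamination = 0.10  # 10% synthetic respondents

# "Human" responses on a 1-7 scale: polarized, wide spread.
human = [random.choice([1, 2, 3, 5, 6, 7])
         for _ in range(int(N * (1 - contamination)))]

# "AI" responses: clustered near the midpoint and skewed toward mild
# agreement, mimicking a model-shaped central tendency.
ai = [random.choice([4, 5, 5, 5, 6])
      for _ in range(int(N * contamination))]

mixed = human + ai

# The contaminated sample shows a shifted mean and a compressed
# standard deviation relative to the human-only sample.
print(f"human-only:   mean={statistics.mean(human):.2f}, "
      f"sd={statistics.stdev(human):.2f}")
print(f"contaminated: mean={statistics.mean(mixed):.2f}, "
      f"sd={statistics.stdev(mixed):.2f}")
```

Because the synthetic responses sit near the scale midpoint, the mixed sample's standard deviation falls below the human-only sample's, and its mean drifts toward the model's preferred region, even though 90% of the data are unchanged.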