Synthetic respondents and the illusion of human data: Rethinking online data collection in the age of generative AI

Abstract

Online surveys and browser-based behavior experiments rest on a simple inference: careful, coherent responding is taken as evidence of human participation. I argue that this inference is no longer safe. Autonomous AI agents—and participants who use generative AI as a writing or decision aid—can now produce responses that pass many (and sometimes nearly all) conventional quality checks, yielding data that look human-generated while embedding systematic, model-shaped distortions. Drawing on emerging evidence, I describe how even modest contamination can shift estimated public opinion, compress attitudinal extremes, and create self-reinforcing feedback loops in which AI-influenced “human” data become inputs to future models and future measurement. Because detection is an asymmetric arms race that researchers are structurally positioned to lose, I recommend a shift from bot-hunting to infrastructure redesign: contamination-aware inference and sensitivity analyses, clearer distinctions between exploratory convenience samples and high-assurance confirmatory sampling, and journal/platform incentives that treat data authentication as shared scientific infrastructure in an AI-saturated research ecosystem.