Chatbots Are Undermining Crowdsourced Research in the Behavioral Sciences: Detecting AI-Assisted Cheating with a Keystroke-Based Tool
Abstract
Generative AI poses a significant threat to data integrity on crowdsourcing platforms like Prolific, which behavioral scientists widely rely on for data collection. Large language models (LLMs) allow users to generate fluent and relevant responses to open-ended questions, which can mask inattention and compromise experimental validity. To empirically estimate the prevalence of this behavior, we analyzed keystroke data from three studies (N = 928) on Prolific between May and July 2025. Using an embedded JavaScript tool, we flagged participants who pasted text or whose keystroke count was anomalously low compared to their response length. For each flagged participant, we manually compared their detected keystrokes to their final response to determine if the text could have been plausibly typed. This process confirmed that, despite deterrence measures, approximately 9% of all participants submitted AI-assisted responses. These participants significantly outperformed non-cheaters (by up to 1.5 SDs), were over twice as likely to share geolocations with other participants (suggesting possible VPN or proxy use), and exhibited lower reliability on questionnaire scales. Simulated power analyses indicate that this level of undetected cheating can diminish observed effect sizes by 10% and inflate required sample sizes by as much as 30%. These findings highlight the urgent need for new detection methods like keystroke logging, which offers verifiable evidence of cheating that is difficult to obtain from manual review of LLM-generated text alone. As AI continues to evolve, maintaining data quality in crowdsourced research will require active monitoring, methodological adaptation, and communication between researchers and data collection platforms.
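To make the flagging logic concrete, the sketch below shows how a browser-side script might record keystrokes, detect paste events, and compare the keystroke count against the length of the submitted text. The element id, event handling, and the 0.5 keystroke-to-length threshold are illustrative assumptions for this sketch, not the tool or cutoff used in the studies described above.

```javascript
// Minimal sketch of keystroke-based flagging, assuming a survey page with a
// textarea whose id is "response" (id and threshold are hypothetical).

const responseBox = document.getElementById("response");

let keystrokes = 0;   // count of character-producing key presses
let pasted = false;   // set to true if any paste event fires

responseBox.addEventListener("keydown", (event) => {
  // Count only keys that insert a single printable character,
  // so modifier keys and shortcuts do not inflate the count.
  if (event.key.length === 1) {
    keystrokes += 1;
  }
});

responseBox.addEventListener("paste", () => {
  pasted = true;
});

// Called at submission time: flag responses that were pasted or whose
// keystroke count is implausibly low relative to the final text length.
function flagResponse() {
  const finalLength = responseBox.value.length;
  const ratio = finalLength > 0 ? keystrokes / finalLength : 1;
  return {
    pasted,
    keystrokes,
    finalLength,
    flagged: pasted || ratio < 0.5, // 0.5 is an illustrative threshold
  };
}
```

As in the procedure summarized above, a flag of this kind would only be a first pass: each flagged record would still be compared manually against the recorded keystrokes to judge whether the final text could plausibly have been typed.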