The threat of synthetic respondents extends to clinical mental health screening
Abstract
Importance: Large language models (LLMs) can autonomously complete online surveys with human-like plausibility, evade standard quality-control checks, and do so at negligible cost. This poses a serious threat to the integrity of remotely collected research data. Whether this threat extends to validated clinical psychiatric screening instruments, which serve as gatekeepers for study eligibility and primary outcome measures in mental health research, has not yet been examined.

Objective: To determine whether a commercially available LLM, provided only with brief diagnostic persona descriptions, can produce clinically differentiated and severity-sensitive responses across a broad battery of validated psychiatric screening instruments.

Design, Setting, and Participants: This simulation study used Google Gemini 2.0 Flash to generate 2106 unique synthetic personas from 13 DSM-informed clinical diagnostic profiles, three severity levels, and demographic variables. Diagnostic profiles spanned mood disorders, anxiety disorders, obsessive-compulsive disorders, psychotic disorders, eating disorders, and neurodegenerative conditions.

Main Outcomes and Measures: Scores on seven validated clinical instruments administered to each persona: the Patient Health Questionnaire–9 (PHQ-9), Generalized Anxiety Disorder–7 (GAD-7), Obsessive-Compulsive Inventory–Revised (OCI-R), PTSD Checklist for DSM-5 (PCL-5), Mood Disorder Questionnaire (MDQ), Eating Disorder Examination Questionnaire (EDE-Q), and Prodromal Questionnaire–16 (PQ-16). The primary outcome was whether synthetic personas generated diagnosis-congruent scores exceeding established clinical cutoffs.

Results: LLM-generated personas produced clinically differentiated patterns of diagnostic specificity across groups and instruments. Five of seven instruments showed significantly higher scores among diagnosis-congruent personas (all P < .001), whereas the two instruments typically elevated across multiple disorders (PHQ-9 and GAD-7) showed no differentiation between target and nontarget clinical controls (P = .915 and P = .306, respectively). Scores increased monotonically with assigned severity across all seven instruments (all P < .001).

Conclusions and Relevance: A standard commercial LLM, given only brief diagnostic descriptions, can generate clinically plausible and severity-sensitive responses on validated psychiatric screening instruments without any specialized configuration. Coherent symptom endorsement above clinical thresholds can no longer serve as a proxy for authentic participation when recruiting online samples, making the development of detection methods calibrated to clinical response patterns an urgent priority.
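To make the described design concrete, the sketch below shows one way a persona grid and screening pass of this kind could be organized. It is illustrative only: the example profiles, prompt wording, `query_llm` stub, and PHQ-9 cutoff check are assumptions standing in for the authors' actual prompts and the Gemini 2.0 Flash API, not the study's code.

```python
from itertools import product

# Illustrative fragments of the design described in the abstract:
# 13 DSM-informed diagnostic profiles x 3 severity levels x demographic
# variants. The stated total of 2106 personas implies 54 demographic
# variants (13 * 3 * 54 = 2106); only a few examples are listed here.
PROFILES = ["major depressive disorder", "generalized anxiety disorder", "OCD"]
SEVERITIES = ["mild", "moderate", "severe"]
DEMOGRAPHICS = [{"age": 24, "gender": "female"}, {"age": 57, "gender": "male"}]

# PHQ-9: nine items scored 0-3; a total score >= 10 is a common clinical cutoff.
PHQ9_ITEMS = 9
PHQ9_CUTOFF = 10

def query_llm(prompt: str) -> list[int]:
    """Placeholder for a call to an LLM API (e.g., Gemini 2.0 Flash).

    A real run would send `prompt` to the model and parse nine integer
    item scores (0-3) from its reply. This stub returns a fixed response
    so the sketch runs without network access or API keys.
    """
    return [2] * PHQ9_ITEMS  # stand-in: moderate endorsement on every item

def screen_persona(profile: str, severity: str, demo: dict) -> bool:
    """Administer a PHQ-9-style prompt to one persona and apply the cutoff."""
    prompt = (
        f"You are a {demo['age']}-year-old {demo['gender']} with "
        f"{severity} {profile}. Answer each PHQ-9 item with a number 0-3."
    )
    scores = query_llm(prompt)
    return sum(scores) >= PHQ9_CUTOFF  # diagnosis-congruent above-cutoff score?

if __name__ == "__main__":
    # Cross product of profiles, severities, and demographics yields the grid.
    for profile, severity, demo in product(PROFILES, SEVERITIES, DEMOGRAPHICS):
        flagged = screen_persona(profile, severity, demo)
        print(f"{severity} {profile} ({demo['age']}, {demo['gender']}): "
              f"{'above' if flagged else 'below'} PHQ-9 cutoff")
```

Under this framing, the study's primary outcome corresponds to whether `screen_persona` returns above-cutoff scores for diagnosis-congruent instruments, and its severity analysis corresponds to scores rising monotonically from "mild" to "severe".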