How does generative artificial intelligence compare to human analysts in qualitative research? A systematic review of large language models
Abstract
Introduction: Qualitative methods offer rich insights into human experience, yet the expanding volume of textual data presents analytic challenges. Generative AI, particularly large language models (LLMs), may help researchers organize and interpret qualitative data while maintaining depth and accuracy, making it important to assess the validity and practical suitability of LLMs for qualitative analysis. This systematic review evaluated the extent to which LLMs, specifically generative transformer models such as ChatGPT, can identify themes and patterns consistent with those of human analysts.

Methods: Studies published up to June 2025 were retrieved from PubMed, PsycINFO, Web of Science, Google Scholar, and the reference lists of included studies. Eligible studies directly compared LLM- and human-generated outputs derived from thematic, narrative, or discourse analyses.

Results: Thirty-nine studies met inclusion criteria. Most used versions of ChatGPT (70.8%); others used Gemini (8.3%) or Claude (4.2%), among other models. Comparative analyses spanned diverse disciplines, though standardized protocols for conducting LLM-assisted qualitative analysis were lacking. Overall, LLM performance relative to human analysts was rated as moderate to high. Among studies reporting quantitative indices, the median percent agreement was 0.80 (range: 0.19-1.00), the median Cohen's kappa was 0.74 (range: 0.38-0.82), and the median cosine similarity coefficient was 0.58 (range: 0.47-0.80), with stronger alignment for semantic or descriptive themes and lower accuracy for interpretive, context-dependent themes.

Discussion: Findings support the use of LLMs as complementary tools in qualitative data analysis. Rather than replacing human interpretation, LLMs may serve as efficient aids for data organization and inter-rater comparison that support, but do not substitute for, sustained qualitative engagement.
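For readers unfamiliar with the agreement indices reported above, the following is a minimal illustrative sketch, not drawn from any of the reviewed studies, of how percent agreement, Cohen's kappa, and cosine similarity are commonly computed when comparing human- and LLM-assigned theme codes. All codes, theme summaries, and variable names here are hypothetical, and the TF-IDF representation is just one possible way to vectorize theme descriptions.

```python
# Hypothetical example: comparing human vs. LLM theme codes on six excerpts.
from sklearn.metrics import cohen_kappa_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical per-excerpt codes assigned by a human analyst and an LLM.
human_codes = ["coping", "stigma", "coping", "access", "stigma", "coping"]
llm_codes   = ["coping", "stigma", "access", "access", "stigma", "coping"]

# Percent agreement: proportion of excerpts given the same code.
percent_agreement = sum(h == l for h, l in zip(human_codes, llm_codes)) / len(human_codes)

# Cohen's kappa: agreement corrected for chance, kappa = (p_o - p_e) / (1 - p_e).
kappa = cohen_kappa_score(human_codes, llm_codes)

# Cosine similarity: semantic overlap between two theme summaries,
# approximated here with TF-IDF vectors of hypothetical summary texts.
human_summary = "participants describe coping strategies and stigma around access to care"
llm_summary = "themes of stigma, coping behaviours, and barriers to accessing care"
tfidf = TfidfVectorizer().fit_transform([human_summary, llm_summary])
cos = cosine_similarity(tfidf[0], tfidf[1])[0, 0]

print(f"percent agreement: {percent_agreement:.2f}")
print(f"Cohen's kappa:     {kappa:.2f}")
print(f"cosine similarity: {cos:.2f}")
```

Note that these three indices answer different questions: percent agreement and kappa assess code-level concordance on the same excerpts, whereas cosine similarity gauges semantic overlap between independently produced theme descriptions, which is why the review reports them separately.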