An Empirical Investigation into the Utility of Large Language Models in Open-Ended Survey Data Categorization

Abstract

Can social scientists use large language models (LLMs) to code open-ended survey responses across complexity levels? Using the UC Berkeley Social Networks Study as a test case, this study compares GPT-4o, Claude Sonnet 3.7, the Llama 3.1 variant Sonar Large, and Mistral Large against human annotators. The two proprietary models outperform the open-source alternatives, achieving 97% accuracy on straightforward questions and 88-91% on complex interpretive tasks. The open-source models nonetheless perform reasonably well, reaching 95-96% accuracy on relatively straightforward questions and up to 87% on more complex ones. Performance analysis revealed response brevity, but not category brevity, to be a strong determinant of successful classification, with responses under 50 characters showing 7-11% higher classification accuracy. While the models effectively detected nuance in simpler tasks, they struggled with responses containing multiple reasons, narratives, and implicit meanings. The findings imply that social science researchers who want to use LLMs should design questions that elicit concise responses (10-50 characters), implement human-in-the-loop review for complex tasks, and select models appropriate to the complexity of the task. Minimal demographic variation in classification accuracy was observed, with Claude uniquely maintaining consistent performance across population segments. More concerning methodologically, some models occasionally produced different social narratives from the open-ended classifications despite high accuracy metrics.
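To make the workflow concrete, the sketch below shows one way an LLM-based categorization step of this kind could look in practice. It is an illustrative example only, not the study's pipeline: the category labels, prompt wording, and the choice of the OpenAI Python client with the gpt-4o model are assumptions for illustration.

```python
# Illustrative sketch only: one way to ask an LLM to assign a single category
# to an open-ended survey response. The categories, prompt, and model choice
# are hypothetical and are not taken from the study's actual coding scheme.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CATEGORIES = ["family", "friend", "coworker", "neighbor", "other"]  # hypothetical codebook

def categorize_response(response_text: str) -> str:
    """Ask the model to assign exactly one category label to an open-ended response."""
    prompt = (
        "You are coding open-ended survey answers about social ties.\n"
        f"Assign exactly one of these categories: {', '.join(CATEGORIES)}.\n"
        "Reply with the category label only.\n\n"
        f"Response: {response_text}"
    )
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # reduce output variability for more reproducible coding
    )
    label = completion.choices[0].message.content.strip().lower()
    # Route anything outside the codebook to human review, mirroring the
    # human-in-the-loop step the abstract recommends for complex tasks.
    return label if label in CATEGORIES else "needs_human_review"

if __name__ == "__main__":
    print(categorize_response("My sister, we talk every week"))
```

In a full evaluation, labels produced this way would be compared against human annotations to compute the kind of accuracy figures reported above.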
