An Empirical Investigation into the Utility of Large Language Models in Open-Ended Survey Data Categorization

Abstract

Can social scientists use large language models (LLMs) to code open-ended survey responses across complexity levels? Using the UC Berkeley Social Networks Study as a test case, this study compares GPT-4o, Claude Sonnet 3.7, the Llama 3.1 variant Sonar Large, and Mistral Large against human annotators. The two proprietary models outperform the open-source alternatives, achieving 97% accuracy on straightforward questions and 88-91% on complex interpretive tasks. The open-source models nonetheless perform reasonably well, reaching 95-96% accuracy on relatively straightforward questions and up to 87% on more complex ones. Performance analysis revealed response brevity, but not category brevity, to be a strong determinant of successful classification, with responses under 50 characters showing 7-11% higher classification accuracy. While the models effectively detected nuance in simpler tasks, they struggled with responses containing multiple reasons, narratives, and implicit meanings. The findings imply that social science researchers who want to use LLMs should design questions that elicit concise responses (10-50 characters), implement human-in-the-loop review for complex tasks, and select models appropriate to the complexity of the task. Minimal demographic variation in classification accuracy was observed, with Claude uniquely maintaining consistent performance across population segments. More concerning methodologically, some models occasionally produced different social narratives from the open-ended classifications despite high accuracy metrics.
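To make the workflow concrete, the sketch below shows one way an LLM-based categorization step of this kind could look in practice. It is an illustrative example only, not the study's pipeline: the category labels, prompt wording, and the choice of the OpenAI Python client with the gpt-4o model are assumptions for illustration.

```python
# Illustrative sketch only: one way to ask an LLM to assign a single category
# to an open-ended survey response. The categories, prompt, and model choice
# are hypothetical and are not taken from the study's actual coding scheme.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CATEGORIES = ["family", "friend", "coworker", "neighbor", "other"]  # hypothetical codebook

def categorize_response(response_text: str) -> str:
    """Ask the model to assign exactly one category label to an open-ended response."""
    prompt = (
        "You are coding open-ended survey answers about social ties.\n"
        f"Assign exactly one of these categories: {', '.join(CATEGORIES)}.\n"
        "Reply with the category label only.\n\n"
        f"Response: {response_text}"
    )
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # reduce output variability for more reproducible coding
    )
    label = completion.choices[0].message.content.strip().lower()
    # Route anything outside the codebook to human review, mirroring the
    # human-in-the-loop step the abstract recommends for complex tasks.
    return label if label in CATEGORIES else "needs_human_review"

if __name__ == "__main__":
    print(categorize_response("My sister, we talk every week"))
```

In a full evaluation, labels produced this way would be compared against human annotations to compute the kind of accuracy figures reported above.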
