Scaling Open-Ended Survey Coding: An LLM Pipeline Where Definitions Do the Heavy Lifting

Abstract

As large language model (LLM)–based text classification becomes routine in the social sciences, researchers confront dozens of competing models, inconsistent advice on prompting, and little standardized tooling with evidence‑based defaults. CatLLM, an open‑source Python and R package, addresses this gap with a three‑stage pipeline—exploration, extraction, classification—for coding open‑ended survey responses. The package offers a provider‑agnostic interface that supports multi‑model ensembles, batch processing, and fully local deployment via open‑weight models, and can be operated through a conversational interface by researchers with no programming experience. CatLLM’s defaults are calibrated by a systematic empirical study evaluating 21 LLMs across three capability tiers, six providers, and four survey questions, benchmarked against sociologist‑coded ground truth. This validation reveals a consistent problem: all models over‑classify, with precision lagging 40–50 percentage points behind sensitivity, implying that default LLM configurations may substantially overstate category prevalence. CatLLM encodes empirically grounded mitigations as defaults: verbose category definitions with explicit inclusion and exclusion criteria, unanimous multi‑model ensembling, and an automatic “Other” escape‑valve category, while disabling advanced prompting strategies that show no reliable benefit. Ensembles of inexpensive open‑weight models outperform the best individual cloud model, enabling fully local deployment without transmitting survey data to external servers. These findings replicate on two independent public datasets spanning political and emotional text, and an applied example linking tool‑coded “move reasons” to respondent demographics uncovers distinct life‑course patterns in residential mobility.
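The unanimous-ensemble default described above can be illustrated with a minimal sketch. This is not CatLLM's actual API; the function name and the example votes are hypothetical, and only the decision rule, assign a category when every ensemble member agrees and fall back to "Other" otherwise, follows the abstract.

```python
def unanimous_label(votes, fallback="Other"):
    """Return a category only if every model voted for it; else the fallback.

    Mirrors the unanimous multi-model ensembling default: agreement across
    all members assigns the label, disagreement routes the response to the
    "Other" escape-valve category, trading sensitivity for precision.
    """
    if not votes:
        return fallback
    return votes[0] if len(set(votes)) == 1 else fallback


# Hypothetical votes from three open-weight models for one survey response:
print(unanimous_label(["family reasons", "family reasons", "family reasons"]))
# agreement -> "family reasons"
print(unanimous_label(["family reasons", "family reasons", "job change"]))
# disagreement -> "Other"
```

Because all models were observed to over-classify, requiring unanimity suppresses spurious positives at the cost of some recall, which is why the package pairs it with the automatic "Other" category rather than forcing a substantive label.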
