AI for Survey Design: Generating and Evaluating Survey Questions with Large Language Models


Abstract

Designing survey questions is easy; designing good survey questions, however, is a complex task. Large language models (LLMs) have the potential to support this task by automating parts of the item-generation process, but their suitability for survey research has not yet been systematically evaluated. Published research in this area remains sparse, and little is known about the quality and characteristics of survey items generated by LLMs or the factors influencing their performance. This work provides the first in-depth analysis of LLM-based survey item generation and systematically evaluates how different design choices affect item quality. Five LLMs, namely GPT-4o, GPT-4o-mini, GPT-oss-20B, LLaMA 3.1 8B, and LLaMA 3.1 70B, were used to generate survey items in four substantive domains: work, living conditions, national politics, and recent politics. We additionally evaluated three prompting strategies: zero-shot, role, and chain-of-thought prompting. To assess the quality of the generated survey items, we used the Survey Quality Predictor (SQP), a tool for estimating the quality of attitudinal survey items based on codings of their formal and linguistic characteristics; these characteristics were coded using an LLM-assisted procedure. The findings show striking differences in survey item characteristics across models and prompting techniques: both the choice of model and the prompting technique employed influence the quality of LLM-generated survey items. Closed-source GPT models generally produced more consistent items than open-source LLaMA models, and chain-of-thought prompting achieved the best results overall. GPT-4o, GPT-4o-mini, and LLaMA 3.1 70B achieved similar item quality, while the LLaMA model showed greater variability.
