Leveraging AI for Automatic Item Generation for Psychological Scales
Abstract
This study examined how large language models (LLMs) can be used for automated item generation (AIG) in psychological scale development, focusing on how prompting strategies and the choice of LLM provider and version influence item quality for both an established and a novel construct. Using three prompting conditions across eight LLMs from four major providers, we generated items and evaluated them through expert review, AI-assisted psychometric analyses, and lexical diversity metrics. The findings highlight both the potential and the limitations of AIG. Many AI-generated items were clear, concise, and theoretically meaningful, and for the novel construct, some items captured nuanced facets beyond the researcher’s initial conceptualization. At the same time, other items were overly general or tapped unintended dimensions, underscoring the need for careful human screening. Prompt engineering was the strongest determinant of item quality, requiring deliberate design to avoid reproducing existing scale items and to achieve the desired balance of specificity and conciseness. Overall, AIG appears to be a useful tool for early-stage scale development when paired with thoughtful prompt design and human oversight.