A Primer for Evaluating Large Language Models in Social Science Research
Abstract
Autoregressive Large Language Models (LLMs) exhibit remarkable conversational and reasoning abilities, and exceptional flexibility across a wide range of tasks. Consequently, LLMs are increasingly used in scientific research to analyze data, generate synthetic data, or even write scientific papers. This trend necessitates that authors follow best practices for conducting and reporting LLM research and that journal reviewers are able to evaluate the quality of work that uses LLMs. We provide authors of social scientific research with essential recommendations for ensuring replicable and robust results when using LLMs. Our recommendations also highlight considerations for reviewers, focusing on methodological rigor, replicability, and validity of results when evaluating studies that use LLMs to automate data processing or to simulate human data. We offer practical advice on assessing the appropriateness of LLM applications in submitted studies, emphasizing the need for transparency in methodological reporting and the challenges posed by the non-deterministic and continuously evolving nature of these models. By providing a framework for best practices and critical review, this primer aims to support high-quality, innovative research within the evolving landscape of social science studies using LLMs.