Inferring the Public Mind: Accuracy and Biases in Out-of-Sample Public Opinion Estimation with Large Language Models


Abstract

The emergence of large language models (LLMs) offers a promising, cost-effective alternative for assessing public opinion. However, most prior research has focused on simulating individual personas using past surveys, leaving it unclear whether LLMs can accurately estimate out-of-sample public opinion at the societal level. We address this gap with three studies that systematically evaluate the accuracy, bias, and variability of commercial and non-commercial LLMs in inferring societal-level public opinion. First, we evaluated GPT-3.5 and GPT-4 on four recent Pew Research Center polls, finding that both models outperformed traditional baseline methods, with GPT-4 achieving higher accuracy (error rates of 4% to 9%). Notably, GPT-3.5 produced overly uniform responses characterized by lower variance and higher entropy compared to actual human data, whereas GPT-4’s predictions more closely mirrored real response patterns. Both models, however, exhibited partisan skew, aligning more strongly with Democratic perspectives and underrepresenting Republican viewpoints. Second, we assessed the generalisability of these findings by testing additional models (GPT-4o-mini, GPT-4o, and Llama 3.3) on three further Pew surveys. These models demonstrated similar predictive accuracy and a continued tendency toward less varied responses, though partisan bias was less pronounced in the latest models. Third, we examined the robustness of our findings through ablation studies of prompt design, varying parameters such as temperature, role, and input format, and found that the results were consistent across these conditions. Importantly, collective-level estimation consistently outperformed averaged individual-level simulation, which suffered from greater homogeneity bias. Overall, LLMs show strong potential for public opinion estimation, though their inherent biases and methodological limitations warrant careful consideration.
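The variance and entropy comparison described in the abstract can be illustrated with a minimal sketch. The distributions below are hypothetical, not taken from the paper; the point is only that a response distribution closer to uniform has higher Shannon entropy and lower variance across answer options, the pattern the authors report for GPT-3.5.

```python
import math

def entropy(p):
    """Shannon entropy (bits) of a discrete response distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def variance(p):
    """Variance of the option shares around their mean."""
    mean = sum(p) / len(p)
    return sum((x - mean) ** 2 for x in p) / len(p)

# Hypothetical shares across four answer options (each sums to 1).
human     = [0.55, 0.25, 0.15, 0.05]   # peaked, as in real polls
predicted = [0.30, 0.26, 0.24, 0.20]   # overly uniform, as reported for GPT-3.5

# An overly uniform prediction shows higher entropy and lower variance.
assert entropy(predicted) > entropy(human)
assert variance(predicted) < variance(human)
```

The same two metrics, computed per survey question and compared against the human answer shares, suffice to detect the homogeneity bias the abstract describes.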
