Large language models outperform humans at estimating society's everyday norms—but hybrids are even better
Abstract
As AI assistants and social robots enter human environments, their ability to navigate context-dependent social norms is essential to avoid harm and ensure successful collaboration. We evaluate six large language models (LLMs) on their ability to estimate American social norms across 555 everyday scenarios (measured in prior work) and compare these to estimates from 320 humans. LLMs achieve remarkably high accuracy, clearly outperforming the average human. However, the errors LLMs make are systematic; they are similar across runs of the same LLM and even across different LLMs. As a consequence of this homogeneity, aggregating estimates of LLMs produces little improvement. Individual humans make much worse estimates, often defaulting to extreme right-or-wrong judgments even when asked to estimate population averages, but their errors are idiosyncratic and, consequently, aggregating their estimates yields dramatic improvement through wisdom-of-crowds effects. As humans make different errors than LLMs, hybrid ensembles combining both substantially outperform either alone.
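The aggregation argument in the abstract can be illustrated with a small simulation. This is a minimal sketch on synthetic data, not the study's data or method: the function names (`simulate`, `mean_abs_error`, `crowd_mean`), the noise levels, the number of raters, and the equal-weight hybrid are all assumptions chosen for illustration. "LLM-like" raters share one error per scenario plus a little private noise; "human-like" raters have large but independent errors.

```python
import random
import statistics


def mean_abs_error(estimates, truth):
    """Mean absolute error between estimates and ground truth."""
    return statistics.fmean(abs(e - t) for e, t in zip(estimates, truth))


def crowd_mean(raters):
    """Average the raters' estimates scenario by scenario."""
    return [statistics.fmean(column) for column in zip(*raters)]


def simulate(n_scenarios=2000, n_raters=20, seed=0):
    rng = random.Random(seed)
    # Hypothetical "true" norm ratings on an arbitrary 0-1 scale.
    truth = [rng.uniform(0, 1) for _ in range(n_scenarios)]

    # Assumption: LLM-like raters share one systematic error per scenario
    # (sd 0.10) and add only a little private noise (sd 0.02).
    shared = [rng.gauss(0, 0.10) for _ in range(n_scenarios)]
    llms = [
        [t + s + rng.gauss(0, 0.02) for t, s in zip(truth, shared)]
        for _ in range(n_raters)
    ]

    # Assumption: human-like raters are individually much noisier (sd 0.40),
    # but their errors are independent of each other and of the LLM error.
    humans = [
        [t + rng.gauss(0, 0.40) for t in truth]
        for _ in range(n_raters)
    ]

    llm_crowd = crowd_mean(llms)
    human_crowd = crowd_mean(humans)
    # Assumption: the hybrid is an equal-weight average of the two crowds.
    hybrid = [(a + b) / 2 for a, b in zip(llm_crowd, human_crowd)]

    return {
        "llm_single": mean_abs_error(llms[0], truth),
        "llm_crowd": mean_abs_error(llm_crowd, truth),
        "human_single": mean_abs_error(humans[0], truth),
        "human_crowd": mean_abs_error(human_crowd, truth),
        "hybrid": mean_abs_error(hybrid, truth),
    }


if __name__ == "__main__":
    for name, error in simulate().items():
        print(f"{name:>12}: {error:.3f}")
```

Under these assumed parameters, averaging the LLM-like raters barely reduces error because the shared component does not cancel, averaging the human-like raters reduces error sharply, and the hybrid benefits because its two sources of error are uncorrelated. The size of each effect depends entirely on the chosen noise levels, so the numbers say nothing about the magnitudes reported in the study.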