It Knew Too Much: On the Unsuitability of LLMs as Replacements for Human Subjects


Abstract

Psychometric and moral benchmarks are increasingly used to evaluate large language models (LLMs), aiming to measure their capabilities, surface implicit biases, and assess their alignment with human values. However, interpreting LLM responses to these benchmarks is methodologically challenging, a nuance often overlooked in the existing literature. We empirically demonstrate that LLM responses to a standard psychometric benchmark (generalized trust from the World Values Survey) correlate strongly with known survey results across language communities. Critically, we observe that LLMs achieve this while explicitly referencing known survey results and the broader literature, even without direct prompting. We further show that these correlations can be amplified or effectively eliminated by subtle changes in evaluation task design, revealing that replicating known results does not validate LLMs as naive subjects. Because LLMs have access to the relevant literature, their ability to replicate known human behavior is an invalid test of their suitability as naive subjects. Fascinating though it may be, this ability provides no evidence of generalizability to novel or out-of-sample behaviors. We discuss implications for alignment research and benchmarking practices.
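
To make the abstract's core analysis concrete, the sketch below shows one way such a correlation could be computed: compare the share of "Most people can be trusted" responses elicited from an LLM in each language with the corresponding known survey share, per language community. This is not the authors' code, and all numbers are illustrative placeholders rather than real World Values Survey or LLM results; only the standard SciPy correlation functions are assumed.

```python
# Minimal sketch (not the authors' code): correlating hypothetical LLM-elicited
# responses to the WVS generalized trust item with known survey shares.
# All numbers below are placeholders, not real WVS or LLM results.
from scipy.stats import pearsonr, spearmanr

# WVS generalized trust item: share answering "Most people can be trusted".
# Keys are language communities; values are illustrative placeholders.
known_survey_share = {
    "Norwegian": 0.74, "Dutch": 0.58, "German": 0.42,
    "Spanish": 0.19, "Turkish": 0.14, "Portuguese": 0.07,
}

# Hypothetical shares obtained by prompting an LLM with the same item in each
# language and aggregating its sampled answers.
llm_estimated_share = {
    "Norwegian": 0.70, "Dutch": 0.55, "German": 0.45,
    "Spanish": 0.22, "Turkish": 0.18, "Portuguese": 0.10,
}

languages = sorted(known_survey_share)
x = [known_survey_share[lang] for lang in languages]
y = [llm_estimated_share[lang] for lang in languages]

r, p = pearsonr(x, y)          # linear agreement between the two sets of shares
rho, p_rho = spearmanr(x, y)   # rank-order agreement
print(f"Pearson r = {r:.2f} (p = {p:.3f}), Spearman rho = {rho:.2f} (p = {p_rho:.3f})")
```

As the abstract notes, a strong correlation in such an analysis does not by itself validate the LLM as a naive subject, since the model may be drawing directly on published survey results rather than exhibiting the measured trait.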
