The Social Structure of Scientific Evaluation: AI, Benchmarking, and the Deep Learning Monoculture
Abstract
Evaluation systems are central organizing institutions in science that coordinate consensus and drive epistemic trajectories. Scientific fields have traditionally relied on “organic” evaluation systems (e.g., peer review and citation) in which consensus emerges gradually across multiple epistemic values. This paper presents artificial intelligence research (AIR) as a potent counterpoint to this model. Drawing on interviews with key actors, computational analyses, and archival materials spanning AIR’s history (1956–2021), we examine how AI evolved from a discipline with weak organic evaluation into a field driven by benchmarking, a “formal” evaluation system that defines progress quantitatively as state-of-the-art accuracy on commercial tasks. We demonstrate that benchmarking came to dominate through an intricate symbiosis with deep learning: benchmarking rewards accuracy, at which large-scale deep learning uniquely excelled, while deep learning’s opacity made organic evaluation increasingly difficult. This symbiosis restructured the field organizationally, epistemically, and materially into a “monoculture” dedicated to scaling. While enabling breakneck progress, this monoculture discouraged exploration of alternative approaches with different epistemic strengths. As AI spreads to other knowledge fields, from science to law to art, benchmarking will accompany it. Our findings thus highlight the risk that the formalization of evaluation can lead to monoculture in other creative domains.