Beyond Data Moore’s Law: Towards Sustainable Scaling of Foundation Models

Abstract

The recent progress of large language and multimodal models has been widely attributed to a de facto “data Moore’s law”: as model parameters and training tokens increase, performance improves in a predictable manner across diverse benchmarks. However, this paradigm is rapidly approaching multiple limits. High-quality web-scale text is close to saturation, additional data is increasingly redundant, and the financial and environmental costs of continued brute-force scaling are becoming unsustainable. At the same time, the capabilities that matter most for science, engineering, and society—robust reasoning, continual learning, and safe deployment—do not simply emerge from ever-larger piles of uncurated data. In this Perspective, we argue that the next phase of foundation model development must shift from maximizing data volume to optimizing effective information content and ecosystem design. We first analyse the empirical and conceptual constraints of current scaling practices, including data exhaustion, diminishing returns, and misalignment between benchmarks and real-world tasks [1–9]. We then expand the lens from single models to multi-agent LLM ecosystems, where collections of interacting agents, tools, and environments form scalable scientific workflows [10–14]. Drawing an explicit analogy to complex interfacial phenomena in fluid mechanics and condensation—where macroscopic behaviour emerges from local interactions among droplets, contact lines, and patterned substrates—we show how architectural heterogeneity, controlled pinning, active gradients, confinement, and phase-diagram thinking provide concrete design principles for multi-agent systems [15–24]. Building on recent advances in data-centric AI and synthetic data scaling [25–33], we propose a framework that decomposes data quality into four dimensions—coverage, compositionality, conflict, and controllability—and argue that these, rather than raw token counts, will define a realistic “Moore’s law of data” for the next decade. Finally, we discuss implications for evaluation and governance, including holistic multi-agent benchmarks, ecosystem-level documentation, and alignment methods that treat scientific LLM ecosystems as institutions in their own right [34–38]. Rather than asking how many more tokens we can consume, we suggest that the central question for the coming decade is how to build sustainable, data-efficient, and well-governed ecosystems in which models, experiments, and human communities co-evolve.
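As a purely illustrative sketch, not taken from the article, the snippet below shows one way the proposed four data-quality dimensions (coverage, compositionality, conflict, and controllability) might be operationalised as a per-dataset profile with a composite score. The class name DataQualityProfile, the method effective_quality, and the equal weighting are assumptions introduced here for illustration only; the article does not specify any scoring formula.

    # Hypothetical illustration (not from the article): scoring a dataset along the
    # four data-quality dimensions named in the abstract. All names and weights are
    # placeholders chosen for this sketch.
    from dataclasses import dataclass

    @dataclass
    class DataQualityProfile:
        coverage: float          # fraction of the target task/domain space represented, in [0, 1]
        compositionality: float  # how well samples recombine into novel, valid compositions, in [0, 1]
        conflict: float          # rate of contradictory or inconsistent labels/claims, in [0, 1]
        controllability: float   # ease of steering curation/generation toward desired properties, in [0, 1]

        def effective_quality(self) -> float:
            """Composite score: reward coverage, compositionality, and controllability;
            penalise internal conflict. Equal weighting is an arbitrary placeholder."""
            positive = (self.coverage + self.compositionality + self.controllability) / 3
            return positive * (1.0 - self.conflict)

    # Example: a large uncurated web scrape versus a smaller curated corpus.
    web_scrape = DataQualityProfile(coverage=0.9, compositionality=0.4, conflict=0.3, controllability=0.2)
    curated_set = DataQualityProfile(coverage=0.6, compositionality=0.8, conflict=0.05, controllability=0.9)
    print(f"web scrape:  {web_scrape.effective_quality():.2f}")
    print(f"curated set: {curated_set.effective_quality():.2f}")

Under these assumed weights the smaller curated corpus scores higher than the larger scrape, which is the kind of comparison (effective information content rather than raw token count) the abstract argues should guide future scaling decisions.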
