Beyond Data Moore’s Law: Towards Sustainable Scaling of Foundation Models

Abstract

The recent progress of large language and multimodal models has been widely attributed to a de facto “data Moore’s law”: as model parameters and training tokens increase, performance improves in a predictable manner across diverse benchmarks. However, this paradigm is rapidly approaching multiple limits. High-quality web-scale text is close to saturation, additional data is increasingly redundant, and the financial and environmental costs of continued brute-force scaling are becoming unsustainable. At the same time, the capabilities that matter most for science, engineering, and society—robust reasoning, continual learning, and safe deployment—do not simply emerge from ever-larger piles of uncurated data. In this Perspective, we argue that the next phase of foundation model development must shift from maximizing data volume to optimizing effective information content and ecosystem design. We first analyse the empirical and conceptual constraints of current scaling practices, including data exhaustion, diminishing returns, and misalignment between benchmarks and real-world tasks [1–9]. We then expand the lens from single models to multi-agent LLM ecosystems, where collections of interacting agents, tools, and environments form scalable scientific workflows [10–14]. Drawing an explicit analogy to complex interfacial phenomena in fluid mechanics and condensation—where macroscopic behaviour emerges from local interactions among droplets, contact lines, and patterned substrates—we show how architectural heterogeneity, controlled pinning, active gradients, confinement, and phase-diagram thinking provide concrete design principles for multi-agent systems [15–24]. Building on recent advances in data-centric AI and synthetic data scaling [25–33], we propose a framework that decomposes data quality into four dimensions—coverage, compositionality, conflict, and controllability—and argue that these, rather than raw token counts, will define a realistic “Moore’s law of data” for the next decade. Finally, we discuss implications for evaluation and governance, including holistic multi-agent benchmarks, ecosystem-level documentation, and alignment methods that treat scientific LLM ecosystems as institutions in their own right [34–38]. Rather than asking how many more tokens we can consume, we suggest that the central question for the coming decade is how to build sustainable, data-efficient, and well-governed ecosystems in which models, experiments, and human communities co-evolve.
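As a purely illustrative sketch, not taken from the article, the snippet below shows one way the proposed four data-quality dimensions (coverage, compositionality, conflict, and controllability) might be operationalised as a per-dataset profile with a composite score. The class name DataQualityProfile, the method effective_quality, and the equal weighting are assumptions introduced here for illustration only; the article does not specify any scoring formula.

    # Hypothetical illustration (not from the article): scoring a dataset along the
    # four data-quality dimensions named in the abstract. All names and weights are
    # placeholders chosen for this sketch.
    from dataclasses import dataclass

    @dataclass
    class DataQualityProfile:
        coverage: float          # fraction of the target task/domain space represented, in [0, 1]
        compositionality: float  # how well samples recombine into novel, valid compositions, in [0, 1]
        conflict: float          # rate of contradictory or inconsistent labels/claims, in [0, 1]
        controllability: float   # ease of steering curation/generation toward desired properties, in [0, 1]

        def effective_quality(self) -> float:
            """Composite score: reward coverage, compositionality, and controllability;
            penalise internal conflict. Equal weighting is an arbitrary placeholder."""
            positive = (self.coverage + self.compositionality + self.controllability) / 3
            return positive * (1.0 - self.conflict)

    # Example: a large uncurated web scrape versus a smaller curated corpus.
    web_scrape = DataQualityProfile(coverage=0.9, compositionality=0.4, conflict=0.3, controllability=0.2)
    curated_set = DataQualityProfile(coverage=0.6, compositionality=0.8, conflict=0.05, controllability=0.9)
    print(f"web scrape:  {web_scrape.effective_quality():.2f}")
    print(f"curated set: {curated_set.effective_quality():.2f}")

Under these assumed weights the smaller curated corpus scores higher than the larger scrape, which is the kind of comparison (effective information content rather than raw token count) the abstract argues should guide future scaling decisions.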
