Re-Thinking Training Data: From Tokens and Parameters to Tasks and Capabilities
Abstract
Current large language model (LLM) practice is organized around two primary axes: the number of tokens consumed during training and the number of parameters used to fit them. These proxies have served as effective scaling knobs, but they are only loosely coupled to what actually matters for scientific and real-world deployment: which capabilities the model acquires, how robustly it can exercise them, and how gracefully it fails. Training on undifferentiated text streams with aggregate loss functions treats “all tokens as equal,” while downstream users care about highly structured task families such as multi-step reasoning, tool use, code synthesis, experimental planning, or negotiation. In this Perspective, we argue for a shift from token-centric to capability-centric training. We sketch a framework in which tasks are represented as nodes in a capability graph, with edges capturing prerequisites, compositional structure, and transfer relations. This representation enables new training regimes that explicitly target underdeveloped regions of capability space through curriculum design, data selection, and synthetic data generation. It also suggests system-level architectures in which different agents specialize on subgraphs, coordinated by a higher-level planner. We further discuss how capability-centric thinking reshapes evaluation, red-teaming, and safety: instead of aggregate benchmarks, we can reason about coverage, brittleness, and failure modes at the level of subgraphs and capability clusters. We conclude by outlining open challenges: constructing capability graphs from messy real-world interactions, maintaining them as models evolve, and preventing overfitting to the graph itself. We argue that the transition from tokens and parameters to tasks and capabilities will be central to the next phase of LLM research.
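To make the capability-graph idea concrete, the sketch below shows one possible minimal encoding in Python, assuming a directed graph whose nodes are task families and whose typed edges mark prerequisite, compositional, and transfer relations. The task names, edge types, proficiency scores, and the coverage-gap heuristic are illustrative assumptions for exposition, not the formalization proposed in the article.

# Illustrative sketch only: node names, edge types, and the scoring heuristic
# are assumptions for exposition, not the article's formal definition.
from dataclasses import dataclass, field

@dataclass
class CapabilityNode:
    name: str                 # task family, e.g. "multi-step reasoning"
    proficiency: float = 0.0  # measured success rate on held-out probes, in [0, 1]

@dataclass
class CapabilityGraph:
    nodes: dict = field(default_factory=dict)  # name -> CapabilityNode
    edges: list = field(default_factory=list)  # (src, dst, relation) triples

    def add_task(self, name, proficiency=0.0):
        self.nodes[name] = CapabilityNode(name, proficiency)

    def add_relation(self, src, dst, relation):
        # relation is one of: "prerequisite", "composes_into", "transfers_to"
        self.edges.append((src, dst, relation))

    def coverage_gaps(self, threshold=0.6):
        """Return tasks whose prerequisites are strong but which are themselves weak.

        These are the 'underdeveloped regions' that a capability-centric curriculum
        or synthetic-data pipeline would target next.
        """
        gaps = []
        for node in self.nodes.values():
            prereqs = [self.nodes[s] for s, d, r in self.edges
                       if d == node.name and r == "prerequisite"]
            prereqs_ready = all(p.proficiency >= threshold for p in prereqs)
            if prereqs and prereqs_ready and node.proficiency < threshold:
                gaps.append(node.name)
        return gaps

# Toy usage: tool use and code synthesis feed into experimental planning.
g = CapabilityGraph()
g.add_task("tool use", proficiency=0.8)
g.add_task("code synthesis", proficiency=0.7)
g.add_task("experimental planning", proficiency=0.3)
g.add_relation("tool use", "experimental planning", "prerequisite")
g.add_relation("code synthesis", "experimental planning", "prerequisite")
print(g.coverage_gaps())  # -> ['experimental planning']

Under these assumptions, a curriculum-design or data-selection loop would prioritize training examples that exercise the returned gap nodes, rather than sampling tokens uniformly from an undifferentiated stream.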