Re-Thinking Training Data: From Tokens and Parameters to Tasks and Capabilities
Abstract
Current large language model (LLM) practice is organized around two primary axes: the number of tokens consumed during training and the number of parameters used to fit them. These proxies have served as effective scaling knobs, but they are only loosely coupled to what actually matters for scientific and real-world deployment: which capabilities the model acquires, how robustly it can exercise them, and how gracefully it fails. Training on undifferentiated text streams with aggregate loss functions treats “all tokens as equal,” while downstream users care about highly structured task families such as multi-step reasoning, tool use, code synthesis, experimental planning, or negotiation. In this Perspective, we argue for a shift from token-centric to capability-centric training. We sketch a framework in which tasks are represented as nodes in a capability graph, with edges capturing prerequisites, compositional structure, and transfer relations. This representation enables new training regimes that explicitly target underdeveloped regions of capability space through curriculum design, data selection, and synthetic data generation. It also suggests system-level architectures in which different agents specialize on subgraphs, coordinated by a higher-level planner. We further discuss how capability-centric thinking reshapes evaluation, red-teaming, and safety: instead of aggregate benchmarks, we can reason about coverage, brittleness, and failure modes at the level of subgraphs and capability clusters. We conclude by outlining open challenges: constructing capability graphs from messy real-world interactions, maintaining them as models evolve, and preventing overfitting to the graph itself. We argue that the transition from tokens and parameters to tasks and capabilities will be central to the next phase of LLM research.
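To make the capability-graph idea concrete, the sketch below shows one possible minimal encoding in Python, assuming a directed graph whose nodes are task families and whose typed edges mark prerequisite, compositional, and transfer relations. The task names, edge types, proficiency scores, and the coverage-gap heuristic are illustrative assumptions for exposition, not the formalization proposed in the article.

# Illustrative sketch only: node names, edge types, and the scoring heuristic
# are assumptions for exposition, not the article's formal definition.
from dataclasses import dataclass, field

@dataclass
class CapabilityNode:
    name: str                 # task family, e.g. "multi-step reasoning"
    proficiency: float = 0.0  # measured success rate on held-out probes, in [0, 1]

@dataclass
class CapabilityGraph:
    nodes: dict = field(default_factory=dict)  # name -> CapabilityNode
    edges: list = field(default_factory=list)  # (src, dst, relation) triples

    def add_task(self, name, proficiency=0.0):
        self.nodes[name] = CapabilityNode(name, proficiency)

    def add_relation(self, src, dst, relation):
        # relation is one of: "prerequisite", "composes_into", "transfers_to"
        self.edges.append((src, dst, relation))

    def coverage_gaps(self, threshold=0.6):
        """Return tasks whose prerequisites are strong but which are themselves weak.

        These are the 'underdeveloped regions' that a capability-centric curriculum
        or synthetic-data pipeline would target next.
        """
        gaps = []
        for node in self.nodes.values():
            prereqs = [self.nodes[s] for s, d, r in self.edges
                       if d == node.name and r == "prerequisite"]
            prereqs_ready = all(p.proficiency >= threshold for p in prereqs)
            if prereqs and prereqs_ready and node.proficiency < threshold:
                gaps.append(node.name)
        return gaps

# Toy usage: tool use and code synthesis feed into experimental planning.
g = CapabilityGraph()
g.add_task("tool use", proficiency=0.8)
g.add_task("code synthesis", proficiency=0.7)
g.add_task("experimental planning", proficiency=0.3)
g.add_relation("tool use", "experimental planning", "prerequisite")
g.add_relation("code synthesis", "experimental planning", "prerequisite")
print(g.coverage_gaps())  # -> ['experimental planning']

Under these assumptions, a curriculum-design or data-selection loop would prioritize training examples that exercise the returned gap nodes, rather than sampling tokens uniformly from an undifferentiated stream.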