An Empirical Study of Compute-Efficient Token–Parameter Scaling in Small Language Models (1M–20M)

Abstract

Recent scaling-law studies have shown that language model performance depends critically on how training compute is allocated between model capacity and data scale. While prior work has extensively explored these relationships at large parameter scales, the behavior of scaling laws in the small language model (SLM) regime remains comparatively underexplored. In this work, we conduct a systematic empirical study of compute allocation in Transformer-based language models Vaswani et al. [2017] ranging from 1.6M to 18.4M parameters, trained under fixed token budgets on two datasets with contrasting characteristics: arXiv scientific text and the synthetic TinyStories corpus. By holding architectural depth and optimization settings constant and varying the embedding dimension, we systematically examine token–parameter scaling ratios from 5× to 40×. Due to compute constraints, each configuration was trained once; results should be interpreted as indicative rather than statistically definitive. Across both datasets, we observe that a 10:1 token-to-parameter ratio achieves the lowest validation loss per unit compute within our evaluation grid, consuming approximately 5.56 PFLOPs. Models trained at extreme ratios exhibit two distinct failure modes: capacity bottlenecks at 40× (1.6M parameters) and undertraining at 5× (18.4M parameters). On the arXiv dataset the 10× model outperforms the 5× configuration by 0.0280 absolute validation loss, whereas on the TinyStories dataset the gap is only 0.0027.
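For context, one common way to connect the token-to-parameter ratios discussed in the abstract to a training-compute budget is the standard dense-Transformer approximation C ≈ 6·N·D, where N is the parameter count and D the number of training tokens. The sketch below illustrates that relationship for the two endpoint configurations named in the abstract; the helper name training_flops and the 6·N·D heuristic are illustrative assumptions and may not match the authors' exact FLOP accounting (the reported ~5.56 PFLOPs refers to the 10× configuration).

```python
# Minimal sketch (assumption: the common C ~= 6 * N * D estimate for
# dense Transformer training compute; not necessarily the paper's method).

def training_flops(n_params: float, tokens_per_param: float) -> float:
    """Approximate training FLOPs for a model with n_params parameters
    trained at a given token-to-parameter ratio."""
    n_tokens = tokens_per_param * n_params
    return 6.0 * n_params * n_tokens  # 6*N*D heuristic

if __name__ == "__main__":
    # Endpoint configurations stated in the abstract: 1.6M params at 40x
    # and 18.4M params at 5x tokens per parameter.
    for n_params, ratio in [(1.6e6, 40), (18.4e6, 5)]:
        c = training_flops(n_params, ratio)
        print(f"N={n_params/1e6:.1f}M, ratio={ratio}x -> ~{c/1e15:.2f} PFLOPs")
```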