An Empirical Study of Compute-Efficient Token–Parameter Scaling in Small Language Models (1M–20M)

Abstract

Recent scaling-law studies have shown that language model performance depends critically on how training compute is allocated between model capacity and data scale. While prior work has extensively explored these relationships at large parameter scales, the behavior of scaling laws in the small language model (SLM) regime remains comparatively underexplored. In this work, we conduct a systematic empirical study of compute allocation in Transformer-based language models Vaswani et al. [2017] ranging from 1.6M to 18.4M parameters, trained under fixed token budgets on two datasets with contrasting characteristics: arXiv scientific text and the synthetic TinyStories corpus. By holding architectural depth and optimization settings constant and varying the embedding dimension, we systematically examine token–parameter scaling ratios from 5× to 40×. Due to compute constraints, each configuration was trained once; results should be interpreted as indicative rather than statistically definitive. Across both datasets, we observe that a 10:1 token-to-parameter ratio achieves the lowest validation loss per unit compute within our evaluation grid, consuming approximately 5.56 PFLOPs. Models trained at extreme ratios exhibit two distinct failure modes: capacity bottlenecks at 40× (1.6M parameters) and undertraining at 5× (18.4M parameters). On the arXiv dataset the 10× model outperforms the 5× configuration by 0.0280 absolute validation loss, whereas on the TinyStories dataset the gap is only 0.0027.
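For context, one common way to connect the token-to-parameter ratios discussed in the abstract to a training-compute budget is the standard dense-Transformer approximation C ≈ 6·N·D, where N is the parameter count and D the number of training tokens. The sketch below illustrates that relationship for the two endpoint configurations named in the abstract; the helper name training_flops and the 6·N·D heuristic are illustrative assumptions and may not match the authors' exact FLOP accounting (the reported ~5.56 PFLOPs refers to the 10× configuration).

```python
# Minimal sketch (assumption: the common C ~= 6 * N * D estimate for
# dense Transformer training compute; not necessarily the paper's method).

def training_flops(n_params: float, tokens_per_param: float) -> float:
    """Approximate training FLOPs for a model with n_params parameters
    trained at a given token-to-parameter ratio."""
    n_tokens = tokens_per_param * n_params
    return 6.0 * n_params * n_tokens  # 6*N*D heuristic

if __name__ == "__main__":
    # Endpoint configurations stated in the abstract: 1.6M params at 40x
    # and 18.4M params at 5x tokens per parameter.
    for n_params, ratio in [(1.6e6, 40), (18.4e6, 5)]:
        c = training_flops(n_params, ratio)
        print(f"N={n_params/1e6:.1f}M, ratio={ratio}x -> ~{c/1e15:.2f} PFLOPs")
```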