Significance-Gain Pair Encoding for LLMs: A Statistical Alternative to Frequency-Based Subword Merging


Abstract

Subword tokenization is a key design choice for modern language models, including large language models (LLMs), with byte/character-level BPE serving as a widely used baseline. Standard BPE selects merges by raw pair frequency, which favors compression but can conflate true adjacency cohesion with pairs that are frequent due to high marginal counts. We introduce Significance-Gain BPE, a drop-in alternative merge criterion that measures cohesion via a z-statistic under an independence null model and combines it with an explicit compression-aware gain term. We evaluate Significance-Gain BPE on WikiText-103 (raw) character slices using a small causal Transformer language model and report both token-dependent perplexity and the tokenizer-invariant metric bits-per-character (BPC). At a representative operating point, Significance-Gain BPE reduces validation and test perplexity by 13% and 12%, respectively, and improves validation and test BPC by ∼0.9–1.0%. A vocabulary-size sweep further shows lower BPC in most closest-compression comparisons, suggesting that statistically grounded merge selection can improve predictive efficiency per unit of raw text across a range of compression regimes.
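The abstract's merge criterion can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the exact form of the z-statistic, the variance estimate, and the way the cohesion and compression terms are combined (here a `z * log(1 + gain)` product) are all assumptions for clarity. It scores each adjacent token pair by how far its observed count exceeds the count expected under an independence null, and weights that by a gain term reflecting tokens saved if the pair is merged.

```python
import math
from collections import Counter

def significance_gain_scores(tokens, alpha=1.0):
    """Score candidate merges by a z-statistic under an independence
    null model, combined with a compression-aware gain term.

    NOTE: illustrative sketch only. The paper's exact scoring function,
    variance model, and combination rule are not specified here and are
    assumed for this example.
    """
    unigrams = Counter(tokens)
    pairs = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)
    n = total - 1  # number of adjacent positions
    scores = {}
    for (a, b), obs in pairs.items():
        p_a = unigrams[a] / total
        p_b = unigrams[b] / total
        expected = n * p_a * p_b           # expected count if a, b were independent
        var = expected * (1 - p_a * p_b)   # binomial variance under the null
        z = (obs - expected) / math.sqrt(var) if var > 0 else 0.0
        gain = obs                         # tokens removed if (a, b) is merged once per occurrence
        # Assumed combination of cohesion (z) and compression gain:
        scores[(a, b)] = z * math.log1p(gain) ** alpha
    return scores

def best_merge(tokens):
    """Return the highest-scoring candidate pair under the sketch above."""
    scores = significance_gain_scores(tokens)
    return max(scores, key=scores.get)
```

In this toy setup, a pair like `('a', 'b')` in a string dominated by repeated `"ab"` scores highly because its observed count far exceeds the independence expectation and merging it removes many tokens, whereas a pair that is frequent only because its members have high marginal counts receives a small z and is demoted, which is the distinction the abstract draws against raw-frequency BPE.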
