Morphological-Core Tokenization: A Novel Approach to Preserve Semantic Integrity in Large Language Models

Abstract

Subword tokenization techniques like Byte-Pair Encoding (BPE) are foundational to modern Large Language Models (LLMs), yet they frequently fragment words into non-intuitive, semantically hollow units. This unnatural segmentation poses significant linguistic and semantic challenges, forcing models to expend valuable learning capacity on deciphering basic morphology. This paper introduces Morphological-Core Tokenization (MCT), a novel hybrid tokenization algorithm designed to mitigate these issues. MCT operates by first identifying and preserving the morphological core of a word using a linguistically informed database, and then applying a constrained subword segmentation to the remaining affixes. To enhance robustness, we also introduce a stochastic dropout mechanism that regularizes the model's dependency on the morphological analyzer. We perform experiments comparing MCT to BPE, WordPiece, and a token-free baseline (ByT5) on tasks that require deep morphological understanding. Our results demonstrate that MCT yields a more semantically coherent tokenization, leading to notable performance gains. On the German-to-English WMT14 translation task, our MCT-based model achieves a 1.5-point higher BLEU score than BPE and outperforms the ByT5 baseline while being computationally more efficient. Ablation studies show that the method is reasonably robust to imperfections in the morphological analyzer. This work underscores the value of linguistically informed tokenization as a middle ground between purely statistical subwords and computationally intensive token-free models.
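To make the two-stage pipeline concrete, the following is a minimal Python sketch of the procedure the abstract describes: look up a word's morphological core, keep it as a single token, and segment the surrounding affixes with a fallback subword segmenter, with stochastic dropout occasionally bypassing the analyzer. The names `morph_db`, `bpe_segment`, and `dropout_p` are illustrative assumptions, not the authors' implementation.

```python
import random

def mct_tokenize(word, morph_db, bpe_segment, dropout_p=0.1):
    """Hypothetical sketch of Morphological-Core Tokenization (MCT):
    preserve the morphological core, subword-segment the affixes."""
    core = morph_db.get(word)  # e.g. "play" as the core of "replaying"
    # Stochastic dropout: sometimes ignore the analyzer so the model does
    # not become overly dependent on it (the regularization in the paper).
    if core is None or random.random() < dropout_p:
        return bpe_segment(word)  # fall back to plain subword segmentation
    start = word.find(core)
    prefix, suffix = word[:start], word[start + len(core):]
    tokens = []
    if prefix:
        tokens.extend(bpe_segment(prefix))  # constrained segmentation of the prefix
    tokens.append(core)                     # the core survives as one token
    if suffix:
        tokens.extend(bpe_segment(suffix))  # constrained segmentation of the suffix
    return tokens

# Toy usage with a character-level fallback segmenter:
morph_db = {"replaying": "play", "unhappiness": "happy"}
print(mct_tokenize("replaying", morph_db, bpe_segment=list, dropout_p=0.0))
# -> ['r', 'e', 'play', 'i', 'n', 'g']
```

In practice `bpe_segment` would be a trained BPE or WordPiece model restricted to the affix strings; the character-level fallback above is only for demonstration.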
