Tokens with Meaning: A Hybrid Tokenization Approach for NLP

Abstract

Tokenization plays a pivotal role in natural language processing (NLP), shaping how textual data is segmented, interpreted, and processed by language models. Despite the success of subword-based tokenization techniques such as Byte Pair Encoding (BPE) and WordPiece, these methods often fall short in morphologically rich and agglutinative languages because they rely on statistical frequency rather than linguistic structure. This paper introduces a linguistically informed hybrid tokenization framework that integrates rule-based morphological analysis with statistical subword segmentation to address these limitations. The proposed approach leverages phonological normalization, root-affix dictionaries, and a novel tokenization algorithm that balances morpheme preservation with vocabulary efficiency. It assigns shared identifiers to phonologically variant affixes (e.g., "-ler" and "-lar") and phonologically altered root forms (e.g., "kitap" vs. "kitabı"), significantly reducing redundancy while maintaining semantic integrity. The framework also incorporates special tokens for whitespace and orthographic case, including a dedicated uppercase marker token to prevent vocabulary inflation from capitalization. Byte Pair Encoding is integrated to support out-of-vocabulary coverage without compromising morphological coherence. Evaluation on TR-MMLU, a large-scale, Turkish-specific NLP benchmark, demonstrates that the proposed tokenizer achieves the highest Turkish Token Percentage (90.29%) and Pure Token Percentage (85.8%) among all tested models. Comparative analysis against widely used tokenizers from models such as LLaMA, Gemma, and OpenAI's GPT shows that the proposed method yields more linguistically meaningful and semantically coherent tokens. A qualitative case study further illustrates improved morpheme segmentation and interpretability in complex Turkish sentences. Although the implementation focuses on Turkish, the underlying methodology is language-independent and adaptable to other languages. This work contributes to ongoing efforts to improve tokenizer design through linguistic alignment, offering a practical and extensible solution for enhancing both interpretability and performance in multilingual NLP systems.
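
As a rough illustration of the mechanism the abstract describes, the sketch below shows how a hybrid tokenizer might assign shared identifiers to allomorphic affixes and phonologically altered roots, emit a case-marking token for capitalized words, and defer unanalyzable material to a BPE fallback. All names, IDs, dictionary entries, and the greedy segmentation strategy are illustrative assumptions, not the authors' actual implementation or vocabulary.

```python
# Minimal sketch of the hybrid tokenization idea, under assumed data structures.

# Shared IDs for phonologically variant affixes: "-ler" and "-lar" map to one ID.
AFFIX_IDS = {"ler": 1, "lar": 1, "i": 2, "ı": 2, "de": 3, "da": 3}

# Shared IDs for roots and their phonologically altered surface forms
# (e.g. "kitap" / "kitab" after consonant softening).
ROOT_IDS = {"kitap": 100, "kitab": 100, "ev": 101}

UPPERCASE_ID = 4   # special token marking an uppercased word
UNKNOWN = None     # placeholder: a real system would fall back to BPE here


def tokenize(word):
    """Greedy root+affix segmentation with shared allomorph IDs."""
    ids = []
    if word[0].isupper():
        ids.append(UPPERCASE_ID)      # case marker instead of a new vocab entry
        word = word.lower()

    # Find the longest known root prefix.
    root = None
    for end in range(len(word), 0, -1):
        if word[:end] in ROOT_IDS:
            root = word[:end]
            break
    if root is None:
        return ids + [UNKNOWN]        # whole word goes to the BPE fallback

    ids.append(ROOT_IDS[root])
    rest = word[len(root):]

    # Greedily strip known affixes from the remainder.
    while rest:
        for end in range(len(rest), 0, -1):
            if rest[:end] in AFFIX_IDS:
                ids.append(AFFIX_IDS[rest[:end]])
                rest = rest[end:]
                break
        else:
            ids.append(UNKNOWN)       # unanalyzable residue -> BPE fallback
            break
    return ids


# "Kitaplar" and "kitabı" share the same root ID despite surface differences.
print(tokenize("Kitaplar"))  # [4, 100, 1]
print(tokenize("kitabı"))    # [100, 2]
```

Mapping "-ler"/"-lar" and "kitap"/"kitab" to shared identifiers is what keeps the vocabulary compact while preserving morpheme boundaries; in the actual framework, material the morphological analyzer cannot segment would be handled by trained BPE merges rather than a placeholder ID.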
