A Comparative Analysis of Tokenization Methods for Sinhala Natural Language Processing

Abstract

Tokenization is a foundational step in Natural Language Processing (NLP), yet its impact on morphologically rich, low-resource languages like Sinhala is not well understood. This paper presents a systematic evaluation of five tokenization strategies—Byte, Character, Grapheme Cluster, WordPiece, and Word-level—to determine their effect on downstream task performance and computational efficiency. We train and assess Transformer-based models on four datasets: a clean baseline, and three variants synthetically corrupted with minor typos, aggressive typos, and code-mixing to simulate real-world text. Our results reveal a critical trade-off. Word-level tokenization achieves the highest F1-score (0.727) on clean text and is the most computationally efficient, but its performance degrades significantly on noisy text. Conversely, WordPiece demonstrates superior robustness, maintaining high performance across all conditions, making it the most reliable choice for real-world applications, albeit at a higher computational cost. Grapheme Cluster tokenization emerges as a strong, balanced alternative. This study provides crucial empirical evidence to guide the selection of tokenizers for Sinhala NLP, establishing a baseline for performance, robustness, and efficiency.
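To make the contrast between the sub-word-free strategies concrete, the following minimal Python sketch (illustrative only, not the paper's code) shows how a single Sinhala word splits under byte-, character (code point)-, and grapheme-cluster-level tokenization. The example word and the use of the third-party `regex` package, whose `\X` pattern matches extended grapheme clusters, are assumptions for illustration.

```python
# Minimal sketch (not the paper's implementation): contrast byte-, character-,
# and grapheme-cluster-level tokenization on one Sinhala word.
# Assumes the third-party `regex` package (pip install regex) for \X matching.
import regex

word = "කොළඹ"  # "Colombo": 4 code points, one of them a dependent vowel sign

# Byte-level: every UTF-8 byte is a token; Sinhala letters take 3 bytes each,
# so byte sequences are about three times longer than the code-point view.
byte_tokens = [f"{b:02x}" for b in word.encode("utf-8")]

# Character-level: every Unicode code point is a token; the vowel sign ො is
# split away from its base consonant ක.
char_tokens = list(word)

# Grapheme-cluster-level: user-perceived characters stay intact, so the base
# consonant and its dependent vowel sign form a single token කො.
grapheme_tokens = regex.findall(r"\X", word)

print(len(byte_tokens), byte_tokens)          # 12 byte tokens
print(len(char_tokens), char_tokens)          # 4 code-point tokens
print(len(grapheme_tokens), grapheme_tokens)  # 3 grapheme-cluster tokens
```

Word-level and WordPiece tokenizers would instead operate on whitespace-delimited units (splitting them further into learned sub-words in the WordPiece case), which is where the robustness-versus-efficiency trade-off reported in the abstract arises.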