Grammar-Driven Text Segmentation for Context Understanding of Myanmar Language

Abstract

Text segmentation is a foundational process in NLP, but Myanmar presents a particularly difficult case due to its non-segmented, syllabic writing system and complex morphophonology. While syllable-level tokenization has seen notable progress, word- or meaningful phrase-level segmentation remains underexplored. Existing approaches largely rely on dictionary-based algorithms that ignore grammatical structure and contextual cues. This research proposes a new text segmentation method that integrates linguistic features such as conjunctions, post-particles, and whitespace usage, which carry implicit syntactic and semantic information. The proposed method produces stable and linguistically motivated segmentation across both formal and informal text styles, achieving an intrinsic boundary-level segmentation F1-score of 0.9 and a token-level segmentation F1-score of 0.8 across three different datasets. Extrinsic evaluation using named entity recognition (NER) as a downstream task demonstrates competitive performance across multiple entity types, with F1-scores of approximately 0.7 for DATE, NUM, LOC, PER, and TIME, even when applied without retraining the NER model. These results highlight the effectiveness and practical applicability of grammar-aware segmentation for Myanmar and underscore its potential for downstream NLP tasks in low-resource settings.
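
The abstract reports two intrinsic scores, a boundary-level F1 and a token-level F1, without defining them here. The sketch below shows one common way such scores are computed, and why the token-level figure tends to be lower: a single missed boundary costs one boundary but invalidates two gold tokens. The function names (`spans`, `f1`, `boundary_f1`, `token_f1`) and the use of character offsets are assumptions for illustration, not the authors' evaluation code; for Myanmar text the offsets might instead be counted in syllables.

```python
def spans(tokens):
    """Convert a token list into (start, end) offset spans over the joined text."""
    out, pos = [], 0
    for tok in tokens:
        out.append((pos, pos + len(tok)))
        pos += len(tok)
    return out


def f1(pred_items, gold_items):
    """F1 between two collections of items, compared as sets."""
    pred_items, gold_items = set(pred_items), set(gold_items)
    tp = len(pred_items & gold_items)
    if tp == 0:
        return 0.0
    precision = tp / len(pred_items)
    recall = tp / len(gold_items)
    return 2 * precision * recall / (precision + recall)


def boundary_f1(pred_tokens, gold_tokens):
    """Score internal boundaries only (end offsets of all tokens but the last)."""
    pred_b = [end for _, end in spans(pred_tokens)[:-1]]
    gold_b = [end for _, end in spans(gold_tokens)[:-1]]
    return f1(pred_b, gold_b)


def token_f1(pred_tokens, gold_tokens):
    """Score whole tokens: a token counts only if both of its edges match."""
    return f1(spans(pred_tokens), spans(gold_tokens))


# Placeholder Latin strings stand in for Myanmar segments (assumption).
gold = ["ab", "c", "de"]
pred = ["ab", "cde"]  # one gold boundary missed

boundary_score = boundary_f1(pred, gold)
token_score = token_f1(pred, gold)
```

In this toy case the predicted segmentation misses one of two internal boundaries, so the boundary-level score is about 0.67 while the token-level score drops to 0.4, since only one of the three gold tokens is recovered exactly.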