Grammar-Driven Text Segmentation for Context Understanding of Myanmar Language
Abstract
Text segmentation is a foundational process in NLP, but Myanmar presents a particularly difficult case due to its non-segmented, syllabic writing system and complex morphophonology. While syllable-level tokenization has seen notable progress, word- or meaningful phrase-level segmentation remains underexplored. Existing approaches largely rely on dictionary-based algorithms that ignore grammatical structure and contextual cues. This research proposes a new text segmentation method that integrates linguistic features such as conjunctions, post-particles, and whitespace usage, which carry implicit syntactic and semantic information. The proposed method produces stable and linguistically motivated segmentation across both formal and informal text styles, achieving an intrinsic boundary-level segmentation F1-score of 0.9 and a token-level segmentation F1-score of 0.8 across three different datasets. Extrinsic evaluation using named entity recognition (NER) as a downstream task demonstrates competitive performance across multiple entity types, with F1-scores of approximately 0.7 for DATE, NUM, LOC, PER, and TIME, even when applied without retraining the NER model. These results highlight the effectiveness and practical applicability of grammar-aware segmentation for Myanmar and underscore its potential for downstream NLP tasks in low-resource settings.
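The abstract reports two intrinsic scores, a boundary-level F1 and a token-level F1, without defining them here. The sketch below shows one common way such scores are computed; it is an assumption about the metric definitions, not the authors' evaluation code. Boundary-level F1 compares the sets of internal cut positions, while token-level F1 credits a token only when both its start and end match the gold segmentation. The function names and the Latin placeholder strings are hypothetical, and offsets are counted in Unicode code points for simplicity (a Myanmar evaluation might count syllables instead).

```python
from itertools import accumulate


def _ends(tokens):
    """Cumulative end offsets (in code points) of each token."""
    return list(accumulate(len(t) for t in tokens))


def boundaries(tokens):
    """Internal cut positions; the final offset is excluded because it is always shared."""
    return set(_ends(tokens)[:-1])


def spans(tokens):
    """(start, end) offsets of every token."""
    ends = _ends(tokens)
    starts = [0] + ends[:-1]
    return set(zip(starts, ends))


def f1(pred, gold):
    """F1 between two sets of predicted and gold items."""
    if not pred and not gold:
        return 1.0  # nothing to find and nothing predicted
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def _check_same_text(pred_tokens, gold_tokens):
    """Both segmentations must cover the same underlying string."""
    if "".join(pred_tokens) != "".join(gold_tokens):
        raise ValueError("segmentations do not cover the same text")


def boundary_f1(pred_tokens, gold_tokens):
    _check_same_text(pred_tokens, gold_tokens)
    return f1(boundaries(pred_tokens), boundaries(gold_tokens))


def token_f1(pred_tokens, gold_tokens):
    _check_same_text(pred_tokens, gold_tokens)
    return f1(spans(pred_tokens), spans(gold_tokens))


# Placeholder strings standing in for an unsegmented sentence.
gold = ["ab", "cd", "ef"]
pred = ["ab", "cdef"]  # one gold boundary is missed

print(round(boundary_f1(pred, gold), 3))  # 0.667
print(round(token_f1(pred, gold), 3))     # 0.4
```

In this toy case a single missed boundary costs one boundary match but invalidates two gold tokens, so the token-level score falls further than the boundary-level score. Under these definitions, that is consistent with a token-level F1 sitting below the boundary-level F1, as in the figures reported above.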