Grammar-Driven Text Segmentation for Context Understanding of Myanmar Language

Myo Thida
Nu Wei Thet
Thein Kyaw Lwin

Abstract

Text segmentation is a foundational process in NLP, but Myanmar presents a particularly difficult case due to its non-segmented, syllabic writing system and complex morphophonology. While syllable-level tokenization has seen notable progress, word- or meaningful phrase-level segmentation remains underexplored. Existing approaches largely rely on dictionary-based algorithms that ignore grammatical structure and contextual cues. This research proposes a new text segmentation method that integrates linguistic features such as conjunctions, post-particles, and whitespace usage, which carry implicit syntactic and semantic information. The proposed method produces stable and linguistically motivated segmentation across both formal and informal text styles, achieving an intrinsic boundary-level segmentation F1-score of 0.9 and a token-level segmentation F1-score of 0.8 across three different datasets. Extrinsic evaluation using named entity recognition (NER) as a downstream task demonstrates competitive performance across multiple entity types, with F1-scores of approximately 0.7 for DATE, NUM, LOC, PER, and TIME, even when applied without retraining the NER model. These results highlight the effectiveness and practical applicability of grammar-aware segmentation for Myanmar and underscore its potential for downstream NLP tasks in low-resource settings.
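The abstract reports both a boundary-level and a token-level F1-score but does not define them. The sketch below shows one common way to compute each, assuming both segmentations cover the same character sequence: boundary-level F1 compares the sets of internal split positions, while token-level F1 requires both the start and the end of a token to match. The function names and the Latin placeholder strings are illustrative assumptions, not the authors' evaluation code.

```python
from itertools import accumulate


def token_spans(tokens):
    """Set of (start, end) character spans, one per token."""
    ends = list(accumulate(len(t) for t in tokens))
    starts = [0] + ends[:-1]
    return set(zip(starts, ends))


def boundaries(tokens):
    """Set of internal boundary offsets; the end-of-text offset is excluded."""
    ends = list(accumulate(len(t) for t in tokens))
    return set(ends[:-1])


def f1(gold, pred):
    """F1 between two sets, from precision and recall on their overlap."""
    true_positives = len(gold & pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def segmentation_f1(gold_tokens, pred_tokens):
    """Return (boundary_f1, token_f1) for two segmentations of the same text."""
    if "".join(gold_tokens) != "".join(pred_tokens):
        raise ValueError("gold and predicted tokens must cover the same text")
    boundary_f1 = f1(boundaries(gold_tokens), boundaries(pred_tokens))
    token_f1 = f1(token_spans(gold_tokens), token_spans(pred_tokens))
    return boundary_f1, token_f1


# Placeholder strings stand in for Myanmar text (assumption for illustration).
gold = ["ab", "cd", "e"]
pred = ["ab", "c", "de"]
boundary_f1, token_f1 = segmentation_f1(gold, pred)
print(f"boundary F1 = {boundary_f1:.2f}, token F1 = {token_f1:.2f}")
```

Under these definitions a single misplaced boundary costs two tokens, so token-level F1 is never higher than boundary-level F1 in practice, which is consistent with the gap between the two scores reported above.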

Version published to 10.21203/rs.3.rs-8408602/v1 on Research Square
Jan 23, 2026
