Integrating HPSG (Head-driven Phrase Structure Grammar) with Neural Parsing for Bengali

Maneesha Rani Biswas

Abstract

In this paper, we develop a neural parser for Bangla based on a simplified Head-driven Phrase Structure Grammar (HPSG). The first stage in natural language processing is to break the text down into separate tokens. When the text corpus is very large, a vocabulary that covers every word becomes inefficiently large. The effectiveness of a given tokenization method depends on several factors, such as the size of the dataset, the nature of the task, and the morphological complexity of the language. Because no HPSG-compliant treebank exists for Bangla, we took syntactically annotated resources from existing Bangla corpora and aligned them with simplified HPSG through rule-based restructuring and data permutation. We then modified a neural parser architecture originally designed for the Penn Treebank, replacing its encoder with multilingual pre-trained models such as XLM-RoBERTa and IndicBERT to better capture the syntactic and lexical entries of Bangla. We evaluated the parser on the modified dataset, where it showed promising results in both constituency and dependency parsing. In our experiments, the simplified HPSG neural parser achieved a new state of the art for constituency parsing when using the same predicted part-of-speech (POS) tags as the self-attentive constituency parser. It also outperformed previous studies in dependency parsing, with a higher Unlabeled Attachment Score (UAS). However, its Labeled Attachment Score (LAS) remained lower, likely as a consequence of integrating HPSG with neural approaches for Bangla syntactic parsing, which underscores the importance of linguistically informed treebank development for low-resource languages. Lastly, the findings suggest that simplified HPSG should be given more attention by linguistic experts when developing treebanks for Bangla Natural Language Processing (BNLP).
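The two dependency metrics named in the abstract differ only in whether the dependency label must match. The following is a minimal sketch of how UAS and LAS are conventionally computed, not the paper's evaluation code; the function name `attachment_scores`, the toy sentence, and the UD-style labels are all illustrative assumptions.

```python
from typing import List, Tuple

# One arc per token: (head index, dependency label); head 0 denotes the root.
# The labels below are hypothetical UD-style tags, not taken from the paper.
Arc = Tuple[int, str]


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) for one sentence or a concatenated corpus."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted analyses must cover the same tokens")
    if not gold:
        raise ValueError("cannot score an empty analysis")
    # UAS: the predicted head matches the gold head.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS: both the head and the label match.
    labeled_hits = sum(g == p for g, p in zip(gold, pred))
    total = len(gold)
    return head_hits / total, labeled_hits / total


# Toy four-token example (hypothetical data).
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (3, "amod")]
pred = [(2, "nsubj"), (0, "root"), (2, "iobj"), (2, "amod")]
uas, las = attachment_scores(gold, pred)
print(f"UAS={uas:.2f} LAS={las:.2f}")
```

Because LAS requires a correct label on top of a correct head, it can never exceed UAS, so a parser can attach tokens well and still score lower on LAS when its label inventory does not line up with the treebank's.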

Version published to 10.21203/rs.3.rs-8056398/v1 on Research Square
Jan 16, 2026

Related articles

Part-of-Speech Tagging for the Kangri Language Using CRF and BiLSTM Models: A Comprehensive Comparative Study

This article has 1 author:
1. Prateek Kaushal
This article has no evaluations
Latest version Jan 6, 2026

Grammar-Driven Text Segmentation for Context Understanding of Myanmar Language

This article has 3 authors:
1. myo thida
2. Nu Wei Thet
3. Thein Kyaw LWIN
This article has no evaluations
Latest version Jan 23, 2026

Coreference Resolution for Amharic Text using Bidirectional Encoder Representation from Transformer (BERT)

This article has 2 authors:
1. Lingerew Bantie
2. Yaregal Assabie
This article has no evaluations
Latest version Jan 12, 2026
