Integrating HPSG (Head-driven Phrase Structure Grammar) with Neural Parsing for Bengali
Abstract
In this paper, we develop a neural parser for Bangla based on a simplified Head-driven Phrase Structure Grammar (HPSG) combined with neural network-based models. The initial stage in natural language processing is to break the text into separate tokens. When the text corpus is large, covering every word is inefficient in terms of vocabulary size. The effectiveness of a given tokenization method depends on several factors, such as the size of the dataset, the nature of the task, and the morphological complexity of the dataset. Because no HPSG-compliant treebank exists for Bangla, we took syntactically annotated resources from existing Bangla corpora and modified them to align with simplified HPSG through rule-based restructuring and data permutation. We then modified a neural parser architecture originally designed for the Penn Treebank, replacing its encoder with multilingual pre-trained models such as XLM-RoBERTa and IndicBERT to better capture the syntactic and lexical entries of Bangla. We evaluated the parser on the modified dataset, and it demonstrated promising results in both constituency and dependency parsing tasks. Our extensive experiments showed that the simplified HPSG Neural Parser achieved a new state of the art for constituency parsing when using the same predicted part-of-speech (POS) tags as the self-attentive constituency parser. It also outperformed previous studies in dependency parsing, with a higher Unlabeled Attachment Score (UAS). However, its Labeled Attachment Score (LAS) remained lower, likely a consequence of integrating HPSG with neural approaches for Bangla syntax parsing, which underscores the importance of linguistically informed treebank development for low-resource languages. Lastly, our findings suggest that simplified HPSG should be given more attention by linguistic experts when developing treebanks for Bangla Natural Language Processing (BNLP).
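The contrast between the UAS and LAS results reported above is easier to read with the two metrics side by side. The following is a minimal sketch of the standard definitions, not the evaluation script used in this work: the function name `attachment_scores`, the `Arc` type, and the toy sentence with its dependency labels are all hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# One arc per token: (index of the head token, dependency label).
# Index 0 stands for the artificial root. This representation is an
# assumption made for the sketch, not the format used in the paper.
Arc = Tuple[int, str]


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) for one sentence or a flattened corpus."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted arcs must align token by token")
    if not gold:
        raise ValueError("cannot score an empty sequence")
    # UAS counts a token as correct when its head is correct.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS additionally requires the dependency label to match.
    labeled_hits = sum(g == p for g, p in zip(gold, pred))
    n = len(gold)
    return head_hits / n, labeled_hits / n


# Hypothetical four-token example: token 3 has the right head but the
# wrong label, and token 4 has the wrong head.
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "punct")]
pred = [(2, "nsubj"), (0, "root"), (2, "obl"), (3, "punct")]
uas, las = attachment_scores(gold, pred)
print(f"UAS={uas:.2f} LAS={las:.2f}")  # UAS=0.75 LAS=0.50
```

Because every labeled hit is also a head hit, LAS can never exceed UAS. A parser that attaches tokens to the right heads but mislabels the relations will show exactly the pattern described above: a high UAS alongside a lower LAS.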