Leveraging Pāṇinian Grammar and Neural Models for Morphologically Rich Sanskrit NLP

Abstract

Sanskrit’s rule-based grammatical precision and morphological richness make it a compelling foundation for linguistically informed Natural Language Processing (NLP) systems. Rooted in Pāṇini’s Aṣṭādhyāyī, the language encodes syntactic and semantic relations directly within its word forms, offering structural advantages over languages typically processed with purely statistical models. This study introduces a hybrid framework that integrates symbolic Pāṇinian grammar with neural architectures to improve preprocessing and downstream language understanding. Specifically, sandhi splitting (euphonic decomposition) is re-engineered as an alternative to conventional stopword removal, preserving semantic integrity while improving feature granularity. The framework combines rule-based segmentation (sanskrit_parser) with data-driven sequence models (CharSS and ByT5), evaluated on the SandhiKosh benchmark and the Digital Corpus of Sanskrit for word segmentation, sentiment classification, and syntactic parsing. Experimental results show that sandhi-aware preprocessing yields up to 8 percent higher F1 scores than conventional pipelines, confirming the synergistic potential of grammatical formalism and deep learning. Beyond advancing Sanskrit NLP, the proposed approach offers a transferable methodology for other morphologically rich, low-resource languages, bridging ancient linguistic theory with modern computational intelligence.
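
To make the hybrid pipeline concrete, the sketch below shows one way rule-based and neural sandhi splitting could be combined. It is illustrative only: it assumes the Parser API of the open-source sanskrit_parser package (exact signatures may vary by version), uses a hypothetical fine-tuned ByT5 checkpoint name (byt5-sandhi-finetuned), and the combination heuristic is an assumption of ours, not the paper’s stated method.

```python
# Minimal sketch of a hybrid sandhi-splitting step: enumerate splits licensed
# by Paninian rules, generate a neural hypothesis with ByT5, and keep the
# neural output only when the grammar also licenses it.

from sanskrit_parser import Parser  # rule-based Paninian segmenter (assumed API)
from transformers import AutoTokenizer, T5ForConditionalGeneration


def rule_based_splits(text: str, limit: int = 5) -> list[str]:
    """Enumerate candidate sandhi splits via sanskrit_parser."""
    parser = Parser(input_encoding="SLP1", output_encoding="SLP1")
    # parser.split() yields Split objects that render as word-form lists;
    # it may return None when no grammatical split exists.
    splits = parser.split(text, limit=limit) or []
    return [str(split) for split in splits]


def neural_split(text: str, model_name: str = "byt5-sandhi-finetuned") -> str:
    """Treat sandhi splitting as byte-level seq2seq generation with ByT5.

    `model_name` is a HYPOTHETICAL checkpoint; a real pipeline would first
    fine-tune google/byt5-small on (compound, split) pairs, e.g. SandhiKosh.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = T5ForConditionalGeneration.from_pretrained(model_name)
    inputs = tokenizer(text, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=64)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


def hybrid_split(text: str) -> str:
    """Illustrative combination heuristic (our assumption): accept the neural
    hypothesis only if the rule-based grammar also produces it; otherwise
    back off to the top-ranked rule-based split."""
    candidates = rule_based_splits(text)
    hypothesis = neural_split(text)
    if hypothesis in candidates:
        return hypothesis
    return candidates[0] if candidates else text


if __name__ == "__main__":
    # SLP1-encoded example from the sanskrit_parser documentation:
    # 'astyuttarasyAMdiSi' -> asti uttarasyām diśi.
    print(hybrid_split("astyuttarasyAMdiSi"))
```

One rationale for gating the neural output through the symbolic parser, consistent with the abstract’s framing, is that every accepted split remains valid under Pāṇinian rules while the data-driven model supplies disambiguation among the grammatically licensed candidates.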
