Leveraging Pāṇinian Grammar and Neural Models for Morphologically Rich Sanskrit NLP

Abstract

Sanskrit’s rule-based grammatical precision and morphological richness make it a compelling foundation for linguistically informed Natural Language Processing (NLP) systems. Rooted in Pāṇini’s Aṣṭādhyāyī, the language encodes syntactic and semantic relations directly within its word forms, offering structural advantages over languages typically processed with purely statistical models. This study introduces a hybrid framework that integrates symbolic Pāṇinian grammar with neural architectures to improve preprocessing and downstream language understanding. Specifically, sandhi splitting (euphonic decomposition) is re-engineered as an alternative to conventional stopword removal, preserving semantic integrity while improving feature granularity. The framework combines rule-based segmentation (sanskrit_parser) with data-driven sequence models (CharSS and ByT5), evaluated on the SandhiKosh benchmark and the Digital Corpus of Sanskrit for word segmentation, sentiment classification, and syntactic parsing. Experimental results show that sandhi-aware preprocessing yields up to 8 percent higher F1 scores than conventional pipelines, confirming the synergistic potential of grammatical formalism and deep learning. Beyond advancing Sanskrit NLP, the proposed approach offers a transferable methodology for other morphologically rich, low-resource languages, bridging ancient linguistic theory with modern computational intelligence.
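
To make the hybrid pipeline concrete, the sketch below shows one way rule-based and neural sandhi splitting could be combined. It is illustrative only: it assumes the Parser API of the open-source sanskrit_parser package (exact signatures may vary by version), uses a hypothetical fine-tuned ByT5 checkpoint name (byt5-sandhi-finetuned), and the combination heuristic is an assumption of ours, not the paper’s stated method.

```python
# Minimal sketch of a hybrid sandhi-splitting step: enumerate splits licensed
# by Paninian rules, generate a neural hypothesis with ByT5, and keep the
# neural output only when the grammar also licenses it.

from sanskrit_parser import Parser  # rule-based Paninian segmenter (assumed API)
from transformers import AutoTokenizer, T5ForConditionalGeneration


def rule_based_splits(text: str, limit: int = 5) -> list[str]:
    """Enumerate candidate sandhi splits via sanskrit_parser."""
    parser = Parser(input_encoding="SLP1", output_encoding="SLP1")
    # parser.split() yields Split objects that render as word-form lists;
    # it may return None when no grammatical split exists.
    splits = parser.split(text, limit=limit) or []
    return [str(split) for split in splits]


def neural_split(text: str, model_name: str = "byt5-sandhi-finetuned") -> str:
    """Treat sandhi splitting as byte-level seq2seq generation with ByT5.

    `model_name` is a HYPOTHETICAL checkpoint; a real pipeline would first
    fine-tune google/byt5-small on (compound, split) pairs, e.g. SandhiKosh.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = T5ForConditionalGeneration.from_pretrained(model_name)
    inputs = tokenizer(text, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=64)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


def hybrid_split(text: str) -> str:
    """Illustrative combination heuristic (our assumption): accept the neural
    hypothesis only if the rule-based grammar also produces it; otherwise
    back off to the top-ranked rule-based split."""
    candidates = rule_based_splits(text)
    hypothesis = neural_split(text)
    if hypothesis in candidates:
        return hypothesis
    return candidates[0] if candidates else text


if __name__ == "__main__":
    # SLP1-encoded example from the sanskrit_parser documentation:
    # 'astyuttarasyAMdiSi' -> asti uttarasyām diśi.
    print(hybrid_split("astyuttarasyAMdiSi"))
```

One rationale for gating the neural output through the symbolic parser, consistent with the abstract’s framing, is that every accepted split remains valid under Pāṇinian rules while the data-driven model supplies disambiguation among the grammatically licensed candidates.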
