Knowledge Grounded Conversational Access to Heterogeneous Institutional Documents via OCR Enabled Hybrid RAG
Abstract
Academic institutions generate large volumes of information such as academic regulations, examination schedules, policies, circulars, and administrative notices. Conversational systems are increasingly used to help students and staff access this information quickly through natural language queries. However, many existing conversational systems rely mainly on large language models that generate responses from internal model knowledge. Because these systems are not fully grounded in institutional documents, they may produce inaccurate or hallucinated responses and often fail to make effective use of heterogeneous document sources such as scanned notices, PDFs, and legacy records. To address these limitations, this paper proposes a knowledge centric conversational framework for reliable academic and campus guidance. The proposed approach integrates optical character recognition (OCR), hybrid knowledge retrieval, and retrieval augmented generation (RAG) to ensure that responses are grounded in authoritative institutional knowledge. The framework first converts scanned and image based documents into machine readable text using OCR and preprocessing techniques. The extracted text is segmented into semantic knowledge fragments and represented as vector embeddings. A hybrid retrieval mechanism combining semantic similarity search and keyword matching retrieves relevant knowledge fragments, which are then used by a retrieval augmented language model. A rule based reasoning module further validates the generated responses. The system is evaluated on a heterogeneous institutional document dataset containing structured, unstructured, and scanned documents. Experimental results show that the proposed framework achieves 89.6% response accuracy, 0.86 precision, 0.84 recall, and an F1 score of 0.85, while reducing the hallucination rate to 4.1%, demonstrating improved reliability and contextual accuracy compared with baseline conversational systems.
The implementation in this study is publicly available at DOI: https://doi.org/10.5281/zenodo.19230595.
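The hybrid retrieval mechanism described in the abstract, which blends semantic similarity search with keyword matching over knowledge fragments, can be sketched as follows. This is a minimal illustrative implementation, not the paper's actual code: the toy hashing embedding stands in for a real sentence-embedding model, and the blend weight `alpha`, the function names, and the example fragments are assumptions for illustration.

```python
import math
import re
from collections import Counter


def embed(text, dim=64):
    """Toy hashing bag-of-words embedding (illustrative stand-in for a
    real sentence-embedding model used in a production RAG pipeline)."""
    vec = [0.0] * dim
    for tok in re.findall(r"[a-z]+", text.lower()):
        vec[hash(tok) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a, b):
    """Cosine similarity of two pre-normalized vectors (dot product)."""
    return sum(x * y for x, y in zip(a, b))


def keyword_score(query, doc):
    """Fraction of query tokens that also occur in the fragment."""
    q_tokens = set(re.findall(r"[a-z]+", query.lower()))
    d_tokens = Counter(re.findall(r"[a-z]+", doc.lower()))
    return sum(1 for t in q_tokens if t in d_tokens) / (len(q_tokens) or 1)


def hybrid_retrieve(query, fragments, alpha=0.6, top_k=2):
    """Rank knowledge fragments by a weighted blend of semantic
    similarity and keyword overlap; alpha is an assumed blend weight."""
    q_emb = embed(query)
    scored = []
    for frag in fragments:
        score = (alpha * cosine(q_emb, embed(frag))
                 + (1 - alpha) * keyword_score(query, frag))
        scored.append((score, frag))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [frag for _, frag in scored[:top_k]]


# Hypothetical institutional knowledge fragments for demonstration.
fragments = [
    "Examination schedules for the winter semester are published by the registrar.",
    "The library is open from 8 am to 10 pm on weekdays.",
    "Academic regulations require a minimum attendance of 75 percent.",
]
print(hybrid_retrieve("When are the winter examination schedules released?", fragments))
```

The top-ranked fragments would then be passed as context to the retrieval-augmented language model, whose output the rule based reasoning module validates before it reaches the user.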