Knowledge Grounded Conversational Access to Heterogeneous Institutional Documents via OCR Enabled Hybrid RAG
Abstract
Academic institutions generate large volumes of information such as academic regulations, examination schedules, policies, circulars, and administrative notices. Conversational systems are increasingly used to help students and staff access this information quickly through natural language queries. However, many existing conversational systems rely mainly on large language models that generate responses from internal model knowledge. Because these systems are not fully grounded in institutional documents, they may produce inaccurate or hallucinated responses and often fail to make effective use of heterogeneous document sources such as scanned notices, PDFs, and legacy records. To address these limitations, this paper proposes a knowledge centric conversational framework for reliable academic and campus guidance. The proposed approach integrates optical character recognition (OCR), hybrid knowledge retrieval, and retrieval augmented generation (RAG) to ensure that responses are grounded in authoritative institutional knowledge. The framework first converts scanned and image based documents into machine readable text using OCR and preprocessing techniques. The extracted text is segmented into semantic knowledge fragments and represented as vector embeddings. A hybrid retrieval mechanism combining semantic similarity search and keyword matching retrieves relevant knowledge fragments, which are then used by a retrieval augmented language model. A rule based reasoning module further validates the generated responses. The system is evaluated on a heterogeneous institutional document dataset containing structured, unstructured, and scanned documents. Experimental results show that the proposed framework achieves 89.6% response accuracy, 0.86 precision, 0.84 recall, and an F1 score of 0.85, while reducing the hallucination rate to 4.1%, demonstrating improved reliability and contextual accuracy compared with baseline conversational systems.
The implementation in this study is publicly available at DOI: https://doi.org/10.5281/zenodo.19230595.
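The hybrid retrieval mechanism described in the abstract, which blends semantic similarity search with keyword matching over knowledge fragments, can be sketched as follows. This is a minimal illustrative implementation, not the paper's actual code: the toy hashing embedding stands in for a real sentence-embedding model, and the blend weight `alpha`, the function names, and the example fragments are assumptions for illustration.

```python
import math
import re
from collections import Counter


def embed(text, dim=64):
    """Toy hashing bag-of-words embedding (illustrative stand-in for a
    real sentence-embedding model used in a production RAG pipeline)."""
    vec = [0.0] * dim
    for tok in re.findall(r"[a-z]+", text.lower()):
        vec[hash(tok) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a, b):
    """Cosine similarity of two pre-normalized vectors (dot product)."""
    return sum(x * y for x, y in zip(a, b))


def keyword_score(query, doc):
    """Fraction of query tokens that also occur in the fragment."""
    q_tokens = set(re.findall(r"[a-z]+", query.lower()))
    d_tokens = Counter(re.findall(r"[a-z]+", doc.lower()))
    return sum(1 for t in q_tokens if t in d_tokens) / (len(q_tokens) or 1)


def hybrid_retrieve(query, fragments, alpha=0.6, top_k=2):
    """Rank knowledge fragments by a weighted blend of semantic
    similarity and keyword overlap; alpha is an assumed blend weight."""
    q_emb = embed(query)
    scored = []
    for frag in fragments:
        score = (alpha * cosine(q_emb, embed(frag))
                 + (1 - alpha) * keyword_score(query, frag))
        scored.append((score, frag))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [frag for _, frag in scored[:top_k]]


# Hypothetical institutional knowledge fragments for demonstration.
fragments = [
    "Examination schedules for the winter semester are published by the registrar.",
    "The library is open from 8 am to 10 pm on weekdays.",
    "Academic regulations require a minimum attendance of 75 percent.",
]
print(hybrid_retrieve("When are the winter examination schedules released?", fragments))
```

The top-ranked fragments would then be passed as context to the retrieval-augmented language model, whose output the rule based reasoning module validates before it reaches the user.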