Graph-Based RAG for Manuscript Collections: A LangGraph Approach


Abstract

This paper introduces a conversational agent designed for querying digitized historical manuscript collections, developed as part of the MAGIC project. The system incorporates hybrid sparse and dense retrieval, a Neo4j knowledge graph, ALTO XML-based visual grounding, and a multi-step LangGraph ReAct agent powered by a Llama-3.3-70B backend. To identify the most effective retrieval strategy, a benchmark of 100 hand-annotated queries was constructed, covering six query types across two 15th-century incunabula. Multiple retrieval methods were evaluated, including BM25, dense retrieval, hybrid Reciprocal Rank Fusion (RRF), cross-encoder reranking, graph-augmented retrieval, and Hypothetical Document Embeddings (HyDE), using standard information retrieval metrics and latency measurements. The results indicate that Hybrid RRF achieves the most favorable precision–latency trade-off and is positioned on the Pareto frontier for interactive applications. However, no single method demonstrates optimal performance across all query types. Graph-based expansion substantially enhances catalog and complex queries that require relational reasoning, but reduces effectiveness for semantic and entity queries. This outcome supports the adoption of a query-adaptive retrieval strategy within the agent. Furthermore, HyDE consistently underperforms on historical text due to temporal distribution mismatch, resulting in increased latency without corresponding improvements in retrieval quality. A Retrieval-Augmented Generation (RAG) evaluation across three generation backends demonstrates near-perfect faithfulness, indicating reliable grounding. However, answer relevance remains constrained by retrieval precision. These findings identify retrieval, rather than generation, as the primary bottleneck in historical manuscript question answering. All benchmark data, annotations, and system components are made available to support reproducibility and future research.
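The Reciprocal Rank Fusion method highlighted above combines sparse (e.g. BM25) and dense rankings by summing reciprocal ranks. As a minimal sketch, assuming rankings are best-first lists of document IDs (function and variable names here are illustrative, not from the MAGIC codebase):

```python
# Hypothetical sketch of Reciprocal Rank Fusion (RRF); names are
# illustrative and not taken from the system described in the paper.

def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of document IDs.

    rankings: iterable of lists, each ordered best-first.
    k: smoothing constant (60 in the original RRF formulation).
    Returns document IDs sorted by fused score, best first.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every document it ranks.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse a sparse (BM25) ranking with a dense ranking.
bm25_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d3", "d1", "d4"]
fused = rrf_fuse([bm25_ranking, dense_ranking])
```

Because RRF uses only ranks, not raw scores, it needs no score normalization across retrievers, which helps explain its favorable latency in the reported benchmark.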
