Domain-Agnostic Translation of Natural Language Text to Cypher Query Language for GraphRAG
Abstract
GraphRAG is a retrieval-augmented generation (RAG) framework that leverages knowledge graphs. Among the knowledge retrieval techniques used with GraphRAG, subgraph retrieval via Cypher queries is employed by SubGraph Retrieval Augmented Generation (SG-RAG). However, SG-RAG relies on manually crafted Cypher templates, which limits its practicality and scalability in real-world applications. To address this limitation, we propose a domain-agnostic Text-to-Cypher (Text2Cypher) translation model as a flexible subgraph retrieval mechanism for SG-RAG and other GraphRAG-based methods. Because no large-scale, multi-domain Text2Cypher dataset exists, we generate a synthetic multi-domain Text2Cypher dataset and fine-tune a large language model (LLM) on it. Furthermore, we introduce a GPT-based evaluation metric that does not require access to a populated graph database. We evaluate the fine-tuned model on both the generated dataset and the MetaQA benchmark. Experimental results demonstrate that our model significantly outperforms both open-source generative LLMs across multiple few-shot settings and the Text2Cypher model proposed by Neo4j. Finally, we analyze the relationship between the proposed GPT-based evaluation metric and execution-based F1 scores on MetaQA using the Pearson correlation coefficient, revealing a strong positive correlation.