Understanding the Impact of Dataset Characteristics on RAG-based Multi-hop QA Performance
Abstract
Large language models (LLMs) have improved natural language understanding and made question answering (QA) systems common in everyday tools and platforms. As these systems are used more frequently, their performance becomes more important and their tolerance for error decreases. One of the major problems with LLMs is hallucination: when they cannot infer the answer from the given context, they draw on their vast internal knowledge and produce answers that are not supported by the text. To alleviate this problem, Retrieval-Augmented Generation (RAG) systems have been developed, which combine the power of language models with external information sources. In real-world questions, the answer is often not directly stated and must be inferred by reasoning over multiple pieces of information. Multi-hop QA datasets provide a realistic testbed for such systems because their structure requires this kind of reasoning. However, each dataset has different characteristics that can affect performance and impose different architectural requirements. In this study, we evaluate a RAG system on three multi-hop QA datasets: HotpotQA, QASPER, and MultiHopQA. In addition to the overall results, we conduct an in-depth performance analysis by question type, difficulty level, and reasoning complexity to better understand system behavior. The results show that MultiHopQA achieved the best performance (Cosine: 0.961, BERT F1: 0.979), QASPER was the most difficult (Cosine: 0.257, BERT F1: 0.624), and HotpotQA yielded moderate results (Cosine: 0.641, BERT F1: 0.754). BERTScore proved more effective than cosine similarity for measuring semantic alignment across all three datasets.
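As an illustration of the two metrics discussed above, the following is a minimal sketch (not the authors' evaluation code) of scoring a predicted answer against a reference answer with embedding cosine similarity and BERTScore F1. It assumes the `sentence-transformers` and `bert-score` packages; the embedding model name is an assumed common default, not necessarily the one used in the study.

```python
# Sketch: comparing a predicted answer to a gold answer with
# cosine similarity (sentence embeddings) and BERTScore F1.
from sentence_transformers import SentenceTransformer, util
from bert_score import score as bert_score

prediction = "Paris is the capital of France."
reference = "The capital of France is Paris."

# Cosine similarity between sentence embeddings (assumed model choice).
embedder = SentenceTransformer("all-MiniLM-L6-v2")
emb = embedder.encode([prediction, reference], convert_to_tensor=True)
cosine = util.cos_sim(emb[0], emb[1]).item()

# BERTScore F1: token-level semantic alignment between the two texts.
_, _, f1 = bert_score([prediction], [reference], lang="en")

print(f"Cosine similarity: {cosine:.3f}")
print(f"BERTScore F1:      {f1.item():.3f}")
```

Because BERTScore aligns individual tokens rather than comparing a single pooled vector, it tends to be more tolerant of paraphrases and word-order changes, which is consistent with the observation that it tracks semantic agreement more reliably than cosine similarity on these datasets.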