Understanding the Impact of Dataset Characteristics on RAG-based Multi-hop QA Performance
Abstract
Large language models (LLMs) have improved natural language understanding and made question answering (QA) systems common in everyday tools and platforms. As these systems are used more frequently, their performance becomes more important and their tolerance for error decreases. One of the major problems with LLMs is hallucination: when they cannot infer the answer from the given context, they draw on their vast internal knowledge and produce answers that are not supported by the text. To alleviate this problem, Retrieval-Augmented Generation (RAG) systems have been developed, which combine the power of language models with external information sources. In real-world questions, the answer is often not directly stated and must be inferred by reasoning over multiple pieces of information. Multi-hop QA datasets provide a realistic testbed for such systems because their structure requires this kind of reasoning. However, each dataset has different characteristics that can affect performance and impose different architectural requirements. In this study, we evaluate a RAG system on three multi-hop QA datasets: HotpotQA, QASPER, and MultiHopQA. In addition to the overall results, we conduct an in-depth performance analysis by question type, difficulty level, and reasoning complexity to better understand system behavior. The results show that MultiHopQA achieved the best performance (Cosine: 0.961, BERT F1: 0.979), QASPER was the most difficult (Cosine: 0.257, BERT F1: 0.624), and HotpotQA yielded moderate results (Cosine: 0.641, BERT F1: 0.754). BERTScore proved more effective than cosine similarity for measuring semantic alignment across all three datasets.
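As an illustration of the two metrics discussed above, the following is a minimal sketch (not the authors' evaluation code) of scoring a predicted answer against a reference answer with embedding cosine similarity and BERTScore F1. It assumes the `sentence-transformers` and `bert-score` packages; the embedding model name is an assumed common default, not necessarily the one used in the study.

```python
# Sketch: comparing a predicted answer to a gold answer with
# cosine similarity (sentence embeddings) and BERTScore F1.
from sentence_transformers import SentenceTransformer, util
from bert_score import score as bert_score

prediction = "Paris is the capital of France."
reference = "The capital of France is Paris."

# Cosine similarity between sentence embeddings (assumed model choice).
embedder = SentenceTransformer("all-MiniLM-L6-v2")
emb = embedder.encode([prediction, reference], convert_to_tensor=True)
cosine = util.cos_sim(emb[0], emb[1]).item()

# BERTScore F1: token-level semantic alignment between the two texts.
_, _, f1 = bert_score([prediction], [reference], lang="en")

print(f"Cosine similarity: {cosine:.3f}")
print(f"BERTScore F1:      {f1.item():.3f}")
```

Because BERTScore aligns individual tokens rather than comparing a single pooled vector, it tends to be more tolerant of paraphrases and word-order changes, which is consistent with the observation that it tracks semantic agreement more reliably than cosine similarity on these datasets.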