A Comprehensive Evaluation of Embedding Models and LLMs for IR and QA across English and Italian
Abstract
This study presents a comprehensive evaluation of state-of-the-art embedding techniques and large language models (LLMs) for enhancing Information Retrieval (IR) and Question Answering (QA) tasks across multiple languages, with a focus on English and Italian. Our work addresses a critical gap in the current literature by providing empirical evidence of model performance across linguistic boundaries. For IR tasks, we evaluate 12 embedding models across diverse datasets including SQuAD, DICE, SciFact, ArguAna, and NFCorpus. For QA tasks, we employ 4 LLMs (GPT-4o, Llama-3.1 8B, Mistral-Nemo, and Gemma-2b) in a retrieval-augmented generation (RAG) pipeline, evaluating on the SQuAD, CovidQA, and NarrativeQA datasets, including cross-lingual scenarios. Results demonstrate that multilingual models achieve performance competitive with language-specific ones, with embed-multilingual-v3.0 attaining top nDCG@10 scores of 0.90 and 0.86 for English and Italian respectively. In QA tasks, Mistral-Nemo excels in answer relevance (0.91-1.0) while maintaining strong groundedness (0.64-0.78). Our findings reveal that: (1) multilingual embedding models effectively bridge cross-lingual performance gaps, (2) model size does not consistently correlate with performance, and (3) QA systems exhibit a critical trade-off between answer relevance and factual groundedness. Our evaluation framework combines traditional metrics with novel LLM-based assessments, establishing new benchmarks for multilingual language technologies and providing actionable insights and practical guidelines for deploying IR and QA systems in real-world applications.