A Comprehensive Evaluation of Embedding Models and LLMs for IR and QA Across English and Italian

Abstract

This study presents a comprehensive evaluation of embedding techniques and large language models (LLMs) for information retrieval (IR) and question answering (QA) across languages, focusing on English and Italian. We address a significant research gap by providing empirical evidence of model performance across linguistic boundaries. We evaluate 12 embedding models on diverse IR datasets, including Italian SQuAD and DICE and English SciFact, ArguAna, and NFCorpus, and we assess four LLMs (GPT-4o, Llama-3.1 8B, Mistral-Nemo, and Gemma-2b) on QA tasks within a retrieval-augmented generation (RAG) pipeline, evaluating them on the SQuAD, CovidQA, and NarrativeQA datasets, including cross-lingual scenarios. The results show that multilingual models are more competitive than language-specific ones: the embed-multilingual-v3.0 model achieves top nDCG@10 scores of 0.90 for English and 0.86 for Italian. In the QA evaluation, Mistral-Nemo demonstrates superior answer relevance (0.91–1.0) while maintaining strong groundedness (0.64–0.78). Our analysis reveals three key findings: (1) multilingual embedding models effectively bridge performance gaps between English and Italian, although their consistency decreases in specialized domains; (2) model size does not consistently predict performance; and (3) all evaluated QA systems exhibit a critical trade-off between answer relevance and factual groundedness. Our evaluation framework, which combines traditional metrics with innovative LLM-based assessment techniques, establishes new benchmarks for multilingual language technologies and provides actionable insights for deploying IR and QA systems in real-world settings.
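
For readers unfamiliar with the retrieval metric reported above, the short Python sketch below shows how nDCG@10 is typically computed from graded relevance labels. It is a generic illustration based on the standard definition, not the authors' evaluation code, and the relevance labels in the final line are hypothetical.

import math

def dcg_at_k(relevances, k=10):
    # Discounted cumulative gain: each relevance label is discounted
    # by log2 of its rank position (ranks start at 1, so rank + 2 here).
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=10):
    # Normalize by the DCG of the ideal (descending-relevance) ordering.
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical binary relevance labels for one query's top-10 retrieved passages.
print(round(ndcg_at_k([1, 0, 1, 1, 0, 0, 1, 0, 0, 0], k=10), 3))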