ContrastGen: A Multi-Agent Contrastive Framework for Hard Retrieval Data Generation and Mining

Tianci Huang
Haozhao Wang
Junpeng Zhao
Gaosheng Wu
Wenchao Xu
Ruixuan Li


Abstract

An embedding model encodes queries and passages into vectors separately and uses the distance between the two vectors as the basis for retrieval matching, which makes it a core component of retrieval systems. However, because training datasets consist predominantly of simple queries, the embedding model usually fails to develop the ability to handle complex, hard queries, which creates a serious bottleneck and caps its performance. To handle hard queries, existing methods propose either new training strategies tailored to embedding models or query simplification mechanisms at inference time. This paper takes a different approach, orthogonal to both: it tackles the problem at the data level, aiming to improve the embedding model by generating high-quality hard-query training data. Specifically, motivated by the ability of agents to closely simulate human behavior, and with the goal of generating queries that retain semantics and logical knowledge similar to those of human-written queries, the paper proposes a multi-agent framework for generating hard queries to strengthen the training of the embedding model. The core idea is to first use a generation agent to create new queries, and then use specialized agents, such as ones focused on logical reasoning and semantic understanding, to filter the candidates and identify the truly hard ones. Experimental results on different embedding models and datasets show that our method outperforms existing approaches.
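
The generate-then-filter idea described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the names `generation_agent`, `logic_agent`, `semantic_agent`, and `mine_hard_queries` are hypothetical, and the agents are simple rule-based stand-ins for what would presumably be LLM-backed agents in the actual framework.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    query: str
    passage: str


def tokens(text: str) -> List[str]:
    """Lowercase alphabetic tokens; a deliberately crude tokenizer."""
    return re.findall(r"[a-z]+", text.lower())


def generation_agent(passage: str) -> List[str]:
    """Hypothetical stand-in for an LLM agent that proposes queries."""
    topic = " ".join(passage.split()[:2]).lower()
    return [
        f"What is {topic}?",
        f"Why might {topic} struggle if training data contains only simple questions?",
    ]


def logic_agent(c: Candidate) -> bool:
    """Toy check (assumption): a hard query contains a reasoning cue word."""
    cues = {"why", "if", "unless", "because"}
    return bool(cues & set(tokens(c.query)))


def semantic_agent(c: Candidate) -> bool:
    """Toy check (assumption): a hard query has low lexical overlap with its passage."""
    q = tokens(c.query)
    p = set(tokens(c.passage))
    overlap = sum(t in p for t in q) / max(len(q), 1)
    return overlap < 0.5


def mine_hard_queries(
    passages: List[str],
    generate: Callable[[str], List[str]] = generation_agent,
    judges: List[Callable[[Candidate], bool]] = (logic_agent, semantic_agent),
) -> List[Candidate]:
    """Generate candidate queries, then keep those that every judge accepts."""
    kept = []
    for passage in passages:
        for query in generate(passage):
            candidate = Candidate(query=query, passage=passage)
            if all(judge(candidate) for judge in judges):
                kept.append(candidate)
    return kept


if __name__ == "__main__":
    demo = ["Dense retrievers encode queries and passages into vectors."]
    for c in mine_hard_queries(demo):
        print(c.query)
```

In this sketch the judges are combined with a strict "all must accept" rule; a real system might instead weight or rank the agents' verdicts, which the abstract does not specify.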

Version published to 10.20944/preprints202505.1358.v2
May 21, 2025

Related articles

MD2PR: A Multi-level Distillation based Dense Passage Retrieval Model
Authors: Haifeng Li, Mo Hai, Dong Tang
This article has no evaluations. Latest version: Apr 17, 2025

Multimodal and Distributed LLMs: Bridging Scalability and Cross-Modal Reasoning
Authors: Rajesh Kumar, Isabelle Laurent, David Müller, Klaus Elli
This article has no evaluations. Latest version: May 15, 2025

A Graph-Retrieval-Augmented Generation Framework Enhances Decision-Making in the Circular Economy
Authors: Yang Zhao, Chengxiao Dai, Dusit Niyato, Chuan Fu Tan, Keyi Xiang, Yueyang Wang, Zhiquan YEO, Daren TAN, Jonathan Low Zhaozhi, Eugene HO
This article has no evaluations. Latest version: Jun 3, 2025
