KM-Chat: A Large-Scale Synthetic Question-Answer Dataset for Open-Domain Conversational AI

Abstract

Recent advances in large language models (LLMs) have transformed natural language processing, particularly the development of conversational agents. Even so, building robust dialogue systems remains constrained by the limited availability of large-scale, high-quality conversational datasets. To address this gap, this study introduces KM-Chat, a synthetic question–answer dataset designed for open-domain conversational AI research. The dataset consists of 250,003 Q&A pairs generated with state-of-the-art LLMs through a multi-stage pipeline that combines controlled sampling, iterative batch generation, and rigorous post-processing. KM-Chat covers a wide range of conversational contexts across both general-purpose and technical domains, which enhances its contextual diversity and adaptability. By combining scalability, linguistic variety, and structural consistency, KM-Chat provides a resource for training and evaluating dialogue systems and for developing next-generation, human-like conversational models.
