Synthetic Conversation Dataset Using Large Language Models
Abstract
Most existing open-source speech datasets are designed around single-turn interactions, such as question answering, or are drawn from audiobook recordings in which a single speaker talks throughout in a monotone. Such datasets fail to capture the dynamic nature of real-world conversations, which involve multiple speakers, shifting tones, and diverse dialects. Unlike ImageNet, which has played a key role in advancing image recognition, the speech AI research community currently lacks a comprehensive, diverse, multilingual dataset for conversational speech. To fill this gap, we introduce the Multi-Lingual Dialogue Dataset (MLDD), consisting of 200,000 multi-turn dialogue samples. Conversation topics are derived from the New York Times Annotated Corpus, and we use large language model (LLM) capabilities to make the dataset multilingual. The emotional tone of each conversation is set through LLM prompting, while the pitch and speaking rate of the dialogues are controlled through text-to-speech models to mimic real-world conversations. MLDD is generated by prompting an LLM with article titles and summaries from the New York Times, together with a target emotional tone, to produce engaging multi-turn conversations. To demonstrate the utility and complexity of MLDD, we evaluate it using audio-augmented large language models. Our results show the practical applicability of this dataset for building more interactive and nuanced dialogue systems.
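As a sketch of how such prompt-driven dialogue generation might look, the snippet below composes an LLM prompt from an article title, summary, and target emotion. The prompt template, field names, and the `build_dialogue_prompt` helper are illustrative assumptions, not the authors' actual pipeline.

```python
# Illustrative sketch: constructing an LLM prompt for multi-turn dialogue
# generation from a news article's title, summary, and a target emotion.
# The template and parameter names are assumptions, not the MLDD pipeline.

def build_dialogue_prompt(title: str, summary: str, emotion: str,
                          num_turns: int = 6, language: str = "English") -> str:
    """Compose a prompt asking an LLM for a multi-turn, two-speaker dialogue."""
    return (
        f"Generate a {num_turns}-turn conversation in {language} between two "
        f"speakers discussing the following news story. The overall emotional "
        f"tone should be {emotion}.\n\n"
        f"Title: {title}\n"
        f"Summary: {summary}\n\n"
        "Format each turn as 'Speaker A:' or 'Speaker B:'."
    )

# Hypothetical example inputs; the resulting string would be sent to an LLM.
prompt = build_dialogue_prompt(
    title="City Approves New Transit Plan",
    summary="The council voted to expand bus and rail service by 2030.",
    emotion="optimistic",
)
print(prompt)
```

In a full pipeline, the generated dialogue text would then be passed to a text-to-speech model, with per-speaker pitch and speaking-rate controls, to produce the audio samples.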