Synthetic Conversation Dataset Using Large Language Models
Abstract
Most existing open-source speech datasets are designed around single-turn interactions, such as question answering, or are drawn from audiobook recordings in which a single speaker talks throughout in a monotone. Such datasets fail to capture the dynamic nature of real-world conversations, which involve multiple speakers, shifting tones, and diverse dialects. Unlike ImageNet, which has played a key role in advancing image recognition, the speech AI research community currently lacks a comprehensive, diverse, multilingual dataset for conversational speech. To fill this gap, we introduce the Multi-Lingual Dialogue Dataset (MLDD), consisting of 200,000 multi-turn dialogue samples. Conversation topics are derived from the New York Times Annotated Corpus, and we use large language model (LLM) capabilities to make the dataset multilingual. The emotional tone of each conversation is set through LLM prompting, while the pitch and speaking rate of the dialogues are controlled through text-to-speech models to mimic real-world conversations. MLDD is generated by prompting an LLM with article titles and summaries from the New York Times, together with a target emotional tone, to produce engaging multi-turn conversations. To demonstrate the utility and complexity of MLDD, we evaluate it using audio-augmented large language models. Our results show the practical applicability of this dataset for building more interactive and nuanced dialogue systems.
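As a sketch of how such prompt-driven dialogue generation might look, the snippet below composes an LLM prompt from an article title, summary, and target emotion. The prompt template, field names, and the `build_dialogue_prompt` helper are illustrative assumptions, not the authors' actual pipeline.

```python
# Illustrative sketch: constructing an LLM prompt for multi-turn dialogue
# generation from a news article's title, summary, and a target emotion.
# The template and parameter names are assumptions, not the MLDD pipeline.

def build_dialogue_prompt(title: str, summary: str, emotion: str,
                          num_turns: int = 6, language: str = "English") -> str:
    """Compose a prompt asking an LLM for a multi-turn, two-speaker dialogue."""
    return (
        f"Generate a {num_turns}-turn conversation in {language} between two "
        f"speakers discussing the following news story. The overall emotional "
        f"tone should be {emotion}.\n\n"
        f"Title: {title}\n"
        f"Summary: {summary}\n\n"
        "Format each turn as 'Speaker A:' or 'Speaker B:'."
    )

# Hypothetical example inputs; the resulting string would be sent to an LLM.
prompt = build_dialogue_prompt(
    title="City Approves New Transit Plan",
    summary="The council voted to expand bus and rail service by 2030.",
    emotion="optimistic",
)
print(prompt)
```

In a full pipeline, the generated dialogue text would then be passed to a text-to-speech model, with per-speaker pitch and speaking-rate controls, to produce the audio samples.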