IIT Delhi Dialogue Corpus: A Quantitative Analysis of a Spoken Corpus of Hindi

Benu Pareek
Mudafia Zafar
Meghna Hooda
Karan Yadav
Ashwini Vaidya
Samar Husain

Abstract

We present our effort to create a dialogue corpus for Hindi with the aim of understanding (a) the nature of linguistic utterances during naturalistic dialogue, (b) what these linguistic patterns tell us about the cognitive processes/constraints that affect production and comprehension during dialogue, and (c) how such processes/constraints differ from those in written text. We discuss the procedure and pipeline employed to create two sets of spoken data: telephonic conversation data and face-to-face (task-oriented) conversation data. At the lexical level, the data has been annotated for information such as disfluencies and code-switching, and at the syntactic level for part-of-speech tags and dependency relations. We present a preliminary analysis of the created dialogue data and compare it with a written text to discuss the usefulness and implications of this resource for psycholinguistic research.

Version published to 10.31219/osf.io/tc7f5_v1 on OSF Preprints
Feb 24, 2025

Related articles

Conversations From Make-Believe: An Attentive Encoder–Decoder Chatbot Trained on Scripted Dialogue
Sourabh Subhash Rajput
Latest version Jan 29, 2026

Down-sampling strategies in corpus phonology
Lukas Sönning
Latest version Dec 12, 2025

Tense–Aspect Variation Across Oral Narrative Types: A Corpus-Based Comparative Study of Russian and Polish
Katrin Bente Karl
Latest version Jan 6, 2026
