Constructing the Corpus of Children’s Video Media (CCVM): Decisions that Facilitate Corpus Comparisons and Reuse

Abstract

A growing number of psycholinguistic studies use methods from corpus linguistics to examine the language that children encounter in their environment, with the aim of understanding how they might acquire different aspects of linguistic knowledge. Many of these studies focus on child-directed speech or children’s literature, whereas little work has examined children’s television and video media. We describe the creation and contents of the Corpus of Children’s Video Media (CCVM), a specialised corpus designed to represent the spoken language in television and online videos popular among 3- to 5-year-old children in the UK. The CCVM was designed to be comparable to an existing corpus of child-directed speech (CDS). We used a dual sampling approach: inclusion decisions were guided by (a) a survey of parents with children in our target age group, and (b) a survey of programmes available on popular streaming platforms. The resulting corpus consists of 233,471 tokens across 161 transcripts (43.12 hours of video) and is available on the Open Science Framework (OSF) as a database of tokens (including gloss, stem, and lemma forms, and part-of-speech tags), scrambled within transcripts, together with relevant metadata for each transcript. Throughout the paper, we discuss the challenges of creating a corpus that is comparable to existing datasets and highlight the importance of transparency in this process. We take an open science approach, sharing a detailed data collection and processing protocol, code, and data so that the corpus can be evaluated, extended, and used appropriately by other research teams.
