Constructing the Corpus of Children’s Video Media (CCVM): A New Resource and Guidelines for Constructing Comparable and Reusable Corpora
Abstract
A growing number of psycholinguistic studies use methods from corpus linguistics to examine the language that children encounter in their environment, with the aim of understanding how they might acquire different aspects of linguistic knowledge. Many of these studies focus on child-directed speech or children’s literature; few focus on children’s television and video media. We describe the creation and contents of the Corpus of Children’s Video Media (CCVM), a specialised corpus designed to represent the spoken language in television and online videos popular among 3- to 5-year-old children in the UK. The CCVM was designed to be comparable to an existing corpus of child-directed speech (CDS). We used a dual sampling approach: inclusion decisions were guided by (a) a survey of parents with children in our target age group, and (b) a survey of programmes available on popular streaming platforms. The corpus consists of 233,471 tokens across 161 transcripts (43.12 hours of video). It is available on the Open Science Framework (OSF) as a scrambled database of tokens (including gloss, stem and lemma forms, and part-of-speech tags), organised within transcripts, together with relevant metadata for each transcript. We discuss the challenges of creating a corpus that is comparable to existing datasets and highlight the importance of transparency in this process. We take an open science approach, sharing a detailed data collection and processing protocol, code, and data so that the corpus can be evaluated, extended, and used appropriately by other research teams.
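To make the idea of a "scrambled database of tokens, organised within transcripts" concrete, the following is a minimal sketch rather than the authors' actual procedure. It assumes that scrambling means shuffling token order inside each transcript while keeping each token's annotations attached; the abstract does not specify the method. The field names (`transcript_id`, `gloss`, `lemma`, `pos`), the function `scramble_within_transcripts`, and the toy rows are all hypothetical.

```python
import random
from collections import Counter


def scramble_within_transcripts(rows, seed=0):
    """Shuffle token order inside each transcript (assumed scrambling scheme).

    Each row is a dict describing one token. Annotations stay attached to
    their token; only the order of tokens within a transcript changes.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible

    # Group token rows by transcript, preserving first-seen transcript order.
    by_transcript = {}
    for row in rows:
        by_transcript.setdefault(row["transcript_id"], []).append(row)

    scrambled = []
    for tokens in by_transcript.values():
        shuffled = list(tokens)  # copy so the input is left untouched
        rng.shuffle(shuffled)
        scrambled.extend(shuffled)
    return scrambled


def lemma_counts(rows):
    """Count (transcript, lemma) pairs; word order does not affect this."""
    return Counter((r["transcript_id"], r["lemma"]) for r in rows)


# Hypothetical toy data, not taken from the CCVM.
rows = [
    {"transcript_id": "t01", "gloss": "the", "lemma": "the", "pos": "det"},
    {"transcript_id": "t01", "gloss": "dogs", "lemma": "dog", "pos": "n"},
    {"transcript_id": "t01", "gloss": "ran", "lemma": "run", "pos": "v"},
    {"transcript_id": "t02", "gloss": "look", "lemma": "look", "pos": "v"},
    {"transcript_id": "t02", "gloss": "there", "lemma": "there", "pos": "adv"},
]

scrambled = scramble_within_transcripts(rows)

# Frequency-based measures are unchanged by scrambling under this scheme.
same_counts = lemma_counts(rows) == lemma_counts(scrambled)
print(same_counts)
```

Under this assumed scheme, per-transcript token, lemma, and part-of-speech frequencies are preserved, whereas measures that depend on word order (for example n-grams or utterance length) cannot be computed from the scrambled rows.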