Constructing the Corpus of Children’s Video Media (CCVM): A New Resource and Guidelines for Constructing Comparable and Reusable Corpora

Anna Gowenlock
Jennifer M Rodd
Beth Malory
Courtenay Norbury

Abstract

A growing number of psycholinguistic studies use methods from corpus linguistics to examine the language that children encounter in their environment, with the aim of understanding how they might acquire different aspects of linguistic knowledge. Many of these studies focus on child-directed speech or children’s literature, whereas little work has focused on children’s television and video media. We describe the creation and contents of the Corpus of Children’s Video Media (CCVM), a specialised corpus designed to represent the spoken language in television and online videos popular among 3- to 5-year-old children in the UK. The CCVM was designed to be comparable to an existing corpus of child-directed speech (CDS). We used a dual sampling approach: inclusion decisions were guided by (a) a survey of parents with children in our target age group, and (b) a survey of programmes available on popular streaming platforms. The corpus consists of 233,471 tokens across 161 transcripts (43.12 hours of video) and is available on the Open Science Framework (OSF) as a scrambled database of tokens (including gloss, stem and lemma forms, and part-of-speech tags), organised within transcripts, together with relevant metadata for each transcript. We discuss the challenges of creating a corpus that is comparable to existing datasets and highlight the importance of transparency in this process. We take an open science approach, sharing a detailed data collection and processing protocol, code, and data so that the corpus can be evaluated, extended, and used appropriately by other research teams.

Version published to 10.31219/osf.io/vptxw_v2 on OSF Preprints
Nov 21, 2025
Version published to 10.31219/osf.io/vptxw_v1 on OSF Preprints
Jul 25, 2025

Related articles

Learning from the input: a corpus-based investigation of Chinese classifiers in children’s books and child-directed speech

This article has 5 authors:
1. Jinyu Shi
2. Yaling Hsiao
3. Yifan Yang
4. Elizabeth Wonnacott
5. Kate Nation
This article has no evaluations. Latest version: Feb 4, 2026