Down-sampling strategies in corpus phonology
Abstract
Corpus-based work in segmental phonology is often forced to down-size the pool of relevant tokens to a manageable subset. The standard approach in corpus software is to draw a random sample of observations. In corpus phonology, however, this is inefficient, since tokens are usually clustered by Speaker and Word. Previous work has shown that such data layouts are preferably scaled down using structured down-sampling, which aims for balanced token distributions across speakers and lexical items and is particularly useful for the study of speaker- and item-level predictors. Using a case study on voice onset time, the present chapter adopts a simulation approach to extend the evaluation of down-sampling designs to predictor variables measured at the level of the individual corpus hits. As in earlier work, a reference model fit to the full set of 20,194 tokens serves as a benchmark, allowing us to compare different designs in terms of accuracy and statistical precision. We observe that while structured down-sampling performs better for word-level predictors, the picture for token-level features is mixed. Further, we introduce the notion of a down-sampling sequence, which avoids a priori decisions on down-sample size and allows for a dynamic evaluation of methods. We note that, for our illustrative data, gains in accuracy and precision diminish for down-samples exceeding 25% of the full set of data. Finally, we address an apparent weakness of structured down-sampling: it returns inferior estimates of random-effects parameters in mixed-effects regression models. We show how a more sophisticated implementation of structured down-sampling may help overcome this limitation.
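To make the contrast between the two schemes concrete, the following is a minimal sketch (not the chapter's actual implementation): plain random down-sampling draws tokens irrespective of their clustering, while structured down-sampling caps the number of tokens per speaker-by-word cell to balance the design. The token fields `speaker`, `word`, and `vot`, and the per-cell cap, are illustrative assumptions.

```python
import random
from collections import defaultdict

def random_downsample(tokens, n, seed=0):
    """Plain random down-sample: n tokens drawn without replacement,
    ignoring the clustering of tokens by speaker and word."""
    rng = random.Random(seed)
    return rng.sample(tokens, n)

def structured_downsample(tokens, per_cell, seed=0):
    """Structured down-sample: retain at most `per_cell` tokens per
    speaker-by-word cell, yielding a more balanced token distribution
    across speakers and lexical items."""
    rng = random.Random(seed)
    cells = defaultdict(list)
    for tok in tokens:
        cells[(tok["speaker"], tok["word"])].append(tok)
    sample = []
    for cell_tokens in cells.values():
        k = min(per_cell, len(cell_tokens))
        sample.extend(rng.sample(cell_tokens, k))
    return sample

# Toy data: tokens heavily clustered by speaker and word
tokens = [{"speaker": s, "word": w, "vot": i}
          for i, (s, w) in enumerate(
              [("A", "cat")] * 10 + [("A", "dog")] * 2 +
              [("B", "cat")] * 1 + [("B", "dog")] * 7)]

balanced = structured_downsample(tokens, per_cell=2)
# Cells contribute min(per_cell, cell size): 2 + 2 + 1 + 2 = 7 tokens
```

A plain random sample of seven tokens from these data would, in expectation, over-represent the two large cells; the structured sample keeps all four speaker-word combinations represented as evenly as the data allow.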