Down-sampling strategies in corpus phonology
Abstract
Corpus-based work in segmental phonology is often forced to down-size the pool of relevant tokens to a manageable subset. The standard approach in corpus software is to draw a random sample of observations. In corpus phonology, however, this is inefficient, since tokens are usually clustered by Speaker and Word. Previous work has shown that such data layouts are preferably scaled down using structured down-sampling, which aims for balanced token distributions across speakers and lexical items and is particularly useful for the study of speaker- and item-level predictors. Using a case study on voice onset time, the present chapter adopts a simulation approach to extend the evaluation of down-sampling designs to predictor variables measured at the level of the individual corpus hits. As in earlier work, a reference model fit to the full set of 20,194 tokens serves as a benchmark, allowing us to compare different designs in terms of accuracy and statistical precision. We observe that while structured down-sampling performs better for word-level predictors, the picture for token-level features is mixed. Further, we introduce the notion of a down-sampling sequence, which avoids a priori decisions on down-sample size and allows for a dynamic evaluation of methods. We note that, for our illustrative data, gains in accuracy and precision diminish for down-samples exceeding 25% of the full set of data. Finally, we address an apparent weakness of structured down-sampling: it returns inferior estimates of random-effects parameters in mixed-effects regression models. We show how a more sophisticated implementation of structured down-sampling may help overcome this limitation.
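To make the contrast between the two schemes concrete, the following is a minimal sketch (not the chapter's actual implementation): plain random down-sampling draws tokens irrespective of their clustering, while structured down-sampling caps the number of tokens per speaker-by-word cell to balance the design. The token fields `speaker`, `word`, and `vot`, and the per-cell cap, are illustrative assumptions.

```python
import random
from collections import defaultdict

def random_downsample(tokens, n, seed=0):
    """Plain random down-sample: n tokens drawn without replacement,
    ignoring the clustering of tokens by speaker and word."""
    rng = random.Random(seed)
    return rng.sample(tokens, n)

def structured_downsample(tokens, per_cell, seed=0):
    """Structured down-sample: retain at most `per_cell` tokens per
    speaker-by-word cell, yielding a more balanced token distribution
    across speakers and lexical items."""
    rng = random.Random(seed)
    cells = defaultdict(list)
    for tok in tokens:
        cells[(tok["speaker"], tok["word"])].append(tok)
    sample = []
    for cell_tokens in cells.values():
        k = min(per_cell, len(cell_tokens))
        sample.extend(rng.sample(cell_tokens, k))
    return sample

# Toy data: tokens heavily clustered by speaker and word
tokens = [{"speaker": s, "word": w, "vot": i}
          for i, (s, w) in enumerate(
              [("A", "cat")] * 10 + [("A", "dog")] * 2 +
              [("B", "cat")] * 1 + [("B", "dog")] * 7)]

balanced = structured_downsample(tokens, per_cell=2)
# Cells contribute min(per_cell, cell size): 2 + 2 + 1 + 2 = 7 tokens
```

A plain random sample of seven tokens from these data would, in expectation, over-represent the two large cells; the structured sample keeps all four speaker-word combinations represented as evenly as the data allow.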