Case-control down-sampling in corpus research
Abstract
When corpus researchers are forced to down-size their data, different down-sampling techniques can be used to increase the amount of information in the subset of corpus hits. In alternation studies, one strategy is to select tokens based on the observed realization of the outcome variable. A survey of down-sampling work shows that such response-sensitive designs are used at a noticeable rate in corpus research. In the health sciences, where this approach is referred to as a case-control design, a rich methodology has evolved over the past 60 years. Corpus linguists should therefore actively sound out the potential for methodological transfer to our field. The present paper takes a step in this direction and pursues three goals. The first is to overcome terminological barriers by making transparent the peculiar jargon associated with this research design. Second, I will provide an overview of some principles of study design and data analysis that form the core of this method, and present illustrative analyses using data on the English dative alternation. Finally, I will use the insights provided by the survey to identify distinctive features of case-control down-sampling. This allows for a focused exploration of the extensive literature surrounding the approach and also provides guideposts for future methodological work.