Juggling internal and external factors when modeling natural language data
Abstract
This final chapter takes a broader look at the role of internal and external factors in the statistical analysis of natural language data. The distinction, it will be argued, helps us handle issues that are bound to arise when dealing with observational data. The focus will be on two problems that can arise in corpus-based variationist research: (i) representativeness, that is, the question of whether the sample mirrors the target population on relevant characteristics, and (ii) (multi-)collinearity, that is, a non-negligible level of association between predictor variables. This chapter demonstrates that the internal–external dichotomy allows us to approach these issues in a linguistically informed way. We start by arguing that the two types of factors differ fundamentally as regards questions of representativeness. Then we describe how the dichotomy allows us to deal with collinearity issues in a principled manner, using a model of the causal relations between predictor variables. As we will see, the non-trivial task of outlining such a model is facilitated by the distinction between internal and external predictor variables. The aim of this chapter is to submit a number of broader heuristics that may provide orientation for corpus-based research on language variation and change.