Does corpus size influence normalised frequencies?

Abstract

Several frequency-based measures, such as lexical diversity and text similarity measures, are influenced by corpus size. It is largely taken for granted that normalised frequencies correct for the influence of corpus size, but whether and how they are themselves influenced by corpus size has not yet been tested systematically. The central question is whether the normalised frequency of an element in a smaller corpus can be meaningfully compared to the normalised frequency of the same element in a larger corpus. We test the association between lists of normalised frequencies derived from corpus samples of different sizes in six languages. Our results suggest that the size of the underlying corpora does not impair comparisons of normalised frequency lists: normalised frequencies remain comparable across different corpus sizes. For lower-frequency types, however, these associations decrease rather quickly. These empirical findings converge with predictions from statistical theory.
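
To make the comparison concrete, the following is a minimal sketch of how normalised frequency lists from samples of different sizes might be associated. It is not the procedure used in the preprint: the toy vocabulary, the Zipf-like weights, the sample sizes, the per-million scaling and the use of Pearson's r over shared types are all assumptions chosen for illustration, and the helper names `normalised_frequencies` and `pearson` are hypothetical.

```python
import math
import random
from collections import Counter


def normalised_frequencies(tokens, per=1_000_000):
    """Scale each type's raw count to occurrences per `per` tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return {t: c * per / total for t, c in counts.items()}


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Assumed toy "language": 200 types with Zipf-like probabilities.
rng = random.Random(0)
vocab = [f"w{i}" for i in range(1, 201)]
weights = [1 / i for i in range(1, 201)]

# Two samples from the same distribution, differing in size by a factor of 100.
small = rng.choices(vocab, weights=weights, k=2_000)
large = rng.choices(vocab, weights=weights, k=200_000)

nf_small = normalised_frequencies(small)
nf_large = normalised_frequencies(large)

# Compare only types attested in both samples.
shared = sorted(set(nf_small) & set(nf_large))
r = pearson([nf_small[t] for t in shared], [nf_large[t] for t in shared])
print(f"shared types: {len(shared)}, Pearson r: {r:.3f}")
```

Restricting `shared` to rarer types (for example, those below some normalised frequency in the larger sample) would be one way to explore how the association behaves for lower-frequency types in such a toy setting.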
