The Classical Model of Type-Token Systems Compared with Items from the Standardized Project Gutenberg Corpus


Abstract

We compare the “classical” equations of type-token systems, namely Zipf’s laws, Heaps’ law and the relationships between their indices, with data selected from the Standardized Project Gutenberg Corpus (SPGC). All selected items exceed 100,000 word-tokens and are trimmed to 100,000 word-tokens each. With the most egregious anomalies removed, a dataset of 8,432 items is examined in terms of the relationships between the Zipf and Heaps indices computed using the Maximum Likelihood algorithm. Zipf’s second (size) law indices suggest that the types vs. frequency distribution is log-log convex, with the high- and low-frequency indices showing weak but significant negative correlation. Under certain circumstances the classical equations work tolerably well, though the level of agreement depends heavily on the type of literature and the language (Finnish being notably anomalous). The frequency vs. rank characteristics exhibit log-log linearity in the “middle range” (ranks 100 to 1000), as characterized by the Kolmogorov-Smirnov significance. For most items, the Heaps’ index correlates strongly with the low-frequency Zipf index in a manner consistent with classical theory, while the high-frequency indices are largely uncorrelated. This behavior is consistent with a simple simulation.
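
As a rough illustration of the quantities named in the abstract, the following Python sketch computes a type-token growth curve and a rank-frequency list from a token sequence, then estimates a Heaps index and a Zipf rank-frequency index from log-log slopes. It is only a sketch under stated assumptions: the function names are hypothetical, and ordinary least squares on log-log axes is used here as a simple stand-in, not the Maximum Likelihood algorithm the article refers to.

```python
import math
from collections import Counter


def heaps_curve(tokens):
    """Number of distinct types seen after each successive token."""
    seen = set()
    curve = []
    for tok in tokens:
        seen.add(tok)
        curve.append(len(seen))
    return curve


def rank_frequency(tokens):
    """Type frequencies sorted from most to least frequent (rank 1 first)."""
    return sorted(Counter(tokens).values(), reverse=True)


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x).

    Assumption: a simple stand-in for the maximum likelihood
    estimation mentioned in the abstract.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx = sum(lx) / n
    my = sum(ly) / n
    sxx = sum((a - mx) ** 2 for a in lx)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    return sxy / sxx


def heaps_index(tokens):
    """Slope of log(types) against log(tokens)."""
    curve = heaps_curve(tokens)
    return loglog_slope(range(1, len(curve) + 1), curve)


def zipf_index(tokens, lo=1, hi=None):
    """Negated slope of log(frequency) against log(rank) for ranks lo..hi."""
    freqs = rank_frequency(tokens)
    hi = len(freqs) if hi is None else min(hi, len(freqs))
    ranks = range(lo, hi + 1)
    return -loglog_slope(ranks, freqs[lo - 1:hi])
```

For a real item one might call `zipf_index(tokens, lo=100, hi=1000)` to restrict the fit to the “middle range” of ranks mentioned above; the choice of estimator and range will affect the values obtained.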
