A Systematic Evaluation of Dutch Large Language Models’ Surprisal Estimates in Sentence, Paragraph, and Book Reading
Abstract
Studies using computational estimates of word predictability from neural language models have garnered strong evidence in favour of surprisal theory: upon encountering a word, readers experience processing difficulty that is a linear function of that word's surprisal. Evidence for this effect has been established for English or by using multilingual models to estimate surprisal across languages. At the same time, many language-specific models of unknown psychometric quality are made openly available. Here, we provide a systematic evaluation of a collection of large language models specifically designed for Dutch, examining how well their surprisal estimates account for reading times. We compare their performance to a multilingual model (mGPT) and an N-gram model. Across three eye-tracking corpora, a Dutch model predicted reading times better than the multilingual model. Dutch large language models replicate the general inverse scaling trend observed for English, with the surprisal estimates of smaller models showing a better fit to reading times than those of the largest models; however, this effect depends partly on the corpus used to evaluate the model. Surprisingly, in contrast to the linear effect of surprisal on reading times observed in the other corpora, a non-linear link fitted the GECO corpus best. Overall, these results offer a psychometric leaderboard of Dutch large language models and challenge the notion of a ubiquitous linear effect of surprisal. The complete set of surprisal estimates derived from all neural language models across the three corpora, along with the code used to extract them, is made publicly available (https://osf.io/wr4qf/).
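Surprisal here denotes the negative log-probability of a word given its preceding context, $s(w_t) = -\log_2 P(w_t \mid w_1, \dots, w_{t-1})$. The authors' own extraction code is available at the OSF link above; the sketch below is only a minimal, hypothetical illustration of how per-token surprisal can be obtained from a Hugging Face causal language model. The checkpoint name `ai-forever/mGPT` (the multilingual baseline mentioned in the abstract) and the example sentence are assumptions, not taken from the paper; any Dutch causal language model could be substituted.

```python
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: the multilingual mGPT baseline mentioned in the abstract;
# substitute any Dutch causal language model from the Hugging Face Hub.
MODEL_NAME = "ai-forever/mGPT"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def token_surprisals(text: str) -> list[tuple[str, float]]:
    """Surprisal in bits, -log2 p(token | left context), for every token after the first."""
    ids = tokenizer(text, return_tensors="pt").input_ids  # shape: (1, T)
    with torch.no_grad():
        logits = model(ids).logits  # shape: (1, T, vocab)
    # The logits at position t predict the token at position t + 1.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = ids[0, 1:]
    chosen = log_probs[torch.arange(targets.size(0)), targets]
    bits = (-chosen / math.log(2)).tolist()
    return list(zip(tokenizer.convert_ids_to_tokens(targets.tolist()), bits))

# Hypothetical example; a word's surprisal is the sum over its subword tokens.
print(token_surprisals("De kat zat op de mat."))
```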