Can Large Language Models generate useful linguistic corpora? A case study of the word frequency effect in young German readers


Abstract

Linguistic corpora are an essential resource in psycholinguistic research. Here, we generate new corpora using large language models (LLMs) and assess their usefulness for estimating the word frequency effect on reading performance, focusing on German children. We prompted three different LLMs to create corpora of children's stories from the titles of 500 books, mimicking an existing corpus of children's books (childLex). In Experiment 1, we found that word frequency correlated strongly between childLex and the LLM corpora, despite the lower lexical richness of the LLM text. Compared to childLex, the estimated effect size of the LLM-based word frequency effect was smaller, but it explained more variance in reading performance (reaction times for about 1,000 words in a lexical decision task). In Experiment 2, we found that prompting for child-directed text yields word frequencies that fit child reading times better than adult reading times, and that increasing the sampling temperature can increase lexical richness. In Experiment 3, we replicated Experiment 1 using two open-weight LLMs. Across all 10 corpora (9 of which were LLM-based), we found that corpora with lower lexical richness generally fit reaction times better. We discuss the potential of this approach, considering the risks associated with using highly complex LLMs.
