Can Large Language Models generate useful linguistic corpora? A case study of the word frequency effect in young German readers


Abstract

Linguistic corpora are an essential resource in psycholinguistic research. Here, we generate new corpora using large language models (LLMs) and assess their usefulness for estimating the word frequency effect on reading performance, focusing on German children. We prompted three different LLMs to create corpora of children's stories from the titles of 500 books, mimicking an existing corpus of children's books (childLex). In Experiment 1, we found that word frequency correlated strongly between childLex and the LLM corpora, despite the lower lexical richness of the LLM text. Compared to childLex, the estimated effect size of the LLM-based word frequency effect was smaller, but it explained more variance in reading performance (reaction times for about 1,000 words in a lexical decision task). In Experiment 2, we found that prompting for child-directed text yields word frequencies that fit child reading times better than adult reading times, and that increasing the sampling temperature can increase lexical richness. In Experiment 3, we replicated Experiment 1 using two open-weight LLMs. Across all 10 corpora (9 of which were LLM-based), we found that corpora with lower lexical richness generally fit reaction times better. We discuss the potential of this approach, considering the risks associated with using highly complex LLMs.
