Core vocabulary reveals differences between human word prediction and large language models
Abstract
The question of which words are the most central or important to a language has been explored in various ways. In this study, we propose definitions of core vocabulary that are based on how language is learned, represented, and processed from psychological perspectives, and test these on a word prediction task. We aim to (1) compare core vocabulary based on word frequency in natural language, the content of word associations, and age of acquisition in terms of how well the words are guessed in word prediction contexts, and (2) investigate the extent to which word prediction in language models aligns with humans and, if there are systematic differences between them, whether these can be captured by core vocabulary measures. Across two experiments, 867 participants completed a task that involved guessing target words missing from sentence contexts. Natural language-based core words were generally easier to guess, but once the degree to which these words were naturally predictable in the linguistic environment was taken into account, word association- and acquisition-based core words were easier to predict for reasons that went beyond this. Additionally, language models were able to account for people's word prediction responses to a considerable extent, but there were also systematic deviations from these predictions that could be captured by word association-based coreness. The findings suggest that distributional relationships between words in text are not all there is to human word prediction, and that people may also rely on factors like communicative usefulness and multimodal or extralinguistic information.