Core vocabulary reveals differences between human word prediction and large language models
Abstract
The question of which words are the most central or important to a language has been explored in various ways. In this study, we propose definitions of core vocabulary that are based on how language is learned, represented, and processed from psychological perspectives, and test these on a word prediction task. We aim to (1) compare core vocabulary based on word frequency in natural language, the content of word associations, and age of acquisition in terms of how well the words are guessed in word prediction contexts, and (2) investigate the extent to which word prediction in language models aligns with humans and, if there are systematic differences between them, whether these can be captured by core vocabulary measures. Across two experiments, 867 participants completed a task that involved guessing target words missing from sentence contexts. Natural language-based core words were generally easier to guess, but once the degree to which these words were naturally predictable in the linguistic environment was taken into account, word association- and acquisition-based core words were easier to predict for reasons that went beyond this. Additionally, language models were able to account for people's word prediction responses to a considerable extent, but there were also systematic deviations from these predictions that could be captured by word association-based coreness. The findings suggest that distributional relationships between words in text are not all there is to human word prediction, and that people may also rely on factors like communicative usefulness and multimodal or extralinguistic information.