AI-generated estimates of Dutch words and expressions
Abstract
This study introduces and validates GPT_FAM, an AI-generated resource of familiarity estimates for 935,000 Dutch words and 201,000 multiword expressions. Based on previous studies, we hypothesized that such estimates, particularly when fine-tuned using a few thousand human ratings, would offer a useful, scalable measure of verbal knowledge. The results confirmed this expectation: fine-tuned GPT estimates correlate well with word prevalence, which reflects the likelihood that a word is recognized. Equally important, GPT_FAM estimates significantly predict response latencies in lexical decision tasks and emerge as the most robust predictor in a random forest analysis that also includes word frequency and length. The measure may be especially useful for assessing the difficulty of morphologically complex items, such as inflected word forms and transparent compounds, where traditional frequency metrics tend to be ineffective. Both untuned and fine-tuned estimates are freely available for research and educational purposes.