When ChatGPT-4o Is (Less) Human-Like: Preliminary Subjective Rating Tests for Psycholinguistic Research
Abstract
This brief report explores the use of large language models, especially ChatGPT-4o, for preliminary subjective rating tests on multiword units in psycholinguistic research. We asked GPT-4o to rate multiword units on their Idiomaticity, Meaningfulness, and event Plausibility. A series of correlation analyses showed that while all GPT-generated rating scores correlated significantly with human ratings, the strength of the correlation varied across the tests. Specifically, the correlation coefficient for Plausibility was significantly lower than the other two, whereas no significant difference was found between Idiomaticity and Meaningfulness. Moreover, when we used the GPT-generated Idiomaticity and Meaningfulness scores to replicate the statistical analyses of Jolsvai et al. (2020), the results were not comparable to those of the original study. The potential uses and limitations of ChatGPT-4o for psycholinguistic research are discussed.
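The abstract does not reproduce the authors' prompts or analysis code. As a rough illustration only, the following minimal sketch shows how one might collect GPT-4o ratings for multiword units and correlate them with human norms, assuming the OpenAI Python SDK and SciPy. The prompt wording, the 1-to-7 scale, the example phrases, the hypothetical human scores, and the choice of Spearman's rho (the report may have used a different coefficient) are all assumptions, not the authors' materials.

```python
# Minimal sketch: eliciting GPT-4o rating scores for multiword units and
# correlating them with human norms. Prompt wording, the 1-7 scale, and the
# example items below are hypothetical, not the study's actual materials.
from openai import OpenAI
from scipy.stats import spearmanr

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PHRASES = ["kick the bucket", "read the book", "eat the sky"]  # hypothetical items
HUMAN_SCORES = [6.4, 5.9, 1.8]                                 # hypothetical norms

def rate(phrase: str, dimension: str) -> float:
    """Ask GPT-4o for a single numeric rating on one dimension."""
    prompt = (
        f"On a scale from 1 (very low) to 7 (very high), rate the "
        f"{dimension} of the phrase: '{phrase}'. Reply with the number only."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # reduce sampling variability for rating tasks
    )
    return float(resp.choices[0].message.content.strip())

# Collect model ratings for one dimension and correlate with human norms.
gpt_scores = [rate(p, "idiomaticity") for p in PHRASES]
rho, p_value = spearmanr(gpt_scores, HUMAN_SCORES)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```

In practice, the same loop would be run once per dimension (Idiomaticity, Meaningfulness, Plausibility), and the resulting coefficients could then be compared with a test for differences between correlations, as the report's comparison across the three tests implies.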