When ChatGPT-4o Is (Less) Human-Like: Preliminary Subjective Rating Tests for Psycholinguistic Research
Abstract
This brief report explores the use of large language models, especially ChatGPT-4o, for preliminary subjective rating tests on multiword units in psycholinguistic research. We asked GPT-4o to rate multiword units on their Idiomaticity, Meaningfulness, and event Plausibility. A series of correlation analyses showed that while all GPT-generated rating scores correlated significantly with human ratings, the strength of the correlation varied across the tests. Specifically, the correlation coefficient for Plausibility was significantly lower than the other two, whereas no significant difference was found between Idiomaticity and Meaningfulness. Moreover, when we used the GPT-generated Idiomaticity and Meaningfulness scores to replicate the statistical analyses of Jolsvai et al. (2020), the results were not comparable to those of the original study. The potential uses and limitations of ChatGPT-4o for psycholinguistic research are discussed.
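The abstract does not reproduce the authors' prompts or analysis code. As a rough illustration only, the following minimal sketch shows how one might collect GPT-4o ratings for multiword units and correlate them with human norms, assuming the OpenAI Python SDK and SciPy. The prompt wording, the 1-to-7 scale, the example phrases, the hypothetical human scores, and the choice of Spearman's rho (the report may have used a different coefficient) are all assumptions, not the authors' materials.

```python
# Minimal sketch: eliciting GPT-4o rating scores for multiword units and
# correlating them with human norms. Prompt wording, the 1-7 scale, and the
# example items below are hypothetical, not the study's actual materials.
from openai import OpenAI
from scipy.stats import spearmanr

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PHRASES = ["kick the bucket", "read the book", "eat the sky"]  # hypothetical items
HUMAN_SCORES = [6.4, 5.9, 1.8]                                 # hypothetical norms

def rate(phrase: str, dimension: str) -> float:
    """Ask GPT-4o for a single numeric rating on one dimension."""
    prompt = (
        f"On a scale from 1 (very low) to 7 (very high), rate the "
        f"{dimension} of the phrase: '{phrase}'. Reply with the number only."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # reduce sampling variability for rating tasks
    )
    return float(resp.choices[0].message.content.strip())

# Collect model ratings for one dimension and correlate with human norms.
gpt_scores = [rate(p, "idiomaticity") for p in PHRASES]
rho, p_value = spearmanr(gpt_scores, HUMAN_SCORES)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```

In practice, the same loop would be run once per dimension (Idiomaticity, Meaningfulness, Plausibility), and the resulting coefficients could then be compared with a test for differences between correlations, as the report's comparison across the three tests implies.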