Evaluating Large Language Model Performance and Reliability in Scoring Picture Description Tasks for Neuropsychological Assessment

Abstract

Background: Picture description tasks, such as the Cookie Theft task, are widely used in neuropsychological assessments to detect cognitive impairment. However, manual scoring is time-consuming, requires specialized training, and is subject to interrater variability. Recent advances in natural language processing, particularly large language models (LLMs), offer a promising way to automate and standardize the scoring process.

Methods: This study evaluated the performance and reliability of five LLMs (GPT-4 Turbo, GPT-4o, Claude 3 Opus, Claude 3 Sonnet, and Llama 3 70B) in scoring the Cookie Theft picture description task. A subset of 25 participants was selected from the DementiaBank corpus. The LLMs were tasked with scoring 22 content units in the participants’ responses using several prompt strategies, including few-shot learning, prompt chaining, and self-consistency. LLM performance was compared against the consensus score of three human raters.

Results: The LLMs scored the Cookie Theft task with accuracy comparable to that of human raters, with no significant difference in mean absolute error (MAE) between the best-performing models and the human raters. Few-shot learning significantly improved LLM performance, whereas prompt chaining and self-consistency offered limited benefit. Claude 3 Opus and GPT-4o were the most accurate and reliable models. Notably, the LLMs showed significantly higher interrater reliability than the human raters.

Conclusion: These findings demonstrate the potential of LLMs to score picture description tasks accurately and reliably, offering a promising approach to streamlining and standardizing neuropsychological assessments. By automating the scoring process, clinicians and researchers can benefit from increased efficiency, reduced subjectivity, and improved scalability in evaluating cognitive function.
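To make the scoring pipeline described in the Methods concrete, the sketch below shows one way a few-shot prompt could ask an LLM to mark content units as present or absent in a transcript, and how MAE against a human consensus score could then be computed. This is a minimal illustration only: the content units, example transcript, prompt wording, and the choice of the OpenAI Python SDK are assumptions made for demonstration, not the study’s actual rubric, prompts, or tooling.

```python
# Minimal sketch: few-shot scoring of Cookie Theft content units with an LLM,
# followed by mean absolute error (MAE) against a human consensus score.
# Content units, example transcript, and prompt wording below are illustrative
# placeholders, NOT the study's actual 22-unit rubric or prompts.

import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical subset of the 22 content units scored in the task.
CONTENT_UNITS = ["boy", "cookie jar", "stool falling", "mother", "overflowing sink"]

# A single few-shot example pairing a transcript with its unit-level scores.
FEW_SHOT_EXAMPLE = (
    'Transcript: "the boy is on a stool reaching for the cookie jar"\n'
    'Scores: {"boy": 1, "cookie jar": 1, "stool falling": 0, "mother": 0, '
    '"overflowing sink": 0}'
)

def score_transcript(transcript: str, model: str = "gpt-4o") -> dict[str, int]:
    """Ask the model to mark each content unit present (1) or absent (0)."""
    prompt = (
        "Score this Cookie Theft picture description. For each content unit, "
        "output 1 if the speaker mentions it, else 0. Reply with JSON only.\n\n"
        f"Content units: {CONTENT_UNITS}\n\n"
        f"Example:\n{FEW_SHOT_EXAMPLE}\n\n"
        f'Transcript: "{transcript}"\nScores:'
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        response_format={"type": "json_object"},  # constrain output to JSON
    )
    return json.loads(resp.choices[0].message.content)

def mean_absolute_error(llm_totals: list[int], human_totals: list[int]) -> float:
    """MAE between per-participant LLM totals and human consensus totals."""
    return sum(abs(a - b) for a, b in zip(llm_totals, human_totals)) / len(llm_totals)
```

In this framing, each participant’s total score is the sum of unit-level 0/1 judgments, and MAE is taken over participants; self-consistency, which the abstract also mentions, would amount to sampling several such scorings per transcript and taking a majority vote per unit.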
