Cost-Performance Evaluation of Large Language Models for Aspect-Based Sentiment Analysis of HCAHPS Patient Comments: A Validation Study

Abstract

Background

Hospital Consumer Assessment of Healthcare Providers and Systems (HCAHPS) free-text comments contain actionable feedback, but timely, scalable, and affordable sentiment analysis remains challenging for health systems that rely on third-party vendors.

Objectives

To evaluate cost-performance tradeoffs between a cost-optimized and a flagship large language model (LLM) for aspect-based sentiment analysis of HCAHPS comments, using human inter-rater agreement as a reproducibility benchmark.

Methods

We analyzed 512 free-text HCAHPS comments collected from two community hospitals in calendar year 2023. Six trained reviewers (medical students, recent medical graduates, and practicing internists) independently assigned positive, negative, or neutral labels to each comment-aspect pair; the majority label among three reviewers formed the consensus reference standard. Two OpenAI models, GPT-5-nano (cost-optimized) and GPT-5 (flagship), were prompted in a zero-shot setting via the OpenAI API. We calculated pairwise Cohen’s κ to establish a human inter-rater baseline, then compared each model’s labels to the consensus using Cohen’s κ, accuracy, and weighted F1, and recorded per-call cost and latency.
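The consensus and agreement steps described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names `majority_label` and `cohen_kappa`, the toy labels, and the choice to return `None` when three votes tie are all assumptions made for illustration, since the abstract does not say how ties were resolved.

```python
from collections import Counter


def majority_label(votes):
    """Return the most frequent label among the votes, or None on a tie.

    Tie handling is an assumption; the source does not describe it.
    """
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items on which both raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    all_labels = set(count_a) | set(count_b)
    expected = sum(count_a[lab] * count_b[lab] for lab in all_labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy data, not taken from the study.
rater_1 = ["positive", "positive", "negative", "negative"]
rater_2 = ["positive", "positive", "negative", "positive"]
print(cohen_kappa(rater_1, rater_2))
print(majority_label(["positive", "positive", "neutral"]))
```

Averaging `cohen_kappa` over every pair of reviewers would give a mean pairwise baseline of the kind reported in the Results; comparing a model's labels with the consensus labels uses the same function.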

Results

Mean human inter-rater agreement was κ = 0.79 (substantial). Both LLMs exceeded this baseline (cost-optimized κ = 0.85; flagship κ = 0.85) with nearly identical accuracy (0.92) and weighted F1 (0.93 vs. 0.93). Performance was strong on the positive (F1 ≈ 0.97) and negative (F1 ≈ 0.90) classes but poor on the underrepresented neutral class (F1 ≤ 0.19). The cost-optimized model processed all 512 comments for $0.04 versus $0.18 for the flagship, a 4.2-fold cost difference with no measurable performance gain.
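A high weighted F1 can coexist with a very low neutral-class F1 because the weighted average scales each class score by its support. The sketch below shows the arithmetic on hypothetical toy labels, not study data; `per_class_f1` and `weighted_f1` are illustrative names, and the implementation is a plain-Python assumption about how the metric is computed, not the authors' pipeline.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """F1 per label, using F1 = 2*TP / (2*TP + FP + FN)."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def weighted_f1(y_true, y_pred):
    """Average of per-class F1 weighted by each class's true support."""
    support = Counter(y_true)
    scores = per_class_f1(y_true, y_pred)
    return sum(scores[lab] * support[lab] for lab in support) / len(y_true)


# Hypothetical toy data: the single neutral item is predicted as positive.
y_true = ["positive"] * 6 + ["negative"] * 3 + ["neutral"]
y_pred = ["positive"] * 6 + ["negative"] * 3 + ["positive"]

print(per_class_f1(y_true, y_pred))
print(weighted_f1(y_true, y_pred))
```

In this toy case the neutral class scores zero yet carries only one tenth of the weight, so the weighted score stays high. Reporting per-class F1 alongside the weighted figure, as the Results do, is what exposes the weak class.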

Conclusions

Commercially available LLMs can perform aspect-based sentiment analysis on HCAHPS comments at human-level reproducibility, with the cost-optimized tier sufficient for routine classification. This offers health systems a rapid, scalable, low-cost alternative to vendor-based patient-experience analytics.
