Cost-Performance Evaluation of Large Language Models for Aspect-Based Sentiment Analysis of HCAHPS Patient Comments: A Validation Study

Khalid Nawab
Gretchen Ramsey
Samina Asfandiyar
Sayuj Atreya
Shadi Hijjawi
Sharatkumar Rokkam
Usman Ghayur
Akarshana Rajesh
Ihtesham Yousuf
Zefaf Ali Shah
Amit Kumar Misra
Madhushan Ponnala
Tauseef Hamid
Richard Schreiber

Abstract

Background

Hospital Consumer Assessment of Healthcare Providers and Systems (HCAHPS) free-text comments contain actionable feedback, but timely, scalable, and affordable sentiment analysis remains challenging for health systems that rely on third-party vendors.

Objectives

To evaluate cost-performance tradeoffs between a cost-optimized and a flagship large language model (LLM) for aspect-based sentiment analysis of HCAHPS comments, using human inter-rater agreement as a reproducibility benchmark.

Methods

We analyzed 512 free-text HCAHPS comments collected from two community hospitals in calendar year 2023. Six trained reviewers (medical students, recent medical graduates, and practicing internists) independently assigned positive, negative, or neutral labels to each comment-aspect pair; the majority label among three reviewers formed the consensus reference standard. Two OpenAI models — GPT-5-nano (cost-optimized) and GPT-5 (flagship) — were prompted in a zero-shot setting via the OpenAI API. We calculated pairwise Cohen’s κ to establish a human inter-rater baseline, then compared each model’s labels to the consensus using Cohen’s κ, accuracy, weighted F1, and per-call cost and latency.
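The agreement statistics described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' code: the function names, the tie-handling rule (return `None` when no label has a strict majority), and the toy reviewer labels are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def majority_label(votes):
    """Return the label chosen by most reviewers, or None if the top two tie.

    Tie handling is an assumption; the source does not say how ties were resolved.
    """
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(a)
    # Observed agreement: fraction of items on which both raters agree.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in set(count_a) | set(count_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def mean_pairwise_kappa(ratings):
    """Average Cohen's kappa over every pair of reviewers.

    ratings maps a reviewer name to that reviewer's list of labels.
    """
    pairs = list(combinations(ratings.values(), 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)


# Toy labels for six comment-aspect pairs (hypothetical, not study data).
reviewers = {
    "r1": ["positive", "positive", "negative", "neutral", "negative", "positive"],
    "r2": ["positive", "positive", "negative", "negative", "negative", "positive"],
    "r3": ["positive", "negative", "negative", "neutral", "negative", "positive"],
}
consensus = [majority_label(votes) for votes in zip(*reviewers.values())]
human_baseline = mean_pairwise_kappa(reviewers)
```

The same `cohen_kappa` function can then be applied to a model's labels against `consensus`, which puts the model and the human baseline on one scale.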

Results

Mean human inter-rater agreement was κ = 0.79 (substantial). Both LLMs exceeded this baseline (cost-optimized κ = 0.85; flagship κ = 0.85) with nearly identical accuracy (0.92) and weighted F1 (0.93 vs. 0.93). Performance was strong on positive (F1 ≈ 0.97) and negative (F1 ≈ 0.90) classes but poor on the underrepresented neutral class (F1 ≤ 0.19). The cost-optimized model processed all 512 comments for $0.04 versus $0.18 for the flagship — a 4.2-fold cost difference without measurable performance gain.
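One way to see how a high weighted F1 can coexist with a very low neutral-class F1 is to compute both on an imbalanced toy set. The sketch below is illustrative only: the class counts and predictions are invented for the example and are not the study's data, and the helper names are hypothetical.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return a dict mapping each label to its F1 score."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def weighted_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighting each class by its true support."""
    support = Counter(y_true)
    scores = per_class_f1(y_true, y_pred)
    return sum(scores[label] * support[label] for label in support) / len(y_true)


# Invented, imbalanced toy set: neutral is only 5 of 100 items.
y_true = ["positive"] * 60 + ["negative"] * 35 + ["neutral"] * 5
# The classifier gets every positive and negative right but never predicts neutral.
y_pred = ["positive"] * 60 + ["negative"] * 35 + ["positive"] * 3 + ["negative"] * 2

scores = per_class_f1(y_true, y_pred)
overall = weighted_f1(y_true, y_pred)
```

Because the weighting follows class support, a rare class contributes little to the weighted average, so per-class scores are worth reporting alongside it.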

Conclusions

Commercially available LLMs can perform aspect-based sentiment analysis on HCAHPS comments at human-level reproducibility, with the cost-optimized tier sufficient for routine classification. This offers health systems a rapid, scalable, low-cost alternative to vendor-based patient-experience analytics.

Version published to 10.64898/2026.06.11.26355494 on medRxiv
Jun 15, 2026

Related articles

Relationship Extraction for Adverse Drug Events in Clinical Notes Using Large Language Models

This article has 10 authors:
1. Joseph M Plasek
2. Yiming Li
3. Mary G Amato
4. Dinah Foer
5. Diane L. Seger
6. Shayma Alzaidi
7. Huiyuan Zhou
8. Gretchen Purcell Jackson
9. David W Bates
10. Li Zhou
Latest version Jun 1, 2026

Performance of Google NotebookLM for AI-assisted data extraction and consensus statement generation in a heterogenous systematic review on inflammatory bowel disease, obesity, and cardiometabolic comorbidities: A Methodological Report

This article has 11 authors:
1. Sami Samaan
2. Jalpa Devi
3. Matthew Vincent
4. Shannon Coombs
5. Priya Sehgal
6. Mouhand Mouhamed
7. Victoria Rai
8. Amanda M. Johnson
9. Andres J. Yarur
10. Edward L. Barnes
11. Parakkal Deepak
Latest version Jun 26, 2026

NigBench: A multilingual point-of-care medical query benchmarking study of large language models in Nigeria

This article has 18 authors:
1. Tobi Olatunji
2. Chinemelu Aka
3. Chibuzor Okocha
4. Emmanuel Ayodele
5. Jennifer Orisakwe
6. Toni Adekunle
7. Mardhiyah Sanni
8. Abdulameed Abiola
9. Tassallah Abdullahi
10. Oluwatomi Owopetu
11. Tolu Afolaranmi
12. Peter Suoyo Yougha
13. Mira Emmanuel-Fabula
14. Vaishnavi Menon
15. Alastair Denniston
16. Xiao Liu
17. Gwydion Williams
18. Bilal A. Mateen
Latest version Jul 10, 2026
