Can Modern NLP Systems Reliably Annotate Chest Radiography Exams? A Pre-Purchase Evaluation and Comparative Study of Solutions from AWS, Google, Azure, John Snow Labs, and Open-Source Models on an Independent Pediatric Dataset
Abstract
This study compares four commercial clinical NLP tools (Amazon Comprehend Medical, Google Healthcare NLP, Azure Clinical NLP, and SparkNLP) with the dedicated radiograph labelers CheXpert and CheXbert for pediatric chest radiograph (CXR) report labeling. Using 95,008 pediatric CXR reports from a large academic hospital, we extracted entities and their assertion statuses (positive, negative, uncertain) from the findings and impressions, mapped them to 13 categories (12 disease categories plus a No Findings category), and compared performance using Fleiss' kappa and accuracy against a pseudo-ground truth. Entity extraction varied widely: SparkNLP extracted 49,688 unique entities, Azure 31,543, AWS 27,216, and Google 16,477. Assertion accuracy ranged from 50% (AWS) to 76% (SparkNLP), while CheXpert and CheXbert achieved 56%. These results reveal substantial performance variability and underscore the need for validation and careful review before deploying NLP tools for pediatric clinical report labeling.
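To illustrate the comparison described above, the following minimal Python sketch shows how per-report assertion labels from several tools could be scored with Fleiss' kappa and accuracy against a pseudo-ground truth. It is not the authors' pipeline: the `labels` and `pseudo_truth` arrays are randomly generated placeholders standing in for tool outputs on a single disease category, encoded as 0 = negative, 1 = uncertain, 2 = positive.

# Minimal sketch, assuming placeholder label arrays rather than real tool output.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(0)
n_reports, n_tools = 1000, 4                              # e.g. AWS, Google, Azure, SparkNLP
labels = rng.integers(0, 3, size=(n_reports, n_tools))    # placeholder assertion labels per tool
pseudo_truth = rng.integers(0, 3, size=n_reports)         # placeholder pseudo-ground truth

# Inter-tool agreement: convert rater codes to per-report category counts,
# then compute Fleiss' kappa over all reports.
table, _ = aggregate_raters(labels)
kappa = fleiss_kappa(table, method="fleiss")

# Per-tool assertion accuracy against the pseudo-ground truth.
accuracy = (labels == pseudo_truth[:, None]).mean(axis=0)

print(f"Fleiss' kappa across tools: {kappa:.3f}")
print("Per-tool accuracy:", np.round(accuracy, 3))

In the study itself, this kind of agreement and accuracy computation would be repeated for each of the 13 mapped categories rather than a single one.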