External Validation of an Artificial Intelligence Triaging System for Chest X-Rays: A Retrospective Independent Clinical Study

Abstract

Background: Chest radiography (CXR) is the most frequently performed radiological examination worldwide, but reporting backlogs, driven by a shortage of radiologists, remain a critical challenge in emergency care. Artificial intelligence (AI) triage systems can help alleviate this challenge by differentiating normal from abnormal studies and prioritizing urgent cases for review. This study aimed to externally validate TRIA, a commercial AI-powered CXR triage algorithm (NeuralMed, São Paulo, Brazil).

Methods: TRIA employs a two-stage deep learning approach, comprising an image segmentation module that isolates the thoracic region, followed by a classification model trained to recognize common cardiopulmonary pathologies. The system was trained on 275,399 CXRs from multiple public and private datasets. External validation was performed retrospectively on 1045 CXRs (568 normal and 477 abnormal) from a university teaching hospital not included in training. Ground truth was established using a large language model (LLM) to extract findings from the original radiologist reports. An independent radiologist review of a 300-report subset confirmed the reliability of this method, achieving an accuracy of 0.98 (95% CI 0.978–0.988). Four ensemble decision strategies for abnormality detection were compared. Performance metrics included sensitivity, specificity, accuracy, and area under the receiver operating characteristic curve (AUROC) with 95% CIs.

Results: The general abnormality classifier achieved strong performance (AUROC 0.911). Individual pathology models for cardiomegaly, pneumothorax, and effusion showed excellent results (AUROC of 0.968, 0.955, and 0.935, respectively). The weighted ensemble demonstrated the best balance, with an accuracy of 0.854 (95% CI, 0.831–0.874), a sensitivity of 0.845 (0.810–0.875), a specificity of 0.861 (0.830–0.887), and an AUROC of 0.927 (0.911–0.940). Sensitivity-prioritized strategies achieving sensitivity >0.92 produced lower specificity (<0.69). False negatives were mainly subtle or equivocal cases, although many were still flagged as abnormal by the general classifier.

Conclusions: TRIA achieved robust and balanced accuracy in distinguishing normal from abnormal CXRs. Integrating this system into clinical workflows has the potential to reduce reporting delays, prioritize urgent cases, and improve patient safety. These findings support its clinical utility and warrant prospective multicenter validation.
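The abstract does not specify how the weighted ensemble combines the per-pathology models with the general abnormality classifier. The sketch below illustrates one plausible interpretation: each model emits an abnormality probability, and the triage call is a weighted mean compared against a threshold, with sensitivity and specificity computed as in the reported metrics. All weights, probabilities, and the threshold are illustrative assumptions, not the published TRIA configuration.

```python
def weighted_ensemble(probs: dict[str, float],
                      weights: dict[str, float],
                      threshold: float = 0.5) -> bool:
    """Flag a study as abnormal if the weighted mean probability
    across the component models meets or exceeds the threshold.
    Weights and threshold here are hypothetical, not TRIA's."""
    total_w = sum(weights[name] for name in probs)
    score = sum(probs[name] * weights[name] for name in probs) / total_w
    return score >= threshold


def sensitivity_specificity(y_true: list[int], y_pred: list[int]) -> tuple[float, float]:
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP),
    with 1 = abnormal and 0 = normal."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    return tp / (tp + fn), tn / (tn + fp)


# Toy case with made-up model outputs; up-weighting the general
# classifier reflects one way a "weighted ensemble" might balance
# sensitivity and specificity, as the abstract describes.
weights = {"general": 2.0, "cardiomegaly": 1.0, "pneumothorax": 1.0, "effusion": 1.0}
case = {"general": 0.90, "cardiomegaly": 0.12, "pneumothorax": 0.05, "effusion": 0.64}
print(weighted_ensemble(case, weights))  # → True (flagged abnormal)
```

Raising the threshold (or re-weighting toward the general classifier) trades sensitivity for specificity, which is the trade-off the abstract reports between the balanced weighted ensemble and the sensitivity-prioritized strategies.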
