Failure-Aware Robustness Evaluation of Deep Learning Models for Tuberculosis Detection Under Real-World Chest X-Ray Degradation
Abstract
Background: Deep learning–based systems have demonstrated promising performance for automated tuberculosis (TB) detection from chest X-ray (CXR) images and are increasingly proposed for large-scale screening. However, most evaluations rely on high-quality, curated images and do not adequately represent the degraded imaging conditions encountered in routine clinical practice, particularly in resource-limited settings. This study presents a failure-aware robustness evaluation of convolutional neural network (CNN) models for TB detection under realistic CXR degradation scenarios.

Results: Three CNN architectures (ResNet-50, DenseNet-121, and MobileNetV2) were evaluated using two publicly available TB CXR datasets comprising approximately 800 images. Clinically relevant image degradations, including Gaussian noise, motion blur, compression artifacts, reduced contrast, and spatial resolution loss, were synthetically applied to the test data only. All models exhibited statistically significant performance degradation under adverse conditions. Motion blur was the most detrimental artifact, causing sensitivity reductions of up to 21%. Confidence calibration also deteriorated substantially, with expected calibration error increasing from approximately 0.04 on clean images to over 0.10 under degraded conditions.

Conclusions: The findings demonstrate that AI-based TB detection models are vulnerable to silent failure when deployed under realistic imaging conditions. Robustness and calibration evaluation under degraded inputs should be considered a prerequisite for the responsible clinical deployment of AI-assisted TB screening systems, particularly in resource-constrained environments.
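
As an illustration of the kind of evaluation described above, the following is a minimal NumPy sketch of four of the named degradations and of a binned expected calibration error (ECE) for a binary TB classifier. It is not the authors' pipeline: the function names, the severity parameters (`sigma`, `length`, `factor`, `scale`), the 0.5 decision threshold, and the choice of 10 equal-width bins are all assumptions made for illustration, since the abstract does not specify them. Compression artifacts are omitted because simulating them requires an image codec library.

```python
import numpy as np

# All names and default severities below are hypothetical; images are assumed
# to be 2-D grayscale arrays scaled to [0, 1].

def add_gaussian_noise(img, sigma=0.05, rng=None):
    """Add zero-mean Gaussian noise and clip back to [0, 1]."""
    rng = np.random.default_rng(0) if rng is None else rng
    noisy = img + rng.normal(0.0, sigma, size=img.shape)
    return np.clip(noisy, 0.0, 1.0)

def horizontal_motion_blur(img, length=9):
    """Blur each row with a uniform 1-D kernel (simple linear motion model).

    Assumes `length` is odd so the output keeps the input width.
    """
    kernel = np.ones(length) / length
    half = length // 2
    padded = np.pad(img, ((0, 0), (half, half)), mode="edge")
    return np.stack([np.convolve(row, kernel, mode="valid") for row in padded])

def reduce_contrast(img, factor=0.5):
    """Pull pixel values toward the image mean; factor < 1 lowers contrast."""
    mean = img.mean()
    return np.clip(mean + factor * (img - mean), 0.0, 1.0)

def reduce_resolution(img, scale=4):
    """Downsample by striding, then upsample by pixel repetition."""
    low = img[::scale, ::scale]
    up = np.repeat(np.repeat(low, scale, axis=0), scale, axis=1)
    return up[: img.shape[0], : img.shape[1]]

def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE for binary predictions.

    `probs` are predicted probabilities of the positive (TB) class. Confidence
    is the probability assigned to the predicted class, so it lies in [0.5, 1].
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    preds = (probs >= 0.5).astype(int)
    conf = np.where(preds == 1, probs, 1.0 - probs)
    correct = (preds == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            # Weight each bin's |accuracy - confidence| gap by its share of samples.
            gap = abs(correct[in_bin].mean() - conf[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece
```

In a setup like the one the abstract describes, such transforms would be applied to held-out test images only, and sensitivity and ECE would then be compared between the clean and degraded copies of the same test set.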