Integrating multi-scale cross-attention and graph-guided label reasoning for multi-label chest X-ray classification

Abstract

Multi-label chest X-ray (CXR) classification is challenging because abnormalities span a diverse set of spatial scales and disease labels are strongly interdependent. We develop a visual–semantic framework that jointly models multi-scale visual fusion and label-prior-guided decoding. The visual encoder has two parallel branches: a Vision Transformer (ViT) branch captures global anatomical context, whereas a DenseNet-121 branch extracts local texture cues from intermediate convolutional stages. We align and fuse the two representations using a multi-scale bidirectional cross-attention module. To model label dependencies more explicitly, we build a label graph from semantic label embeddings and training-set co-occurrence statistics and apply a graph convolutional network (GCN) to generate label embeddings that initialize the Transformer decoder's label queries. On ChestX-ray14 and CheXpert, we achieve mean areas under the ROC curve (AUCs) of 0.849 and 0.815, respectively. Qualitative visualizations further indicate better alignment between label queries and disease-relevant regions in selected examples. Overall, our results suggest that integrating global and local visual evidence with explicit label priors improves multi-label CXR classification.
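The abstract describes three components: a dual-branch encoder, bidirectional cross-attention fusion, and a GCN that initializes the decoder's label queries. The following is a minimal PyTorch sketch of how these pieces could be wired together; the class names, tensor dimensions, the use of nn.MultiheadAttention for each attention direction, and the two-layer GCN over a row-normalized co-occurrence matrix are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn

    class CrossAttentionFusion(nn.Module):
        # Bidirectional cross-attention between global (ViT) tokens and local (CNN) tokens.
        def __init__(self, dim, num_heads=8):
            super().__init__()
            self.global_to_local = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.local_to_global = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm_g = nn.LayerNorm(dim)
            self.norm_l = nn.LayerNorm(dim)

        def forward(self, g_tokens, l_tokens):
            # Each branch queries the other; residual connections preserve branch-specific evidence.
            g_att, _ = self.global_to_local(g_tokens, l_tokens, l_tokens)
            l_att, _ = self.local_to_global(l_tokens, g_tokens, g_tokens)
            return self.norm_g(g_tokens + g_att), self.norm_l(l_tokens + l_att)

    class LabelGCN(nn.Module):
        # Two-layer GCN over a label co-occurrence graph; outputs one embedding per label.
        def __init__(self, in_dim, hidden_dim, out_dim):
            super().__init__()
            self.fc1 = nn.Linear(in_dim, hidden_dim)
            self.fc2 = nn.Linear(hidden_dim, out_dim)

        def forward(self, label_emb, adj):
            # label_emb: (num_labels, in_dim) semantic embeddings
            # adj: (num_labels, num_labels) normalized co-occurrence matrix
            h = torch.relu(adj @ self.fc1(label_emb))
            return adj @ self.fc2(h)

    class LabelQueryDecoder(nn.Module):
        # Transformer decoder whose label queries are initialized from the GCN output.
        def __init__(self, dim, num_heads=8, num_layers=2):
            super().__init__()
            layer = nn.TransformerDecoderLayer(dim, num_heads, batch_first=True)
            self.decoder = nn.TransformerDecoder(layer, num_layers)
            self.classifier = nn.Linear(dim, 1)

        def forward(self, label_queries, visual_tokens):
            # label_queries: (num_labels, dim); visual_tokens: (B, N, dim)
            q = label_queries.unsqueeze(0).expand(visual_tokens.size(0), -1, -1)
            out = self.decoder(q, visual_tokens)        # (B, num_labels, dim)
            return self.classifier(out).squeeze(-1)     # (B, num_labels) logits

    # Toy forward pass with random tensors standing in for ViT / DenseNet-121 features.
    B, dim, num_labels = 2, 256, 14
    fusion = CrossAttentionFusion(dim)
    gcn = LabelGCN(in_dim=300, hidden_dim=512, out_dim=dim)   # e.g. 300-d word embeddings
    decoder = LabelQueryDecoder(dim)

    vit_tokens = torch.randn(B, 196, dim)                     # global branch tokens
    cnn_tokens = torch.randn(B, 49, dim)                      # local branch tokens, projected to dim
    g, l = fusion(vit_tokens, cnn_tokens)
    visual_tokens = torch.cat([g, l], dim=1)

    adj = torch.rand(num_labels, num_labels)
    adj = adj / adj.sum(dim=1, keepdim=True)                  # placeholder row-normalized graph
    label_queries = gcn(torch.randn(num_labels, 300), adj)
    logits = decoder(label_queries, visual_tokens)            # (2, 14), one logit per label

In the full model, the fused tokens would come from the actual ViT and DenseNet-121 stages at several scales, and the adjacency matrix from training-set co-occurrence counts rather than random values; the tensors above only illustrate the shapes involved.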
