Integrating multi-scale cross-attention and graph-guided label reasoning for multi-label chest X-ray classification
Abstract
Multi-label chest X-ray (CXR) classification is challenging because abnormalities span a diverse set of spatial scales and disease labels are strongly interdependent. We develop a visual–semantic framework that jointly models multi-scale visual fusion and label-prior-guided decoding. The visual encoder has two parallel branches: a Vision Transformer (ViT) branch captures global anatomical context, while a DenseNet-121 branch extracts local texture cues from intermediate convolutional stages. We align and fuse the two representations with a multi-scale bidirectional cross-attention module. To model label dependencies explicitly, we build a label graph from semantic label embeddings and training-set co-occurrence statistics, then apply a graph convolutional network (GCN) to generate label embeddings that initialize the Transformer decoder's label queries. On ChestX-ray14 and CheXpert, our method achieves mean areas under the ROC curve (AUCs) of 0.849 and 0.815, respectively. Qualitative visualizations further show closer alignment between label queries and disease-relevant regions in selected examples. Overall, these results suggest that integrating global and local visual evidence with explicit label priors improves multi-label CXR classification.
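The abstract's two core components can be illustrated in code. The sketch below is not the authors' implementation; all module names, dimensions, and the single-layer depths are illustrative assumptions. It shows (1) bidirectional cross-attention fusing global (ViT-style) and local (CNN-style) token sequences, and (2) a one-layer GCN over a label adjacency matrix whose output embeddings serve as the Transformer decoder's label queries.

```python
# Hedged sketch (not the paper's code) of bidirectional cross-attention
# fusion plus GCN-initialized label queries for multi-label classification.
import torch
import torch.nn as nn


class BidirectionalCrossAttention(nn.Module):
    """Each branch's tokens attend to the other branch's tokens."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.g2l = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.l2g = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, g, l):
        # g: global (ViT) tokens (B, Ng, dim); l: local (CNN) tokens (B, Nl, dim)
        g_fused, _ = self.g2l(g, l, l)  # global queries attend to local keys/values
        l_fused, _ = self.l2g(l, g, g)  # local queries attend to global keys/values
        return torch.cat([g + g_fused, l + l_fused], dim=1)


class LabelGCN(nn.Module):
    """One graph-convolution step over a (normalized) label co-occurrence graph."""

    def __init__(self, num_labels, dim, adj):
        super().__init__()
        self.emb = nn.Parameter(torch.randn(num_labels, dim))  # semantic label embeddings
        self.register_buffer("adj", adj)  # assumed row-normalized adjacency
        self.proj = nn.Linear(dim, dim)

    def forward(self):
        return torch.relu(self.adj @ self.proj(self.emb))  # (num_labels, dim)


class Classifier(nn.Module):
    def __init__(self, num_labels=14, dim=64, adj=None):
        super().__init__()
        adj = adj if adj is not None else torch.eye(num_labels)  # placeholder graph
        self.fuse = BidirectionalCrossAttention(dim)
        self.gcn = LabelGCN(num_labels, dim, adj)
        layer = nn.TransformerDecoderLayer(dim, 4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=1)
        self.head = nn.Linear(dim, 1)  # per-label logit

    def forward(self, g, l):
        memory = self.fuse(g, l)                                   # fused visual tokens
        queries = self.gcn().unsqueeze(0).expand(g.size(0), -1, -1)  # label queries
        return self.head(self.decoder(queries, memory)).squeeze(-1)  # (B, num_labels)


model = Classifier()
# e.g. 7x7 ViT tokens and 14x14 CNN tokens, already projected to a shared dim
logits = model(torch.randn(2, 49, 64), torch.randn(2, 196, 64))
print(logits.shape)  # torch.Size([2, 14])
```

In this sketch the GCN output is shared across the batch and expanded into one query per disease label, so each decoder output position corresponds to one label's logit; a real system would use co-occurrence statistics for `adj` and pretrained text embeddings for `emb`.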