A Multimodal Attention-Based Multi-Instance Learning Framework for Fair and Interpretable Pediatric Teledermatology
Abstract
Purpose: Pediatric skin diseases are prevalent yet frequently underdiagnosed in low-resource settings across Sub-Saharan Africa due to limited access to specialized dermatological care. This study examines whether a subject-level multimodal learning framework can improve diagnostic accuracy, interpretability, and fairness in pediatric teledermatology across diverse skin types.

Methods: A subject-level multimodal multi-instance learning framework is developed in which each patient is represented as a bag of clinical images, with visual features integrated alongside demographic and clinical metadata. A gated attention mechanism aggregates heterogeneous image instances into interpretable subject-level representations, while multimodal fusion supplies contextual information for diagnosis. The framework is evaluated on the PASSION pediatric dermatology dataset across four common skin conditions. Ablation studies and statistical analyses assess the contributions of attention-based aggregation and multimodal fusion, and fairness is evaluated across Fitzpatrick skin types.

Results: The proposed framework achieves an overall classification accuracy of 82.8% and a macro F1-score of 0.81. Ablation results show that gated attention-based aggregation significantly outperforms naive pooling strategies, while multimodal fusion further improves diagnostic robustness. Fairness analysis indicates stable performance across Fitzpatrick skin types.

Conclusion: Subject-level multimodal learning provides a robust, interpretable, and equitable approach to AI-assisted pediatric teledermatology, with strong potential to improve diagnostic access and quality of care in low-resource clinical environments.
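To make the aggregation step concrete: in attention-based multi-instance learning, each patient's bag of image embeddings is collapsed into one subject-level vector by a learned, gated attention weighting, and the result can then be fused with a metadata vector. The abstract does not give the exact formulation used, so the sketch below follows the standard gated attention pooling commonly used in attention-based deep MIL; the projection matrices `V`, `U`, the scoring vector `w`, and the metadata vector are illustrative assumptions, not the authors' actual parameters.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def gated_attention_pool(H, V, U, w):
    """Aggregate a bag of K instance embeddings H (K x d) into a single
    subject-level vector via gated attention.
    V, U: (d x L) projections; w: (L,) scoring vector (all hypothetical).
    Returns (bag_embedding, per-instance attention weights)."""
    gate = np.tanh(H @ V) * (1.0 / (1.0 + np.exp(-(H @ U))))  # gated features, K x L
    scores = gate @ w                                          # one score per instance, K
    a = softmax(scores)                                        # attention weights sum to 1
    z = a @ H                                                  # attention-weighted bag embedding, d
    return z, a

# Toy "patient": a bag of 3 image embeddings of dimension 8.
rng = np.random.default_rng(0)
K, d, L = 3, 8, 4
H = rng.normal(size=(K, d))
V, U = rng.normal(size=(d, L)), rng.normal(size=(d, L))
w = rng.normal(size=L)
z, a = gated_attention_pool(H, V, U, w)

# Simple late fusion by concatenation with a (hypothetical) 5-dim
# demographic/clinical metadata vector, ready for a downstream classifier.
meta = rng.normal(size=5)
fused = np.concatenate([z, meta])
```

The attention weights `a` are what make the aggregation interpretable: they indicate which images in a patient's bag drove the subject-level prediction, which is the property the ablation study compares against naive (mean or max) pooling.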