Multi-Modal Contextual Reasoning for Neurological Disease Diagnosis with Vision-Language-Tabular Transformers
Abstract
Accurate diagnosis of neurological disorders requires integrating information from medical images, clinical text, and structured patient data. Existing vision-language models struggle to fuse and contextually reason across these modalities, especially for conditions with limited data. To address this, we propose NeuroDiag-VLT (Neuro-Diagnostic Vision-Language-Tabular Transformer), a framework for comprehensive cross-modal understanding and diagnostic inference. NeuroDiag-VLT operates in two stages: multi-modal feature extraction and alignment, which pairs a dedicated tabular encoder with a vision-language backbone, followed by context-aware fusion and instruction tuning. A Context-Aware Fusion Module dynamically models inter-modal interactions, while a Multi-Modal Consistency Loss improves robustness and reduces diagnostic hallucinations. We curate extensive medical training data, including vision-language pairs, clinical text, and synthetic tabular records, as well as an expert-annotated neurological diagnosis dataset for instruction tuning. Experiments show that NeuroDiag-VLT surpasses state-of-the-art medical vision-language models in report generation, abnormality detection, visual question answering, and multi-modal classification. Ablation studies and human evaluation demonstrate the effectiveness of the proposed components and the clinical relevance of the generated explanations, and an efficiency analysis shows that the model achieves strong performance at reduced computational cost.
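The abstract names two components, the Context-Aware Fusion Module and the Multi-Modal Consistency Loss, without specifying their form. The sketch below is one plausible PyTorch reading, assuming gated cross-attention fusion over per-modality embeddings and a pairwise cosine-agreement consistency term; the names, shapes, and hyperparameters (ContextAwareFusion, consistency_loss, dim=512) are hypothetical and not taken from the paper.

# Hypothetical sketch of a context-aware fusion module and a multi-modal
# consistency loss. The abstract does not specify the architecture; this
# assumes cross-attention over per-modality token embeddings and a pairwise
# cosine-agreement regularizer. All names and shapes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContextAwareFusion(nn.Module):
    """Fuses vision, language, and tabular embeddings with gated cross-attention."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(3 * dim, 3), nn.Softmax(dim=-1))
        self.norm = nn.LayerNorm(dim)

    def forward(self, vis, txt, tab):
        # vis/txt/tab: (batch, tokens, dim) embeddings from each encoder.
        ctx = torch.cat([vis, txt, tab], dim=1)  # shared cross-modal context
        # Each modality attends to the joint context (inter-modal interactions).
        vis_f, _ = self.attn(vis, ctx, ctx)
        txt_f, _ = self.attn(txt, ctx, ctx)
        tab_f, _ = self.attn(tab, ctx, ctx)
        pooled = [x.mean(dim=1) for x in (vis_f, txt_f, tab_f)]  # (batch, dim) each
        # Input-dependent gate decides how much each modality contributes.
        w = self.gate(torch.cat(pooled, dim=-1))  # (batch, 3)
        fused = sum(w[:, i : i + 1] * pooled[i] for i in range(3))
        return self.norm(fused)  # (batch, dim)


def consistency_loss(vis_emb, txt_emb, tab_emb):
    """Pairwise cosine-agreement penalty: pulls the modality embeddings of the
    same patient toward a shared representation, discouraging conclusions
    supported by only one modality (one plausible reading of the paper's
    Multi-Modal Consistency Loss)."""
    pairs = [(vis_emb, txt_emb), (vis_emb, tab_emb), (txt_emb, tab_emb)]
    return sum(1.0 - F.cosine_similarity(a, b, dim=-1).mean() for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    fusion = ContextAwareFusion(dim=512)
    vis = torch.randn(2, 49, 512)  # e.g. ViT patch tokens
    txt = torch.randn(2, 32, 512)  # clinical-text tokens
    tab = torch.randn(2, 16, 512)  # embedded tabular fields
    fused = fusion(vis, txt, tab)
    loss = consistency_loss(vis.mean(1), txt.mean(1), tab.mean(1))
    print(fused.shape, loss.item())

Under this reading, the learned gate lets the model down-weight an uninformative modality per patient, while the consistency term penalizes modality embeddings that disagree, which is one way such a loss could reduce diagnoses hallucinated from a single modality.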