CMViT: A Cross-Modal Pretrained Vision Transformer for Simultaneous Caries and Periapical Lesion Detection in Radiographs
Abstract
The subjective interpretation of periapical radiographs for dental caries and apical periodontitis remains a diagnostic challenge, often failing to detect subtle lesions. To address this, we developed an interpretable deep learning framework that leverages a novel cross-modal self-supervised pretraining strategy on 21 587 unlabeled panoramic single-tooth images, followed by fine-tuning on 6 457 expert-annotated periapical radiographs. Our model employed a dual-head Vision Transformer architecture to simultaneously yet separately classify caries and periapical lesions (PLs). Compared to training from scratch, this approach significantly enhanced performance, elevating the F1-score from 0.81 to 0.91, sensitivity from 0.76 to 0.90, and specificity from 0.88 to 0.93. The dual-head design also surpassed a single-head, four-class classifier in disease-level accuracy, with notable improvements in complex cases such as teeth presenting with both conditions. Attention rollout heatmaps confirmed that predictions were based on anatomically plausible regions, supporting the model's clinical interpretability. Our work demonstrated that cross-modal pretraining combined with a task-specific architecture yields a highly accurate and trustworthy tool for tooth-level diagnosis and quality assurance.
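To illustrate the dual-head design described above, the following is a minimal, hypothetical PyTorch sketch, not the authors' implementation: the backbone choice (torchvision's vit_b_16), input resolution, and head dimensions are assumptions made only to show how a single ViT encoder can feed two independent binary classifiers (caries and PL).

```python
# Hypothetical sketch of a dual-head Vision Transformer classifier.
# Backbone, input size, and head widths are illustrative assumptions,
# not the architecture reported in the paper.
import torch
import torch.nn as nn
from torchvision.models import vit_b_16


class DualHeadViT(nn.Module):
    """Shared ViT encoder with two independent binary heads:
    one for caries, one for periapical lesions (PL)."""

    def __init__(self, embed_dim: int = 768):
        super().__init__()
        # In the paper's setting, the backbone would be initialized
        # from the cross-modal self-supervised pretraining stage.
        self.backbone = vit_b_16(weights=None)
        self.backbone.heads = nn.Identity()          # expose the CLS-token embedding
        self.caries_head = nn.Linear(embed_dim, 2)   # caries vs. no caries
        self.pl_head = nn.Linear(embed_dim, 2)       # PL vs. no PL

    def forward(self, x: torch.Tensor):
        feats = self.backbone(x)                     # (B, embed_dim) CLS features
        return self.caries_head(feats), self.pl_head(feats)


# Usage example with dummy single-tooth crops (assumed 224x224 RGB input).
model = DualHeadViT()
images = torch.randn(2, 3, 224, 224)
caries_logits, pl_logits = model(images)
# Each tooth receives two separate predictions, so cases with both
# conditions are represented naturally, unlike a single four-class head.
```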