Differences in Automated Staging of Second and Third Molars: An Autoencoder and Vision Transformer-based Interpretability Analysis

Abstract

Dental age estimation is a critical component of forensic odontology. It is conventionally performed manually by experts on dental orthopantomograms. Deep learning models are the current state of the art in automated dental age estimation. When applying deep learning models in high-stakes settings, such as forensic investigations, understanding the model's behaviour is critical. In this context, we investigate a scenario in which the automated staging of tooth 37 and tooth 38, the left second and third mandibular molars, with a vision transformer (ViT) model displays a striking performance disparity, with accuracies of 0.64 and 0.37, respectively. We explore the ViT's self-attention mechanism in search of the reason for this disparity. To introduce additional transparency, we propose a pipeline consisting of a convolutional autoencoder (AE) trained with triplet loss and a ViT model for classification. This pipeline achieves accuracies of 0.72 and 0.39 on teeth 37 and 38, respectively. We further examine the latent space, image reconstructions, and attention maps, contrasting the two teeth to uncover the reason for the poor performance on tooth 38. Through probing our pipeline, we reveal high intra-class variation in the tooth 38 dataset and demonstrate an approach that can offer increased transparency in deep learning applications on medical image data.
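The abstract mentions training the autoencoder with a triplet loss on its latent embeddings. As a minimal sketch of the standard triplet margin loss (the variable names, toy embeddings, and margin value below are illustrative assumptions, not taken from the article):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss on embedding vectors: pull the anchor
    toward the positive (e.g. a tooth at the same development stage) and
    push it away from the negative (a different stage) by at least `margin`.
    """
    d_pos = np.sum((anchor - positive) ** 2)  # squared distance to positive
    d_neg = np.sum((anchor - negative) ** 2)  # squared distance to negative
    return max(0.0, d_pos - d_neg + margin)

# Toy 3-d embeddings standing in for the AE bottleneck outputs.
a = np.array([0.0, 0.0, 0.0])
p = np.array([0.1, 0.0, 0.0])   # same stage: close to the anchor
n = np.array([2.0, 0.0, 0.0])   # different stage: far from the anchor

well_separated = triplet_loss(a, p, n)  # margin satisfied -> loss 0.0
violating = triplet_loss(a, n, p)       # roles swapped -> positive loss
```

Minimising this loss shapes the AE's latent space so that images of the same stage cluster together, which is what makes the subsequent latent-space inspection (and the downstream ViT classification) interpretable.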