Head and Neck Tumour Segmentation in PET Images: Performance Evaluation of 3D U-Net with Maximum Voting-Based Surrogate Ground Truth
Abstract
Accurate segmentation of head and neck tumours in PET images is critical for effective treatment planning, disease progression monitoring, and radiotherapy. However, obtaining reliable ground truth data remains challenging due to inter- and intra-observer variability. The U-Net, a deep convolutional neural network (DCNN), has demonstrated strong potential for automated segmentation, yet the lack of definitive ground truth for training limits its full effectiveness. This study investigates the performance of a 3D U-Net deep learning framework trained using surrogate ground truth masks created through maximum voting (MV) from manual and semi-automatic segmentations. The analysis utilized the QIN-HEADNECK dataset, comprising a total of 217 lesions across 59 PET/CT scans. Each lesion was annotated by three radiologists, with two trials conducted for each segmentation method, resulting in a total of twelve trials. Three MV-based surrogate masks were then generated: M-MV (manual), SA-MV (semi-automatic), and A-MV (combined). Segmentation performance was assessed using the Dice Similarity Coefficient (DSC). The results revealed that manual and semi-automatic segmentations achieved average DSC scores of 0.75 and 0.91, respectively, when individual trials were compared against their corresponding MV masks. The combined MV (A-MV) produced DSC scores of 0.74 and 0.85 when compared to the MV results from the manual and semi-automatic segmentations, respectively. Among the 3D U-Net models, the framework trained with SA-MV achieved the highest average DSC score of 0.86, matching the model trained with A-MV (0.86) and surpassing the model trained with M-MV (0.83). While the 3D U-Net outperformed manual segmentation (DSC of 0.83 vs. 0.75), its performance was still lower than that of semi-automatic segmentation (DSC of 0.86 vs. 0.91). These findings highlight the reliability of semi-automatic segmentation methods in producing consistent results, despite the added time required for their implementation. The study also indicates that while deep learning models are effective in standardizing and automating processes, their performance can be further enhanced by refining training datasets and pre-processing techniques. Additionally, the incorporation of advanced ground truth generation methods could significantly improve segmentation accuracy and increase the clinical applicability of these models.
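To make the evaluation pipeline described above concrete, the sketch below shows one plausible way to fuse multiple binary annotations into an MV surrogate mask and to score a prediction with the DSC. It is a minimal illustration, not the authors' implementation: the paper's maximum-voting rule is approximated here as a per-voxel majority vote over binary NumPy volumes, and the function names (majority_vote, dice_similarity) and the 32x32x32 toy volumes are assumptions for demonstration only.

```python
import numpy as np

def majority_vote(masks):
    """Fuse binary segmentation masks into a surrogate ground truth.

    masks: list of equally shaped binary arrays (one per annotation trial).
    A voxel is labelled foreground when more than half of the annotations
    mark it as foreground (a simple stand-in for the paper's MV rule).
    """
    votes = np.stack(masks, axis=0).sum(axis=0)
    return (votes > len(masks) / 2).astype(np.uint8)

def dice_similarity(pred, ref, eps=1e-8):
    """Dice Similarity Coefficient: 2|A ∩ B| / (|A| + |B|)."""
    pred, ref = pred.astype(bool), ref.astype(bool)
    intersection = np.logical_and(pred, ref).sum()
    return 2.0 * intersection / (pred.sum() + ref.sum() + eps)

# Toy example: six manual annotations (3 readers x 2 trials) fused into an
# M-MV mask, then scored against a hypothetical 3D U-Net prediction.
rng = np.random.default_rng(0)
manual_masks = [rng.integers(0, 2, size=(32, 32, 32), dtype=np.uint8)
                for _ in range(6)]
m_mv = majority_vote(manual_masks)
prediction = rng.integers(0, 2, size=(32, 32, 32), dtype=np.uint8)
print(f"DSC vs. M-MV: {dice_similarity(prediction, m_mv):.3f}")
```

In the study's setting, the same fusion would be applied separately to the manual trials (M-MV), the semi-automatic trials (SA-MV), and all twelve trials together (A-MV), with the DSC then computed between each candidate segmentation and the chosen surrogate mask.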