HDFF-Net: A Hybrid Dual-Feature Fusion Network with Cross-Modal Attention for Automated Colposcopic Transformation Zone Classification


Abstract

Background

Cervical cancer screening through colposcopy depends critically on accurate classification of the transformation zone (TZ) according to the 2011 IFCPC nomenclature, because TZ type directly governs treatment eligibility (ablative versus excisional). Inter-rater agreement among trained colposcopists for TZ-type assignment is only κ ≈ 0.42–0.60, well below the κ ≥ 0.80 threshold considered reliable for clinical decision-making. Existing automated methods rely on either handcrafted feature pipelines or deep transfer learning in isolation, each with well-documented limitations on small clinical datasets.

Methods

We present HDFF-Net (Hybrid Dual-Feature Fusion Network), a dual-stream deep learning architecture that unifies multi-scale handcrafted texture descriptors with an EfficientNetB0 convolutional backbone through a bidirectional Cross-Modal Attention (CMA) fusion module. The handcrafted branch computes a 24,508-dimensional composite descriptor (MS-GLCM, MS-LBP, MS-HOG, an extended Gabor bank, and first-order statistical features) and applies a novel 1D Squeeze-and-Excitation (SE) attention module for learned feature recalibration. The CNN branch augments EfficientNetB0 with CBAM channel attention. Training incorporates SMOTEENN combined resampling, label-smoothing cross-entropy (ε = 0.1), AdamW optimisation, and three-stage progressive fine-tuning. A two-level stacked ensemble (HDFF-Net + SVM + XGBoost + RF, with a logistic-regression meta-learner trained via 5-fold out-of-fold stacking) is additionally proposed. All experiments were conducted on 366 acetic-acid colposcopic images (TZ1:TZ2:TZ3 = 232:61:73) with a held-out 30% test partition.

Results

HDFF-Net achieves 99.01% test accuracy, a macro-F1 of 98.94%, and a macro-AUC of 0.9985 (DeLong 95% CI: 0.9971–0.9998) on the held-out test set (n = 209). The stacked ensemble achieves 99.28% accuracy and a macro-AUC of 0.9987 (95% CI: 0.9974–0.9999). Both results are statistically superior to the best prior baseline (SVM: 97.13%; McNemar χ² = 18.7, p < 0.001, Bonferroni-corrected). Ablation analysis identifies CMA fusion as the single largest contributor (+1.34 percentage points over the SE-branch-only SVM), confirming the complementarity of handcrafted texture and CNN spatial representations.

Conclusions

HDFF-Net establishes a new state of the art for three-class IFCPC TZ-type classification, achieving an error rate (0.99%) below documented human expert disagreement rates. The GPU-optional SE + handcrafted branch (98.56% accuracy, CPU-only) is particularly relevant for deployment in low-resource clinical settings. The architecture, training strategy, and stacking framework are generalisable to other colposcopy classification tasks.
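The abstract does not specify how the 1D Squeeze-and-Excitation recalibration is applied to the flattened handcrafted descriptor. As a rough illustration of the general idea (a bottleneck projection followed by a sigmoid gate that re-weights each feature), the following NumPy sketch may be helpful; the function name, reduction ratio, and weight shapes are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def se_recalibrate_1d(x, w1, b1, w2, b2):
    """SE-style gating of a 1D handcrafted descriptor (illustrative only).

    x  : (d,)       composite feature vector (e.g. concatenated GLCM/LBP/HOG/Gabor stats)
    w1 : (d//r, d)  bottleneck ("squeeze") weights, reduction ratio r
    w2 : (d, d//r)  expansion ("excitation") weights
    Returns x scaled elementwise by a learned gate in (0, 1).
    """
    z = np.maximum(w1 @ x + b1, 0.0)             # ReLU bottleneck
    gate = 1.0 / (1.0 + np.exp(-(w2 @ z + b2)))  # sigmoid per-feature gate
    return gate * x

# Toy usage: d = 8 features, reduction ratio r = 2
# (the paper's real descriptor is 24,508-dimensional).
rng = np.random.default_rng(0)
d, r = 8, 2
x = rng.normal(size=d)
w1, b1 = rng.normal(size=(d // r, d)), np.zeros(d // r)
w2, b2 = rng.normal(size=(d, d // r)), np.zeros(d)
y = se_recalibrate_1d(x, w1, b1, w2, b2)  # same shape as x, features re-weighted
```

Because the gate lies in (0, 1), each recalibrated feature is attenuated rather than amplified; in the full model these weights would be learned end-to-end with the rest of the network.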
