Training Set Augmentation and Harmonization Enables Radiomic Models to Detect Early Onset of Lung Cancer
Abstract
Radiomics-based machine learning models have the potential to detect lung cancer at inception from CT scans and transform patient outcomes. Low malignancy rates in early-development pulmonary nodules (PNs) and variable image acquisition hinder the development of clinically applicable radiomics-based early detection models. To address these challenges, we augmented training using later-development PNs and harmonized for acquisition effects. We first trained machine learning models to predict PN malignancy using radiomic features from scans of early-development benign and malignant PNs (n = 187) harmonized using ComBat. Observing near-chance performance, we augmented training with later-development benign and malignant PNs (n = 225). We evaluated whether harmonization must incorporate biological differences that impact acquisition effects in the added training data. To correct features for variability in four acquisition parameters, we compared three approaches: 1) harmonizing without biological distinction, 2) harmonizing with a covariate distinguishing the early-development, benign augmentation, and malignant augmentation training datasets, and 3) harmonizing each dataset separately. Models trained using augmented data harmonized without biological distinction failed to improve. Models trained on augmented data harmonized with a covariate (ROC-AUC 0.72 [0.67–0.76]) or separately (ROC-AUC 0.69 [0.63–0.74]) achieved significantly higher test ROC-AUC (DeLong test, adjusted p ≤ 0.05). Our findings lay groundwork for clinically viable radiomics tools harnessing routine screening imaging for lung cancer early detection.
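The three harmonization strategies compared in the abstract can be illustrated with a minimal sketch. This is not ComBat and not the authors' pipeline: it is a simplified location-and-scale adjustment fitted by ordinary least squares, without ComBat's empirical Bayes shrinkage. It also assumes the acquisition parameters have been collapsed into a single batch label. The function names `harmonize` and `harmonize_separately`, the synthetic data, and the effect sizes are all hypothetical and chosen only for illustration.

```python
import numpy as np

def one_hot(labels):
    """Indicator matrix (n_samples, n_levels) for a 1-D array of labels."""
    levels, codes = np.unique(np.asarray(labels).ravel(), return_inverse=True)
    return np.eye(len(levels))[codes]

def harmonize(X, batch, covariate=None):
    """Simplified location/scale batch adjustment (assumption: NOT full ComBat).

    X: (n_samples, n_features) features; batch: acquisition label per sample;
    covariate: optional biological label whose effect should be preserved.
    """
    X = np.asarray(X, dtype=float)
    batch = np.asarray(batch)
    n = X.shape[0]
    # Biological design: intercept plus covariate indicators (first level = reference).
    C = np.ones((n, 1))
    if covariate is not None:
        C = np.hstack([C, one_hot(covariate)[:, 1:]])
    # Batch design: centered indicators so batch effects average to zero over samples.
    B = one_hot(batch)[:, 1:]
    B = B - B.mean(axis=0)
    design = np.hstack([C, B])
    # If covariate and batch are perfectly confounded the design is rank-deficient
    # and lstsq returns a minimum-norm solution, which is not a meaningful split.
    coef, *_ = np.linalg.lstsq(design, X, rcond=None)
    beta = coef[: C.shape[1]]
    resid = X - design @ coef
    # Rescale each batch's residuals to the pooled residual spread.
    pooled_sd = resid.std(axis=0)
    for b in np.unique(batch):
        rows = batch == b
        sd_b = resid[rows].std(axis=0)
        resid[rows] *= pooled_sd / np.where(sd_b > 0, sd_b, 1.0)
    # Keep the biological fit, drop the fitted batch shift.
    return C @ beta + resid

def harmonize_separately(X, batch, dataset):
    """Strategy 3: run the adjustment independently within each dataset."""
    X = np.asarray(X, dtype=float)
    batch, dataset = np.asarray(batch), np.asarray(dataset)
    out = np.empty_like(X)
    for d in np.unique(dataset):
        rows = dataset == d
        out[rows] = harmonize(X[rows], batch[rows])
    return out

# Synthetic illustration: a biological group effect of 3.0 that is partly
# confounded with a batch (acquisition) shift of 5.0.
rng = np.random.default_rng(0)
group = np.repeat([0, 0, 1, 1], [160, 40, 40, 160])  # hypothetical biological label
batch = np.repeat([0, 1, 0, 1], [160, 40, 40, 160])  # hypothetical acquisition label
X = 3.0 * group[:, None] + 5.0 * batch[:, None] + rng.normal(0, 0.5, (400, 2))

no_bio = harmonize(X, batch)                          # strategy 1
with_cov = harmonize(X, batch, covariate=group)       # strategy 2
separate = harmonize_separately(X, batch, dataset=group)  # strategy 3

diff_none = no_bio[group == 1].mean(0) - no_bio[group == 0].mean(0)
diff_cov = with_cov[group == 1].mean(0) - with_cov[group == 0].mean(0)
print("group difference, no biological distinction:", diff_none)
print("group difference, with covariate:", diff_cov)
```

In this toy setting, adjusting without the biological covariate removes part of the true group difference along with the batch shift, because the two are correlated. Including the covariate keeps the group difference close to its simulated value. Whether the same mechanism explains the results reported above is not something this sketch can establish.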