Training Set Augmentation and Harmonization Enables Radiomic Models to Detect Early Onset of Lung Cancer

Abstract

Radiomics-based machine learning models have the potential to detect lung cancer at inception from CT scans and transform patient outcomes. However, low malignancy rates in early-development pulmonary nodules (PNs) and variable image acquisition hinder the development of clinically applicable radiomics-based early detection models. To address these challenges, we augmented training with later-development PNs and harmonized features for acquisition effects. We first trained machine learning models to predict PN malignancy using radiomic features from scans of early-development benign and malignant PNs (n = 187), harmonized using ComBat. Observing near-chance performance, we augmented training with later-development benign and malignant PNs (n = 225). We then evaluated whether harmonization must account for biological differences that impact acquisition effects in the added training data. To correct features for variability in four acquisition parameters, we compared three approaches: 1) harmonizing without biological distinction, 2) harmonizing with a covariate distinguishing the early-development, benign augmentation, and malignant augmentation training datasets, and 3) harmonizing each dataset separately. Models trained on augmented data harmonized without biological distinction failed to improve. Models trained on augmented data harmonized with a covariate (ROC-AUC 0.72 [0.67–0.76]) or separately (ROC-AUC 0.69 [0.63–0.74]) achieved significantly higher test ROC-AUC (DeLong test, adjusted p ≤ 0.05). Our findings lay the groundwork for clinically viable radiomics tools that harness routine screening imaging for early detection of lung cancer.
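
As an illustration of the three harmonization strategies compared above, the following Python sketch may help. It is not the authors' pipeline: it uses a simplified location-scale adjustment in place of ComBat (omitting ComBat's empirical Bayes shrinkage) and a single hypothetical batch label standing in for the four acquisition parameters. The function names, the synthetic data, and the balanced design are all assumptions made for illustration.

```python
import numpy as np


def harmonize(features, batch, covariate=None):
    """Simplified location-scale harmonization (NOT full ComBat).

    features:  (n_samples, n_features) array of radiomic features
    batch:     (n_samples,) acquisition batch labels (hypothetical stand-in)
    covariate: optional (n_samples,) categorical labels whose effect
               should be preserved rather than removed
    """
    features = np.asarray(features, dtype=float)
    batch = np.asarray(batch)
    n = features.shape[0]
    batches = np.unique(batch)

    # One indicator column per batch (acts as a per-batch intercept).
    B = (batch[:, None] == batches[None, :]).astype(float)
    C = np.zeros((n, 0))
    if covariate is not None:
        covariate = np.asarray(covariate)
        levels = np.unique(covariate)
        # Drop the first level to avoid collinearity with the batch columns.
        C = (covariate[:, None] == levels[None, 1:]).astype(float)

    design = np.hstack([B, C])
    beta, *_ = np.linalg.lstsq(design, features, rcond=None)
    beta_batch, beta_cov = beta[: len(batches)], beta[len(batches):]

    resid = features - design @ beta
    pooled_sd = resid.std(axis=0)
    grand_mean = (B @ beta_batch).mean(axis=0)

    out = np.empty_like(features)
    for b in batches:
        m = batch == b
        sd_b = resid[m].std(axis=0)
        sd_b = np.where(sd_b > 0, sd_b, 1.0)  # guard against zero variance
        # Rescale each batch's residuals to the pooled spread.
        out[m] = resid[m] / sd_b * pooled_sd

    # Restore the common mean and the preserved covariate effect.
    return out + grand_mean + C @ beta_cov


def harmonize_separately(features, batch, dataset):
    """Strategy 3: harmonize each dataset on its own."""
    features = np.asarray(features, dtype=float)
    batch, dataset = np.asarray(batch), np.asarray(dataset)
    out = np.empty_like(features)
    for d in np.unique(dataset):
        m = dataset == d
        out[m] = harmonize(features[m], batch[m])
    return out


# Synthetic, balanced toy data (illustrative only, not from the article).
rng = np.random.default_rng(0)
n = 60
batch = np.arange(n) % 2      # two hypothetical acquisition settings
dataset = np.arange(n) % 3    # three hypothetical training datasets
feats = rng.normal(size=(n, 3)) + 2.0 * batch[:, None] + 0.5 * dataset[:, None]

out_plain = harmonize(feats, batch)                   # strategy 1
out_cov = harmonize(feats, batch, covariate=dataset)  # strategy 2
out_sep = harmonize_separately(feats, batch, dataset) # strategy 3
```

In this sketch, strategy 2 differs from strategy 1 only in that the dataset label enters the regression, so its effect is estimated alongside the batch effect and added back after adjustment rather than being absorbed into the batch correction.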