Metabolomic Profiling of Dried Blood Spots for Breast Cancer Detection: A Multi-Classifier Validation Study in 2,734 Participants


Abstract

Background

Breast cancer (BC) remains the most commonly diagnosed malignancy and the leading cause of cancer-related mortality among women worldwide. Although blood-based untargeted metabolomics has emerged as a promising modality for detecting early-stage BC, its clinical translation has been bottlenecked by two unresolved issues: (i) the field has relied almost exclusively on serum or plasma, which require venipuncture and cold-chain logistics, and (ii) machine-learning models reported on such data are frequently validated with protocols that are blind to analytical batch structure, producing optimistically biased performance estimates.

Methods

We conducted a breast cancer detection study using dried blood spots (DBS), a minimally invasive matrix compatible with self-collection and ambient-temperature shipping. A cohort of 2,734 participants (114 biopsy-confirmed BC cases; 2,620 non-cancer controls) was profiled by untargeted LC-MS/MS on a Thermo Scientific Orbitrap IQ-X coupled to a Vanquish UHPLC system. A 39-metabolite panel meeting MSI Level 1 identification criteria [1] was pre-specified from the published breast cancer metabolomics literature, frozen before LC-MS acquisition, and applied to the present cohort without any data-driven feature selection. Six standard supervised-learning architectures (LASSO, Elastic Net, Linear SVM, PLS-DA, OPLS-DA, XGBoost) were evaluated on this panel. OPLS-DA, whose pyopls implementation does not integrate cleanly into the repeated multi-seed batch-aware protocol, is reported only in the sex-matched subgroup analysis, where a single-seed 5-fold stratified protocol permits a directly comparable fit. Per-batch control-median normalization was applied upstream, following the protocol of the companion same-lab study [2], to remove batch-specific intensity shifts at the data-preparation stage; kNN imputation, log transformation, and robust scaling were then fit within each training fold. The evaluation battery comprised: batch-aware StratifiedGroupKFold cross-validation (CV), reported at a single seed (seed=42) with inter-seed SD quantified across 10 independent seeds; batch-aware nested CV; a 100-seed held-out 20%-batch validation with disjoint-batch isotonic probability calibration (30% calibration partition); PPV/NPV reporting at multiple operating points and three deployment prevalences; subgroup analyses by TNM stage and tumor grade; a pathway-ablation sensitivity analysis; and a 1,000-iteration permutation test.

Results

Under batch-aware evaluation (StratifiedGroupKFold, seed=42), AUC ranged from 0.914 to 0.949 across classifiers, with LASSO achieving 0.928 and XGBoost 0.949; inter-seed SD across 10 seeds was 0.002-0.006. At 95% specificity, sensitivity reached 75.4% for LASSO and 81.6% for XGBoost. Held-out batch validation across 100 seeds yielded mean AUC values of 0.912 for Elastic Net and 0.935 for XGBoost, supporting robust generalization across analytical batches. Coefficient stability was high: in the Elastic Net classifier, all 39 panel features received a non-zero multivariate weight in ≥80% of 100 stratified bootstraps. Permutation testing on the three representative classifiers subjected to this analysis (LASSO, Linear SVM, PLS-DA) confirmed statistical significance (p ≤ 0.001 in all three cases). Subgroup analyses showed lower detection performance for stage IIA tumors (AUC 0.87, n=40) than for stage IIB/IIIA (AUC 0.95), suggesting stronger systemic metabolic signatures in more advanced disease.
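To show how sensitivity at a fixed specificity translates into predictive values, the sketch below applies Bayes' rule to the two operating points reported above. The abstract does not state the three deployment prevalences, so the values 0.5% and 1% used here are hypothetical; the third value is simply the case fraction of this cohort (114 of 2,734), which is not a screening prevalence.

```python
def ppv_npv(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) from sensitivity, specificity, and prevalence via Bayes' rule."""
    tp = sensitivity * prevalence                 # true-positive fraction
    fp = (1.0 - specificity) * (1.0 - prevalence)  # false-positive fraction
    fn = (1.0 - sensitivity) * prevalence          # false-negative fraction
    tn = specificity * (1.0 - prevalence)          # true-negative fraction
    return tp / (tp + fp), tn / (tn + fn)


# Operating points taken from the abstract (sensitivity at 95% specificity).
operating_points = {"LASSO": 0.754, "XGBoost": 0.816}
specificity = 0.95

# Prevalences: the first two are hypothetical; the last is the cohort case fraction.
prevalences = [0.005, 0.01, 114 / 2734]

for name, sens in operating_points.items():
    for prev in prevalences:
        ppv, npv = ppv_npv(sens, specificity, prev)
        print(f"{name:8s} prevalence={prev:.4f}  PPV={ppv:.3f}  NPV={npv:.4f}")
```

Because PPV falls steeply as prevalence drops while specificity stays fixed, the same classifier can look very different in a case-enriched cohort and in a screening population.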

Conclusions

In this cohort of diagnosed, pre-treatment breast cancer cases, DBS LC-MS metabolomic profiling demonstrated robust classification performance across multiple classifier families and biological pathways. The DBS matrix is minimally invasive, self-collectable by finger-prick, and compatible with ambient-temperature shipping, making it attractive for decentralized and remote-care settings. This strategy may complement the established venous-blood workflow while addressing important accessibility and logistical barriers identified over nearly a decade of preliminary work [3, 4]. Performance was weaker for stage IIA than for more advanced disease, and prospective validation in an independent asymptomatic screening cohort is required before clinical positioning as a decentralized triage modality.