Plasma proteomic profiles predict chronic obstructive pulmonary disease up to 16 years before onset: a multi-national, machine learning-guided biomarker discovery study
Abstract
Chronic obstructive pulmonary disease (COPD) remains a major public health burden, yet tools for early risk prediction are limited. Using Cox regression and multiple machine learning algorithms, we analyzed plasma proteomic data from 36,906 UK Biobank (UKB) participants and identified nine proteins: GDF15, WFDC2, SCGB1A1, CXCL17, CA14, EDA2R, TNR, AGER, and ODAM. The 9-protein model achieved high accuracy for predicting COPD across different time frames (area under the curve [AUC] = 0.83 overall; 0.86 within 5 years; 0.84 within 10 years; 0.77 beyond 10 years) in a geographically defined UKB testing cohort (n = 15,607), and was further validated in the external EPIC-Norfolk cohort (n = 2,944) with similarly high AUCs. Consistent results were observed in the Southern China cohort (n = 100). Incorporating clinical factors further improved predictive accuracy, with maximum AUCs of 0.89 overall, 0.91 for 5-year prediction, 0.89 for 10-year prediction, and 0.83 for prediction beyond 10 years. Individuals with higher baseline protein levels had a 7.29-fold increased COPD risk, and proteomic alterations were detectable up to 16 years before diagnosis. All nine proteins showed significant positive genetic correlations with COPD, and causal inference analyses further supported roles for CXCL17 and AGER. These findings demonstrate that plasma proteomics enables accurate long-term COPD risk prediction across diverse populations, provides new insights into disease mechanisms, and supports early identification of high-risk individuals for targeted prevention.
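To make the time-frame-specific AUCs above concrete, the following is a minimal sketch of how an AUC can be computed separately for cases diagnosed within different windows after baseline. It is an illustration only, not the study's pipeline: the function names (`auc`, `windowed_auc`), the toy risk scores, and the simplification of comparing cases against participants never diagnosed during follow-up (ignoring censoring) are all assumptions introduced here.

```python
import math
from typing import List, Optional, Sequence


def auc(case_scores: Sequence[float], control_scores: Sequence[float]) -> float:
    """Probability that a random case outscores a random control (ties count 0.5)."""
    if not case_scores or not control_scores:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


def windowed_auc(
    scores: Sequence[float],
    years_to_dx: Sequence[Optional[float]],
    lo: float,
    hi: float,
) -> float:
    """AUC for cases diagnosed in (lo, hi] years versus never-diagnosed participants.

    years_to_dx[i] is None when participant i had no diagnosis during follow-up.
    Simplifying assumption: censoring is ignored.
    """
    cases: List[float] = [
        s for s, t in zip(scores, years_to_dx) if t is not None and lo < t <= hi
    ]
    controls: List[float] = [s for s, t in zip(scores, years_to_dx) if t is None]
    return auc(cases, controls)


# Hypothetical toy data: a model risk score and years from baseline to diagnosis.
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.5, 0.1]
years_to_dx = [2, 4, 8, 12, None, None, None, None]

overall = windowed_auc(scores, years_to_dx, 0, math.inf)
within_5 = windowed_auc(scores, years_to_dx, 0, 5)
from_5_to_10 = windowed_auc(scores, years_to_dx, 5, 10)
beyond_10 = windowed_auc(scores, years_to_dx, 10, math.inf)
```

A full analysis of cohort data would normally use a time-dependent AUC estimator that accounts for censoring; the pairwise comparison here is only meant to show what a window-specific AUC measures.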