Prediction of Chronic Obstructive Pulmonary Disease Using Machine Learning Models
Abstract
Chronic Obstructive Pulmonary Disease (COPD) is a progressive lung condition that restricts airflow and causes breathing problems. It remains a leading cause of mortality worldwide, yet early identification of at-risk individuals is still a challenge. Traditional diagnostic approaches rely on symptomatic assessment or on expensive, often inaccessible clinical tests rather than predictive modeling, which delays intervention. This study explores the potential of machine learning for predicting COPD risk using the All of Us database, which provides diverse health data. From a cohort of 42,941 individuals, we extracted demographic, lifestyle, and clinical features that the literature identifies as relevant to COPD susceptibility. Data preprocessing involved handling missing values, feature selection, and normalization. Feature importance analysis highlighted smoking history, environmental exposures, and comorbidities as key contributors to COPD risk. Several machine learning algorithms, including random forest, multi-layer perceptron, and support vector machine, were trained and validated to assess the predictive performance of our framework. Performance evaluation based on accuracy and area under the receiver operating characteristic curve (AUC-ROC) indicates that the random forest model outperformed conventional statistical methods, with an accuracy of 83% and an AUC-ROC of 0.89. While some prior studies report higher AUC-ROC, those often rely on specialized data (e.g., imaging, genetic, or questionnaire-based inputs) and on small or imbalanced datasets. In contrast, our model achieves competitive performance using a reduced, accessible clinical feature set across a large, diverse cohort. Our findings suggest that machine learning-based predictive models can greatly enhance the early identification of at-risk individuals and allow targeted interventions where needed.
By integrating such predictive analytics into healthcare systems, we hope to shift focus to more proactive risk mitigation in COPD care.
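
As an illustration of the two evaluation metrics named in the abstract, the following minimal sketch computes accuracy and AUC-ROC from predicted risk scores. It is not the authors' code: the function names, the 0.5 decision threshold, and the toy labels and scores are all assumptions made for this example, and none of the numbers are study data. AUC-ROC is computed here through its rank interpretation, the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_score: Sequence[float],
             threshold: float = 0.5) -> float:
    """Fraction of cases whose thresholded score matches the true label.

    The 0.5 threshold is an assumption of this sketch, not a value
    reported in the abstract.
    """
    predictions = [1 if s >= threshold else 0 for s in y_score]
    correct = sum(p == t for p, t in zip(predictions, y_true))
    return correct / len(y_true)


def auc_roc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """AUC-ROC as the probability that a positive outranks a negative.

    Every (positive, negative) pair is compared; ties count as half a win.
    This pairwise form is quadratic in cohort size and is meant only to
    show the definition, not to scale to tens of thousands of cases.
    """
    pos = [s for s, t in zip(y_score, y_true) if t == 1]
    neg = [s for s, t in zip(y_score, y_true) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy, made-up labels (1 = COPD, 0 = no COPD) and model scores; not study data.
labels = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]

print(f"accuracy = {accuracy(labels, scores):.2f}")
print(f"AUC-ROC  = {auc_roc(labels, scores):.2f}")
```

Accuracy depends on the chosen threshold, whereas AUC-ROC summarizes ranking quality across all thresholds, which is one reason the two are commonly reported together.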