Machine Learning Prediction of Incident Hypertension Using Baseline Biomarkers: Evidence from the CHARLS Cohort
Abstract
Background Hypertension poses a significant public health challenge in China and globally, substantially contributing to cardiovascular morbidity and mortality. Early identification of individuals at high risk is essential for effective preventive strategies. This study aimed to develop and validate machine learning (ML) models to predict incident hypertension among middle-aged and older Chinese adults. The predictive models integrated traditional risk factors with novel baseline biomarkers, including C-reactive protein (CRP), uric acid (UA), cystatin C, and the triglyceride-glucose (TyG) index. Survival analysis was also conducted to evaluate the time to hypertension onset.
Methods This longitudinal cohort study analyzed data from 4,948 initially normotensive adults aged ≥ 45 years from the China Health and Retirement Longitudinal Study (CHARLS), with baseline assessments conducted in 2011 and follow-up continuing through 2020. Incident hypertension was defined as a composite outcome: self-reported physician diagnosis, elevated measured blood pressure (systolic ≥ 140 mmHg or diastolic ≥ 90 mmHg), or use of antihypertensive medication. Missing predictor data were handled by multiple imputation. We developed and validated four ML models: Logistic Regression (LR), Random Forest (RF), XGBoost, and Support Vector Machine with a linear kernel (SVM-Linear). All models were trained using repeated 10-fold cross-validation, and their predictive performance was evaluated on an independent test dataset after optimization of classification thresholds, using ROC AUC, accuracy, sensitivity, specificity, F1-score, and Cohen’s Kappa. To enhance model interpretability, SHapley Additive exPlanations (SHAP) values were used to quantify feature importance in the XGBoost model.
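The outcome definition and the TyG index described in the Methods can be written out explicitly. The following is a minimal sketch, not the authors' code: the function and parameter names are hypothetical, and the TyG formula is the commonly used definition, ln(fasting triglycerides [mg/dL] × fasting glucose [mg/dL] / 2), which the abstract itself does not spell out.

```python
import math


def tyg_index(triglycerides_mg_dl: float, glucose_mg_dl: float) -> float:
    """TyG index under the commonly used definition (an assumption here):
    ln(fasting triglycerides [mg/dL] * fasting glucose [mg/dL] / 2)."""
    return math.log(triglycerides_mg_dl * glucose_mg_dl / 2.0)


def incident_hypertension(
    sbp_mmhg: float,
    dbp_mmhg: float,
    self_reported_diagnosis: bool,
    on_antihypertensives: bool,
) -> bool:
    """Composite outcome as stated in the abstract: any one criterion suffices."""
    elevated_bp = sbp_mmhg >= 140 or dbp_mmhg >= 90
    return bool(self_reported_diagnosis or on_antihypertensives or elevated_bp)


# Illustrative values only, not CHARLS data.
example_tyg = tyg_index(150.0, 100.0)
example_case = incident_hypertension(138, 92, False, False)
```

In this illustration the participant counts as an incident case on diastolic pressure alone, because the composite outcome is satisfied by any single criterion.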
Additionally, Kaplan-Meier survival analysis and Cox proportional hazards models were applied to evaluate time-to-event outcomes. For predictors violating the proportional hazards assumption, such as the TyG index and the Center for Epidemiologic Studies Depression Scale (CES-D10) score, time-varying coefficients were incorporated into the Cox models.
Results During a median follow-up of 9.0 years, hypertension developed in 1,851 participants (37.4% of the cohort). Following optimization of classification thresholds, XGBoost outperformed the other models on the independent test set, achieving an area under the receiver operating characteristic curve (AUC) of 0.710, with accuracy, sensitivity, specificity, and F1-score of 0.664, 0.652, 0.671, and 0.592, respectively. Baseline systolic blood pressure, age, TyG index, and body mass index (BMI) were the leading predictors in both the machine learning analyses (quantified by SHAP values for XGBoost) and the traditional Cox regression models. Time-dependent survival analyses showed that elevated baseline TyG index and CES-D10 scores were associated with progressively increasing hazard ratios for incident hypertension over time (P for time interaction < 0.001 for both variables). Kaplan-Meier survival curves also showed significantly lower hypertension-free survival probabilities among participants in the highest quartiles of the TyG index (log-rank P < 0.001) and among those with elevated baseline CRP concentrations (log-rank P < 0.001).
Conclusion Integrating traditional risk factors with novel biomarkers into machine learning algorithms, particularly XGBoost, provided moderate predictive capability for incident hypertension among middle-aged and older Chinese adults. Predictive performance was substantially enhanced by optimizing classification thresholds.
Baseline systolic blood pressure, age, TyG index, and scores from the CES-D10 emerged as key predictors of hypertension onset. Notably, the TyG index and CES-D10 scores demonstrated significant time-dependent effects on hypertension risk, highlighting potential dynamic pathophysiological mechanisms. These findings contribute to risk stratification efforts aimed at early hypertension prevention and provide valuable insights into the temporal dynamics of metabolic and psychological factors in hypertension pathogenesis. Future interventional studies targeting these modifiable risk factors are warranted to confirm their causal roles in hypertension development and inform personalized preventive strategies.
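As a quick consistency check on the XGBoost metrics reported in the Results, accuracy and F1-score can be recomputed from sensitivity, specificity, and outcome prevalence. The sketch below assumes the test set has the same outcome prevalence as the full cohort (37.4%), which the abstract does not state; the function name is hypothetical.

```python
def accuracy_and_f1(sensitivity: float, specificity: float, prevalence: float):
    """Recover accuracy and F1 from sensitivity, specificity, and prevalence.

    Confusion-matrix cells are expressed as fractions of the whole sample.
    """
    tp = prevalence * sensitivity              # true positives
    fn = prevalence * (1 - sensitivity)        # false negatives
    tn = (1 - prevalence) * specificity        # true negatives
    fp = (1 - prevalence) * (1 - specificity)  # false positives
    accuracy = tp + tn
    f1 = 2 * tp / (2 * tp + fp + fn)
    return accuracy, f1


# Reported sensitivity and specificity; prevalence is ASSUMED equal to the
# cohort-wide incidence of 37.4%.
accuracy, f1 = accuracy_and_f1(sensitivity=0.652, specificity=0.671, prevalence=0.374)
print(round(accuracy, 3), round(f1, 3))
```

Under that assumption, the recomputed values agree with the reported accuracy of 0.664 and F1-score of 0.592 to three decimal places, so the four figures are mutually consistent.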