Machine Learning Prediction of Incident Hypertension Using Baseline Biomarkers: Evidence from the CHARLS Cohort

Jingwei Li
Zhongyang Song
Qian Xu
Guoxiong Hao
Fan Zou
Xiali Liang
Xixi Huang
Zexiang Zhang
Zhiming Zhang


Abstract

Background: Hypertension poses a significant public health challenge in China and globally and contributes substantially to cardiovascular morbidity and mortality. Early identification of individuals at high risk is essential for effective prevention. This study aimed to develop and validate machine learning (ML) models to predict incident hypertension among middle-aged and older Chinese adults. The models integrated traditional risk factors with novel baseline biomarkers, including C-reactive protein (CRP), uric acid (UA), cystatin C, and the triglyceride-glucose (TyG) index. Survival analysis was also conducted to evaluate the time-to-event aspect of hypertension onset.

Methods: This longitudinal cohort study analyzed data from 4,948 initially normotensive adults aged ≥ 45 years from the China Health and Retirement Longitudinal Study (CHARLS), with baseline assessments in 2011 and follow-up through 2020. Incident hypertension was defined as a composite outcome of self-reported physician diagnosis, elevated measured blood pressure (systolic ≥ 140 mmHg or diastolic ≥ 90 mmHg), or use of antihypertensive medication. Missing predictor data were handled by multiple imputation. We developed and validated four ML models: Logistic Regression (LR), Random Forest (RF), XGBoost, and Support Vector Machine with a linear kernel (SVM-Linear). All models were trained with repeated 10-fold cross-validation, and, after optimization of classification thresholds, their performance was evaluated on an independent test dataset using ROC AUC, accuracy, sensitivity, specificity, F1-score, and Cohen's kappa. To aid interpretability, SHapley Additive exPlanations (SHAP) values were used to quantify feature importance in the XGBoost model. Kaplan-Meier survival analysis and Cox proportional hazards models were applied to evaluate time-to-event outcomes. For predictors that violated the proportional hazards assumption, such as the TyG index and the 10-item Center for Epidemiologic Studies Depression Scale (CES-D10) score, time-varying coefficients were incorporated into the Cox models.

Results: During a median follow-up of 9.0 years, hypertension developed in 1,851 participants (37.4% of the cohort). After optimization of classification thresholds, XGBoost outperformed the other models on the independent test set, achieving an area under the receiver operating characteristic curve (AUC) of 0.710, with accuracy, sensitivity, specificity, and F1-score of 0.664, 0.652, 0.671, and 0.592, respectively. Baseline systolic blood pressure, age, TyG index, and body mass index (BMI) were the predominant predictors in both the ML analyses (quantified by SHAP values for XGBoost) and the traditional Cox regression models. Time-dependent survival analyses showed that elevated baseline TyG index and CES-D10 scores were associated with progressively increasing hazard ratios for incident hypertension over time (P for time interaction < 0.001 for both variables). Kaplan-Meier curves showed significantly lower hypertension-free survival among participants in the highest quartiles of the TyG index (log-rank P < 0.001) and among those with elevated baseline CRP concentrations (log-rank P < 0.001).

Conclusion: Integrating traditional risk factors with novel biomarkers into ML algorithms, particularly XGBoost, provided moderate predictive capability for incident hypertension among middle-aged and older Chinese adults. Predictive performance was substantially enhanced by optimizing classification thresholds. Baseline systolic blood pressure, age, TyG index, and CES-D10 scores emerged as key predictors of hypertension onset. Notably, the TyG index and CES-D10 scores showed significant time-dependent effects on hypertension risk, pointing to potentially dynamic pathophysiological mechanisms. These findings contribute to risk stratification for early hypertension prevention and offer insight into the temporal dynamics of metabolic and psychological factors in hypertension pathogenesis. Future interventional studies targeting these modifiable risk factors are warranted to confirm their causal roles in hypertension development and to inform personalized preventive strategies.
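
The reported XGBoost test-set metrics can be cross-checked against one another. The sketch below rebuilds a normalized confusion matrix from sensitivity, specificity, and outcome prevalence, then derives accuracy, F1-score, and Cohen's kappa from it. It assumes the test set had the same hypertension prevalence as the full cohort (37.4%), which the abstract does not state; the function name `metrics_from_rates` is hypothetical and used only for illustration.

```python
def metrics_from_rates(sensitivity, specificity, prevalence):
    """Derive accuracy, precision, F1 and Cohen's kappa from summary rates.

    All confusion-matrix cells are expressed as fractions of the sample,
    so no absolute test-set size is needed.
    """
    tp = prevalence * sensitivity                # true positives
    fn = prevalence * (1 - sensitivity)          # false negatives
    tn = (1 - prevalence) * specificity          # true negatives
    fp = (1 - prevalence) * (1 - specificity)    # false positives

    accuracy = tp + tn
    precision = tp / (tp + fp)
    f1 = 2 * tp / (2 * tp + fp + fn)

    # Agreement expected by chance, from the marginal totals.
    expected = (tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)
    kappa = (accuracy - expected) / (1 - expected)

    return {"accuracy": accuracy, "precision": precision,
            "f1": f1, "kappa": kappa}


# Sensitivity and specificity as reported for XGBoost; the prevalence is
# the cohort-level figure and is ASSUMED to hold in the test set as well.
m = metrics_from_rates(sensitivity=0.652, specificity=0.671, prevalence=0.374)
print({name: round(value, 3) for name, value in m.items()})
```

Under that assumption the derived accuracy and F1-score round to 0.664 and 0.592, the values reported above, so the published figures are at least mutually consistent.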
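
The abstract does not say which criterion was used to optimize the classification thresholds. One common choice is Youden's J statistic (sensitivity + specificity - 1); the minimal sketch below applies it to made-up labels and scores purely for illustration. The helper `youden_threshold` and the toy data are assumptions, not the authors' procedure.

```python
def youden_threshold(y_true, scores):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1.

    A case is predicted positive when its score is >= the threshold.
    Ties are resolved in favor of the lowest candidate threshold.
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= t)
        tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < t)
        j = tp / positives + tn / negatives - 1
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


# Toy validation data (illustrative only, not from CHARLS).
y_val = [0, 0, 0, 1, 0, 1, 1, 1]
p_val = [0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.90]

threshold, j = youden_threshold(y_val, p_val)
print(threshold, j)
```

In practice the threshold would normally be chosen on training or validation folds and then applied unchanged to the held-out test set, so that the test metrics are not optimistically biased.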

Version published to 10.21203/rs.3.rs-6824239/v1 on Research Square
Jun 23, 2025

Related articles

Machine Learning Prediction of MACE in Older Chinese Adults Integrating Traditional and Geriatric-Specific Risk Factors: A CHARLS Cohort Analysis

This article has 11 authors:
1. Jingwei Li
2. Zhongyang Song
3. Fan Zou
4. Yiming Hu
5. Qian Xu
6. Guoxiong Hao
7. Xiali Liang
8. Zexiang Zhang
9. Xixi Huang
10. Guanwei Wang
11. Zhiming Zhang
Latest version Jul 22, 2025

Enhancing CVD Risk Prediction: Integrating ECG Signals with Conventional Models Using AI

This article has 6 authors:
1. Maryam Mahdavi
2. Anoshirvan Kazemnejad
3. Abbas Asosheh
4. Davood Khalili
5. Kamyab Hosseinpour
6. Ahmadreza Tajari
Latest version Jun 26, 2025

Prediction of Post-Stroke Depression Using Inflammatory Markers and Functional Status: A Machine Learning Approach

This article has 6 authors:
1. Zhenyi Qin
2. Wei Luo
3. Xianhao Li
4. Mengying Cao
5. Jing Zheng
6. Lixing Dai
Latest version Jul 14, 2025