Design-Aware Predictive and Causal Modeling of Cardiovascular Risk in Chronic Kidney Disease Using Penalized and Double Machine Learning Approaches

Fernando Rojas
Axa Tapia
Hilda Espinoza

Abstract

We develop a design-aware framework that combines penalized prediction and causal inference for finite populations observed through complex survey designs. The framework integrates survey-weighted pseudo-likelihoods, ℓ1-penalized estimation, Neyman-orthogonal moment functions, and a bootstrap procedure that resamples primary sampling units within strata. Methodologically, the contribution is an explicit pipeline that supports design-based inference while separating predictive associations from structurally adjusted effects in high-dimensional, clustered data. We illustrate the framework using data from the Chilean National Health Survey (ENS) 2016–2017 to study the relationship between chronic kidney disease (CKD) and high cardiovascular (CV) risk. In the ENS adult population, the survey-weighted prevalence of CKD was 3.1% (95% CI: 2.4–3.8), and the prevalence of high CV risk was 23.9% (95% CI: 21.5–26.3). High CV risk was markedly more frequent among individuals with CKD than among those without CKD (90.9% versus 21.5%). Predictive and associational analyses combined survey-weighted penalized logistic regression (LASSO) with refitted unpenalized models. In conventional survey-weighted logistic regressions, CKD showed a strong association with high CV risk (odds ratio = 5.66; 95% CI: 2.71–11.82; p < 0.001), and effect sizes remained stable after LASSO-based variable selection. To assess causal relevance under confounding and potential endogeneity, we implemented two endogeneity-aware estimators: two-stage residual inclusion (2SRI) and double/debiased machine learning (DML). The DML estimator, which targets the primary causal estimand, yields an orthogonalized estimate of the average treatment effect of CKD on the probability of high CV risk. After adjustment for age and major cardiometabolic comorbidities, the DML estimate was attenuated and statistically non-significant (average treatment effect = −0.094; 95% CI: −0.409 to 0.220). The 2SRI approach yielded unstable estimates with wide confidence intervals, consistent with the limited effective sample size of CKD cases (approximately 190 CKD cases in a sample of approximately 6233) and weak identification in low-prevalence settings. Simulation experiments under ENS-like complex sampling suggest that naive predictive associations may overestimate the structural contribution of CKD under confounding, whereas orthogonalized estimators yield more conservative estimates when identification holds. The causal interpretation relies on a conditional mean independence assumption given observed covariates and survey design, while control-function specifications are treated as diagnostic sensitivity analyses due to the absence of credible exclusion-based instruments. Overall, the results demonstrate a fundamental divergence between predictive relevance and causal importance in finite-population settings, underscoring the need for design-aware and endogeneity-robust methods in statistical modeling.
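
The bootstrap step mentioned in the abstract, resampling primary sampling units (PSUs) within strata, can be illustrated with a minimal sketch. This is not the authors' implementation: the data are synthetic, the function names `weighted_prevalence` and `psu_bootstrap` are hypothetical, and the sketch assumes a plain with-replacement draw of all n_h PSUs in each stratum with no weight rescaling (the abstract does not say whether a rescaled variant such as Rao-Wu is used). It shows only how a design-based standard error and percentile interval for a survey-weighted prevalence could be obtained.

```python
import numpy as np

def weighted_prevalence(y, w):
    """Survey-weighted proportion: sum(w * y) / sum(w)."""
    return float(np.sum(w * y) / np.sum(w))

def psu_bootstrap(y, w, strata, psu, n_boot=500, seed=0):
    """Resample whole PSUs with replacement within each stratum.

    Assumption (not from the source): each stratum contributes as many
    resampled PSUs as it has observed PSUs, and weights are not rescaled.
    Returns an array of bootstrap replicates of the weighted prevalence.
    """
    rng = np.random.default_rng(seed)
    # For each stratum, store the row indices that belong to each PSU.
    groups = {}
    for h in np.unique(strata):
        in_h = strata == h
        groups[h] = [np.flatnonzero(in_h & (psu == c))
                     for c in np.unique(psu[in_h])]
    reps = np.empty(n_boot)
    for b in range(n_boot):
        rows = []
        for clusters in groups.values():
            # Draw PSU positions with replacement within this stratum.
            draw = rng.integers(0, len(clusters), size=len(clusters))
            rows.extend(clusters[j] for j in draw)
        idx = np.concatenate(rows)
        reps[b] = weighted_prevalence(y[idx], w[idx])
    return reps

# Synthetic clustered sample: 4 strata, 10 PSUs per stratum, 20 units per PSU.
rng = np.random.default_rng(42)
n_strata, n_psu, n_obs = 4, 10, 20
strata = np.repeat(np.arange(n_strata), n_psu * n_obs)
psu = np.tile(np.repeat(np.arange(n_psu), n_obs), n_strata)
# A PSU-level random effect induces within-cluster correlation.
cluster_effect = rng.normal(0.0, 0.5, size=(n_strata, n_psu))[strata, psu]
p = 1.0 / (1.0 + np.exp(-(-1.2 + cluster_effect)))
y = rng.binomial(1, p).astype(float)
w = rng.uniform(0.5, 2.0, size=y.size)  # illustrative sampling weights

estimate = weighted_prevalence(y, w)
reps = psu_bootstrap(y, w, strata, psu, n_boot=300, seed=1)
se = reps.std(ddof=1)
ci_low, ci_high = np.percentile(reps, [2.5, 97.5])
print(f"prevalence={estimate:.3f}, SE={se:.3f}, 95% CI=({ci_low:.3f}, {ci_high:.3f})")
```

Resampling whole PSUs rather than individual respondents keeps the within-cluster correlation intact in every replicate, which is why the resulting interval reflects the survey design rather than an assumed simple random sample.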

Version published to 10.3390/math14091554
May 4, 2026
Version published to 10.20944/preprints202604.0571.v1
Apr 9, 2026

