Design-Aware Predictive and Causal Modeling of Cardiovascular Risk in Chronic Kidney Disease Using Penalized and Double Machine Learning Approaches


Abstract

We develop a design-aware framework that combines penalized prediction and causal inference for finite populations observed through complex survey designs. The framework integrates survey-weighted pseudo-likelihoods, ℓ1-penalized estimation, Neyman-orthogonal moment functions, and a bootstrap procedure that resamples primary sampling units within strata. Methodologically, the contribution is an explicit pipeline that supports design-based inference while separating predictive associations from structurally adjusted effects in high-dimensional, clustered data. We illustrate the framework using data from the Chilean National Health Survey (ENS) 2016–2017 to study the relationship between chronic kidney disease (CKD) and high cardiovascular (CV) risk. In the ENS adult population, the survey-weighted prevalence of CKD was 3.1% (95% CI: 2.4–3.8), and the prevalence of high CV risk was 23.9% (95% CI: 21.5–26.3). High CV risk was markedly more frequent among individuals with CKD than among those without CKD (90.9% versus 21.5%). Predictive and associational analyses combined survey-weighted penalized logistic regression (LASSO) with refitted unpenalized models. In conventional survey-weighted logistic regressions, CKD showed a strong association with high CV risk (odds ratio = 5.66; 95% CI: 2.71–11.82; p < 0.001), and effect sizes remained stable after LASSO-based variable selection. To assess causal relevance under confounding and potential endogeneity, we implemented two endogeneity-aware estimators: two-stage residual inclusion (2SRI) and double/debiased machine learning (DML). The DML estimator, designated as the primary causal analysis, yields an orthogonalized estimate of the average treatment effect of CKD on the probability of high CV risk. After adjustment for age and major cardiometabolic comorbidities, the DML estimate was attenuated relative to the predictive association and statistically non-significant (average treatment effect = −0.094; 95% CI: [−0.409, 0.220]).
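As an illustration of the resampling idea mentioned in the abstract (drawing primary sampling units with replacement inside each stratum), the following is a minimal sketch for the standard error of a survey-weighted prevalence. It is not the authors' implementation: the function name `psu_bootstrap_mean`, the toy data, and the choice to draw as many units as each stratum contains (rather than a rescaled scheme such as Rao-Wu) are all assumptions made for the example.

```python
import numpy as np

def psu_bootstrap_mean(y, w, strata, psu, n_boot=500, seed=0):
    """Weighted mean and bootstrap SE from resampling PSUs within strata.

    Hypothetical helper: each replicate draws n_h PSUs with replacement
    from the n_h PSUs of stratum h (an assumed, unrescaled variant).
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    strata = np.asarray(strata)
    psu = np.asarray(psu)

    # Collapse every PSU to (sum of w*y, sum of w) so replicates are cheap.
    psu_totals = []
    for h in np.unique(strata):
        in_h = strata == h
        rows = []
        for c in np.unique(psu[in_h]):
            m = in_h & (psu == c)
            rows.append((np.sum(w[m] * y[m]), np.sum(w[m])))
        psu_totals.append(np.array(rows))

    reps = np.empty(n_boot)
    for b in range(n_boot):
        num = den = 0.0
        for tot in psu_totals:
            n_h = len(tot)
            draw = rng.integers(0, n_h, size=n_h)  # whole PSUs, with replacement
            num += tot[draw, 0].sum()
            den += tot[draw, 1].sum()
        reps[b] = num / den  # replicate of the weighted mean

    point = np.sum(w * y) / np.sum(w)
    return point, reps.std(ddof=1)

# Toy data (not ENS): 3 strata, 4 PSUs per stratum, 10 respondents per PSU.
demo_rng = np.random.default_rng(42)
demo_strata = np.repeat([0, 1, 2], 40)
demo_psu = np.tile(np.repeat(np.arange(4), 10), 3)
demo_y = demo_rng.binomial(1, 0.24, size=120)
demo_w = demo_rng.uniform(0.5, 2.0, size=120)
est, se = psu_bootstrap_mean(demo_y, demo_w, demo_strata, demo_psu)
print(f"weighted prevalence = {est:.3f}, bootstrap SE = {se:.3f}")
```

Resampling whole units rather than individual respondents keeps the within-cluster correlation intact in every replicate, which is why the replicate spread reflects the design rather than an independent-observations assumption.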
The 2SRI approach yielded unstable estimates with wide confidence intervals, consistent with the limited effective sample size of CKD cases (n_CKD ≈ 190 in a sample of n ≈ 6233) and weak identification in low-prevalence settings. Simulation experiments under ENS-like complex sampling suggest that naive predictive associations may overestimate the structural contribution of CKD under confounding, whereas orthogonalized estimators yield more conservative estimates when identification holds. The causal interpretation relies on a conditional mean independence assumption given observed covariates and survey design, while control-function specifications are treated as diagnostic sensitivity analyses due to the absence of credible exclusion-based instruments. Overall, the results demonstrate a fundamental divergence between predictive relevance and causal importance in finite-population settings, underscoring the need for design-aware and endogeneity-robust methods in statistical modeling.
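The contrast between a naive association and an orthogonalized estimate can be sketched with a toy simulation. The abstract does not state which orthogonal score or nuisance learners were used, so the code below assumes a partialling-out (residual-on-residual) score with cross-fitting and uses weighted least squares as a stand-in for the machine-learning nuisance models. The names `wls_fit_predict`, `dml_plr`, and `simulate`, the data-generating process, and the uninformative weights are all hypothetical; the simulated exposure has a true effect of zero and is confounded by one covariate.

```python
import numpy as np

def wls_fit_predict(X_tr, t_tr, w_tr, X_te):
    """Weighted least squares with intercept (stand-in nuisance learner)."""
    A = np.column_stack([np.ones(len(X_tr)), X_tr])
    sw = np.sqrt(w_tr)
    beta, *_ = np.linalg.lstsq(A * sw[:, None], t_tr * sw, rcond=None)
    return np.column_stack([np.ones(len(X_te)), X_te]) @ beta

def dml_plr(y, d, X, w, n_folds=5, seed=0):
    """Cross-fitted, weighted partialling-out estimate of the effect of d on y."""
    rng = np.random.default_rng(seed)
    n = len(y)
    # Assumption: folds assigned per respondent; with real survey data one
    # would more plausibly assign whole PSUs to folds.
    folds = rng.permutation(n) % n_folds
    y_res = np.empty(n)
    d_res = np.empty(n)
    for k in range(n_folds):
        te = folds == k
        tr = ~te
        # Nuisances are fit on the other folds, then used out of fold.
        y_res[te] = y[te] - wls_fit_predict(X[tr], y[tr], w[tr], X[te])
        d_res[te] = d[te] - wls_fit_predict(X[tr], d[tr], w[tr], X[te])
    # Solve the weighted orthogonal moment: sum w * d_res * (y_res - theta * d_res) = 0
    return np.sum(w * d_res * y_res) / np.sum(w * d_res ** 2)

def simulate(n=6000, seed=0):
    """Toy confounded data: X[:, 0] drives both exposure and outcome."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, size=(n, 3))
    d = (X[:, 0] + 0.5 * rng.standard_normal(n) > 0.6).astype(float)
    p = 0.35 + 0.25 * X[:, 0]  # outcome probability; true effect of d is zero
    y = (rng.uniform(size=n) < p).astype(float)
    w = rng.uniform(0.5, 1.5, size=n)  # uninformative illustrative weights
    return y, d, X, w

y, d, X, w = simulate()
naive = (np.average(y[d == 1], weights=w[d == 1])
         - np.average(y[d == 0], weights=w[d == 0]))
theta = dml_plr(y, d, X, w)
print(f"naive weighted risk difference = {naive:.3f}")
print(f"cross-fitted orthogonal estimate = {theta:.3f}")
```

In this setup the naive risk difference absorbs the effect of the confounder, while the residualized estimate stays near zero. The sketch omits strata and clustering, so it illustrates only the orthogonalization step, not design-based variance estimation.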
