“Double Machine Learning for Causal Inference in High-Dimensional Electronic Health Records”


Abstract

Background

Estimating causal effects in observational health data is challenging due to confounding by indication. Traditional approaches such as inverse probability of treatment weighting (IPTW) rely on correct model specification, which is difficult to achieve in high-dimensional settings. We implemented a practical offset-based double machine learning (Offset-DML) framework for estimating binary treatment effects on the log-odds scale using logistic regression.
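For readers less familiar with the IPTW baseline, the following is a minimal sketch of how unstabilised and stabilised weights are commonly computed from estimated propensity scores. It is illustrative only and not taken from the study; the function name `iptw_weights` and its arguments are hypothetical, and the propensity scores are assumed to come from a separately fitted treatment model.

```python
def iptw_weights(treatment, propensity, stabilised=False):
    """Return one IPTW weight per subject.

    treatment  : list of 0/1 treatment indicators
    propensity : list of estimated P(treatment = 1 | covariates), each in (0, 1)
    stabilised : if True, multiply by the marginal treatment probability
    """
    # Marginal probability of treatment, used only for stabilised weights.
    p_treated = sum(treatment) / len(treatment)

    weights = []
    for a, e in zip(treatment, propensity):
        if a == 1:
            numerator = p_treated if stabilised else 1.0
            weights.append(numerator / e)
        else:
            numerator = (1.0 - p_treated) if stabilised else 1.0
            weights.append(numerator / (1.0 - e))
    return weights


if __name__ == "__main__":
    a = [1, 0, 1, 0]
    e = [0.5, 0.5, 0.8, 0.2]
    print(iptw_weights(a, e))
    print(iptw_weights(a, e, stabilised=True))
```

Because each weight divides by the estimated propensity score (or its complement), a misspecified treatment model feeds directly into the weights, which is the sensitivity to model specification noted above.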

Methods

We conducted a plasmode simulation study based on real-world clinical data, varying sample size (5,000, 10,000, 20,000) and outcome prevalence (5%, 10%, 20%), with 200 repetitions. We compared IPTW, stabilised IPTW, Offset-DML (with and without cross-fitting), and high-dimensional DML (HD-DML). Performance was assessed using absolute bias, empirical standard error, and root mean square error relative to the true average causal effect.
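The three performance metrics can be sketched as follows for a single estimator in a single scenario. This is a minimal illustration under assumed conventions, not the study's code: absolute bias is taken as the absolute difference between the mean estimate and the truth, and the empirical standard error as the sample standard deviation of the estimates across repetitions. The function name `performance_metrics` is hypothetical.

```python
import math
from statistics import mean, stdev


def performance_metrics(estimates, true_effect):
    """Summarise repeated estimates of one estimator against a known truth.

    estimates   : list of point estimates, one per simulation repetition
    true_effect : the known causal effect used to generate the simulated data
    """
    # Absolute bias: distance between the average estimate and the truth.
    abs_bias = abs(mean(estimates) - true_effect)

    # Empirical standard error: sample SD of the estimates (n - 1 denominator).
    empirical_se = stdev(estimates)

    # RMSE: square root of the mean squared deviation from the truth.
    squared_errors = [(est - true_effect) ** 2 for est in estimates]
    rmse = math.sqrt(mean(squared_errors))

    return {"abs_bias": abs_bias, "empirical_se": empirical_se, "rmse": rmse}


if __name__ == "__main__":
    # Toy values for illustration only; these are not results from the study.
    print(performance_metrics([0.4, 0.5, 0.6], true_effect=0.5))
```

RMSE combines both sources of error: its square equals the squared bias plus the variance of the estimates (computed with an n denominator), so an estimator can rank well on bias alone and still have a high RMSE if it is unstable across repetitions.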

Results

Across most scenarios, DML-based approaches outperformed IPTW methods in terms of bias and empirical standard error, particularly at larger sample sizes. Offset-DML performed comparably to HD-DML while avoiding the convergence issues observed with HD-DML in sparse data settings. All DML methods had overlapping confidence intervals in most scenarios.

Conclusion

Offset-DML is a practical and robust alternative for causal inference in high-dimensional health data. Future work should investigate extensions to other outcomes and diagnostics to assess confounding control.

Key messages

  • Double machine learning-based methods consistently outperform IPTW in terms of bias and empirical standard error, particularly in large-sample and sparse-data scenarios.

  • Offset double machine learning is a practical and robust method for estimating binary causal effects in high-dimensional settings.

  • Unlike high-dimensional double machine learning, the offset-based double machine learning approach converged consistently across all scenarios, including those with low outcome prevalence and small sample sizes.
