“Double Machine Learning for Causal Inference in High-Dimensional Electronic Health Records”

Mike Du
Yuchen Guo
Xintong Li
Marti Catala
Daniel Pareto-Alhambra


Abstract

Background

Estimating causal effects from observational health data is challenging because of confounding by indication. Traditional approaches such as inverse probability of treatment weighting (IPTW) rely on correct model specification, which is difficult to achieve in high-dimensional settings. We implemented a practical offset-based double machine learning (Offset-DML) framework for estimating binary treatment effects on the log-odds scale using logistic regression.
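The abstract does not spell out the Offset-DML algorithm, so the following is only a minimal sketch of how the terms "offset", "log-odds scale", and "cross-fitting" could fit together; it is not the authors' implementation. It assumes a binary treatment `t`, a binary outcome `y`, and a plain logistic nuisance model standing in for whatever machine learning learner is actually used. The function names (`fit_logistic`, `cross_fitted_offset_logit`) and all parameters are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, offset=None, ridge=1e-6, n_iter=50):
    """Newton-Raphson logistic regression with an optional fixed offset.

    A tiny ridge term (applied to every coefficient) keeps the Hessian
    invertible; it is a numerical convenience, not part of the method.
    """
    n, p = X.shape
    offset = np.zeros(n) if offset is None else offset
    beta = np.zeros(p)
    for _ in range(n_iter):
        mu = sigmoid(offset + X @ beta)
        grad = X.T @ (y - mu) - ridge * beta
        hess = (X * (mu * (1.0 - mu))[:, None]).T @ X + ridge * np.eye(p)
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < 1e-8:
            break
    return beta

def cross_fitted_offset_logit(X, t, y, n_folds=5, seed=0):
    """Estimate a log-odds treatment effect with a cross-fitted offset.

    Assumed recipe (illustrative only):
      1. Split the data into folds.
      2. On each training split, fit a nuisance outcome model on
         intercept + treatment + covariates.
      3. For the held-out fold, keep only the intercept and covariate
         part of the linear predictor as a fixed offset.
      4. Fit a logistic regression of y on t alone, with that offset.
    """
    X = np.asarray(X, dtype=float)
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    rng = np.random.default_rng(seed)
    fold = rng.permutation(n) % n_folds          # balanced random folds
    design = np.column_stack([np.ones(n), t, X])  # intercept, treatment, covariates
    offset = np.empty(n)
    for k in range(n_folds):
        train, held_out = fold != k, fold == k
        beta = fit_logistic(design[train], y[train])
        # Out-of-fold covariate contribution; treatment coefficient (index 1) is dropped.
        offset[held_out] = beta[0] + X[held_out] @ beta[2:]
    # Final stage: only the treatment coefficient is estimated.
    theta = fit_logistic(t[:, None], y, offset=offset)[0]
    return theta
```

Because the covariate contribution enters as a fixed offset, the final stage estimates a single coefficient, which may be one reason an offset formulation is less prone to convergence problems in sparse data; the abstract reports that behaviour but does not explain it.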

Methods

We conducted a plasmode simulation study based on real-world clinical data, varying the sample size (5,000, 10,000, and 20,000) and the outcome prevalence (5%, 10%, and 20%), with 200 repetitions. We compared IPTW, stabilised IPTW, Offset-DML (with and without cross-fitting), and high-dimensional DML (HD-DML). Performance was measured by absolute bias, empirical standard error, and root mean square error relative to the true average causal effect.
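The three performance metrics can be computed from the vector of estimates obtained across simulation repetitions. The sketch below uses common textbook definitions, which are an assumption here: absolute bias as the absolute difference between the mean estimate and the truth, empirical standard error as the sample standard deviation of the estimates, and root mean square error around the true effect. The function name `performance_metrics` is hypothetical.

```python
import numpy as np

def performance_metrics(estimates, true_effect):
    """Summarise simulation estimates against a known true effect."""
    est = np.asarray(estimates, dtype=float)
    abs_bias = abs(est.mean() - true_effect)             # |mean estimate - truth|
    emp_se = est.std(ddof=1)                             # spread across repetitions
    rmse = np.sqrt(np.mean((est - true_effect) ** 2))    # combines bias and spread
    return {"abs_bias": abs_bias, "emp_se": emp_se, "rmse": rmse}
```

For example, `performance_metrics(estimates, true_effect)` would be called once per method and scenario, with `estimates` holding the 200 repetition-level estimates.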

Results

Across most scenarios, DML-based approaches outperformed IPTW methods in terms of bias and empirical standard error, particularly at larger sample sizes. Offset-DML performed comparably to HD-DML while avoiding the convergence issues observed with HD-DML in sparse-data settings. The confidence intervals of all DML methods overlapped in most scenarios.
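For reference, the two IPTW comparators differ only in how the weights are scaled. The sketch below uses the standard definitions for the average treatment effect; it assumes the propensity scores have already been estimated, and the function name `iptw_weights` and the `clip` safeguard are hypothetical additions, not details from the preprint.

```python
import numpy as np

def iptw_weights(treatment, propensity, stabilised=False, clip=1e-6):
    """Inverse probability of treatment weights for a binary treatment.

    Unstabilised: 1/e(x) for treated, 1/(1 - e(x)) for untreated.
    Stabilised:   the same weights multiplied by the marginal probability
                  of the treatment actually received.
    """
    t = np.asarray(treatment, dtype=float)
    # Clip propensities away from 0 and 1 to avoid division by zero.
    e = np.clip(np.asarray(propensity, dtype=float), clip, 1.0 - clip)
    w = t / e + (1.0 - t) / (1.0 - e)
    if stabilised:
        p_treated = t.mean()
        w = w * (t * p_treated + (1.0 - t) * (1.0 - p_treated))
    return w
```

Stabilisation leaves the point estimate of a saturated weighted outcome model unchanged but typically reduces the variability of the weights, which is why both variants are commonly reported.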

Conclusion

Offset-DML is a practical and robust alternative for causal inference in high-dimensional health data. Future work should investigate extensions to other outcome types, as well as diagnostics for assessing confounding control.

Key messages

Methods based on double machine learning consistently outperform IPTW in bias and empirical standard error, particularly with large sample sizes and in sparse-data scenarios.
Offset double machine learning is a practical and robust method for estimating binary causal effects in high-dimensional settings.
Unlike high-dimensional double machine learning, the offset-based approach converged consistently across all scenarios, including those with low outcome prevalence and small sample sizes.

Version published to 10.1101/2025.07.21.25331944 on medRxiv
Jul 22, 2025

Related articles

Triangulated causal inference with deep counterfactual learning for individualized statin-associated type 2 diabetes risk

This article has 13 authors:
1. Hao Zhou
2. Jorge Passamani Zubelli
3. Haralampos Hatzikirou
4. Andreas Henschel
5. Laurent Alain Najman
6. Daniel E. Platt
7. Antonello Maruotti
8. Siobhan O’Sullivan
9. Lithe Basbous
10. Cynthia Al Hageh
11. Mariam AlHarbi
12. Antoine Abchee
13. Pierre Zalloua
This article has no evaluations
Latest version Jan 27, 2026

Semiparametric Outcome Regression-Based Estimator of Mann-Whitney-type Causal Effect

This article has 12 authors:
1. Safiya S. Sani
2. Bryan S. Blette
3. Chun Li
4. Abubakar Yahaya
5. Hussaini G. Dikko
6. Abubakar Usman
7. Usman J. Wudil
8. Faisal Dankishiya
9. Nafi’u Hussaini
10. C. William Wester
11. Muktar H. Aliyu
12. Bryan E. Shepherd
This article has no evaluations
Latest version Feb 2, 2026

Heterogeneous Treatment Effect Estimation with Instrumental Variable Methods

This article has 3 authors:
1. Amir Aamodt Kazemi
2. Joseph Sexton
3. Inge Christoffer Olsen
This article has no evaluations
Latest version Jan 13, 2026
