From Correlation to Causation: Causal Machine Learning for Mining Candidate Gene on Genotype-Phenotype Association Data

Yaxin Zhang
Yu Song
Quanling Zhao
Deqing Peng
Han Qiao
Lichao Peng
Xiaohui Yang


Abstract

Identifying candidate genes with true causal effects is crucial for uncovering the genetic mechanisms of complex traits and advancing crop improvement. Traditional approaches such as genome-wide association studies (GWAS) and machine learning are primarily correlation-based. Although these methods have revealed numerous genotype–phenotype associations, they often fail to distinguish true causal effects from indirect associations caused by linkage disequilibrium or confounding factors. To overcome this limitation and shift from correlation to causation, we propose a two-stage framework that integrates ensemble learning with double machine learning to uncover candidate genes with potential causal roles. In the first stage, important single nucleotide polymorphisms (SNPs) are prioritized using multiple ensemble models. In the second stage, the causal effects of these SNPs are estimated while adjusting for high-dimensional confounders, thereby revealing their genetic contributions to complex traits and providing reliable targets for molecular breeding. When applied to maize genotype–phenotype data, the framework not only identifies biologically meaningful SNPs but also highlights candidate genes associated with key traits. The experimental results indicate that the framework is a robust and interpretable strategy for causal gene discovery, bridging the gap between statistical association and biological causality and opening new avenues for crop genomics and genetic improvement. The code and usage instructions are available at https://github.com/YaxinZhang230/DML.
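
The two stages described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (their code is in the repository linked above); it assumes scikit-learn, uses a random forest and gradient boosting as stand-ins for the "multiple ensemble models", uses cross-validated Lasso as the nuisance learner, and estimates each effect with the standard cross-fitted partialling-out form of double machine learning. The function names `rank_snps` and `dml_effect` and the synthetic genotypes are hypothetical and chosen for illustration only.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold


def rank_snps(X, y, top_k=5, seed=0):
    """Stage 1 (sketch): average normalized importances from two ensembles."""
    models = [
        RandomForestRegressor(n_estimators=200, random_state=seed),
        GradientBoostingRegressor(random_state=seed),
    ]
    scores = np.zeros(X.shape[1])
    for model in models:
        model.fit(X, y)
        importance = model.feature_importances_
        scores += importance / importance.sum()
    # Indices of the top_k SNPs, highest combined score first.
    return np.argsort(scores)[::-1][:top_k]


def dml_effect(X, y, snp, n_splits=5, seed=0):
    """Stage 2 (sketch): cross-fitted partialling-out estimate for one SNP."""
    d = X[:, snp]                   # treatment: the SNP of interest
    W = np.delete(X, snp, axis=1)   # all other SNPs act as confounders
    res_y = np.zeros(len(y))
    res_d = np.zeros(len(y))
    folds = KFold(n_splits, shuffle=True, random_state=seed)
    for train, test in folds.split(W):
        # Nuisance models are fit on the training folds only (cross-fitting).
        res_y[test] = y[test] - LassoCV(cv=3).fit(W[train], y[train]).predict(W[test])
        res_d[test] = d[test] - LassoCV(cv=3).fit(W[train], d[train]).predict(W[test])
    # Regress outcome residuals on treatment residuals.
    theta = (res_d @ res_y) / (res_d @ res_d)
    # Standard error from the orthogonal score.
    psi = (res_y - theta * res_d) * res_d
    se = np.sqrt(np.mean(psi**2) / len(y)) / np.mean(res_d**2)
    return theta, se


# Synthetic genotypes coded 0/1/2 (assumption: illustrative data, not maize data).
rng = np.random.default_rng(0)
n, p = 600, 30
X = rng.binomial(2, 0.3, size=(n, p)).astype(float)
# SNP 1 mimics linkage with SNP 0: it copies SNP 0 for about 70% of samples.
linked = rng.random(n) < 0.7
X[linked, 1] = X[linked, 0]
# Only SNP 0 and SNP 5 affect the simulated phenotype.
y = 1.5 * X[:, 0] + 0.8 * X[:, 5] + rng.normal(size=n)

top = rank_snps(X, y)
theta_causal, se_causal = dml_effect(X, y, snp=0)
theta_linked, se_linked = dml_effect(X, y, snp=1)
print("prioritized SNP indices:", top)
print(f"SNP 0: effect {theta_causal:.2f} (SE {se_causal:.2f})")
print(f"SNP 1: effect {theta_linked:.2f} (SE {se_linked:.2f})")
```

In this simulation SNP 1 is associated with the phenotype only through its correlation with SNP 0, so a correlation-based ranking may still prioritize it, while its adjusted effect estimate should fall close to zero once SNP 0 is among the confounders. Real genotype data would likely need more flexible nuisance learners and explicit handling of population structure.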

Version published to 10.21203/rs.3.rs-7448320/v1 on Research Square
Sep 8, 2025

Related articles

Path-Probability Models Outperform Point-Estimate Scores for Noncoding GWAS Gene Prioritization

This article has 1 author:
1. Abduxoliq Ashuraliyev
This article has no evaluations. Latest version: Dec 22, 2025

Causal effect heterogeneity estimation using summary statistics

This article has 8 authors:
1. Xingjie Shi
2. Yadong Yang
3. Minxi Bai
4. Jiacheng Miao
5. Stephen Dorn
6. Jonathan Haugstad
7. Jin Liu
8. Qiongshi Lu
This article has no evaluations. Latest version: Jan 14, 2026

Bayesian fine-mapping pinpoints candidate genes and pleiotropic loci of production traits from a chicken backcrossing scheme

This article has 8 authors:
1. Chi Mei Sun
2. Johannes Geibel
3. Henner Simianer
4. Björn Andersson
5. David Cavero
6. Rudolf Preisinger
7. Steffen Weigend
8. Christian Reimer
This article has no evaluations. Latest version: Jan 13, 2026
