From Correlation to Causation: Causal Machine Learning for Mining Candidate Gene on Genotype-Phenotype Association Data

Abstract

Identifying candidate genes with true causal effects is crucial for uncovering the genetic mechanisms of complex traits and advancing crop improvement. Traditional approaches such as genome-wide association studies and conventional machine learning are primarily correlation-based. Although these methods have revealed numerous genotype–phenotype associations, they often fail to distinguish true causal effects from indirect associations caused by linkage disequilibrium or confounding factors. To overcome this limitation and move from correlation to causation, we propose a two-stage framework that integrates ensemble learning with double machine learning to uncover candidate genes with potential causal roles. In the first stage, important single nucleotide polymorphisms (SNPs) are prioritized using multiple ensemble models. In the second stage, the causal effects of these SNPs are rigorously estimated while adjusting for high-dimensional confounders, thereby revealing their genetic contributions to complex traits and providing reliable targets for molecular breeding. When applied to maize genotype–phenotype data, the framework identifies biologically meaningful SNPs and highlights candidate genes associated with key traits. The experimental results demonstrate a robust and interpretable strategy for causal gene discovery that bridges the gap between statistical association and biological causality and opens new avenues for crop genomics and genetic improvement. The code and usage instructions are available at https://github.com/YaxinZhang230/DML.
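
To make the second stage concrete, the following is a minimal sketch of double machine learning in its partialling-out form with cross-fitting, run on simulated data. It is not the authors' implementation (see the repository linked above for that). The function names, the simulated genotypes, the effect size of 0.5, and the use of closed-form ridge regression as the nuisance learner are all assumptions made for illustration; ridge is used only to keep the sketch dependency-light, and the ensemble learners described in the abstract would take its place in practice.

```python
import numpy as np


def ridge_fit_predict(X_train, y_train, X_test, alpha=1.0):
    """Closed-form ridge regression, used here as a stand-in nuisance learner."""
    x_mean = X_train.mean(axis=0)
    y_mean = y_train.mean()
    Xc = X_train - x_mean
    p = Xc.shape[1]
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(p), Xc.T @ (y_train - y_mean))
    return (X_test - x_mean) @ coef + y_mean


def dml_partial_out(y, d, X, n_folds=5, alpha=1.0, seed=0):
    """Estimate the effect of d on y, adjusting for X, via cross-fitted residuals."""
    rng = np.random.default_rng(seed)
    n = len(y)
    folds = np.array_split(rng.permutation(n), n_folds)
    y_res = np.empty(n)
    d_res = np.empty(n)
    for test_idx in folds:
        train = np.ones(n, dtype=bool)
        train[test_idx] = False
        # Nuisance models are fit on the training folds and applied to the held-out fold.
        y_res[test_idx] = y[test_idx] - ridge_fit_predict(X[train], y[train], X[test_idx], alpha)
        d_res[test_idx] = d[test_idx] - ridge_fit_predict(X[train], d[train], X[test_idx], alpha)
    # Final stage: regress outcome residuals on treatment residuals.
    theta = (d_res @ y_res) / (d_res @ d_res)
    # Standard error from the orthogonal score.
    psi = d_res * (y_res - theta * d_res)
    se = np.sqrt(np.mean(psi ** 2) / n) / np.mean(d_res ** 2)
    return theta, se


# Simulated example (hypothetical data, not the maize dataset).
rng = np.random.default_rng(42)
n, p = 2000, 20
X = rng.binomial(2, 0.3, size=(n, p)).astype(float)  # confounding SNPs coded 0/1/2
d = 0.5 * X[:, 0] + rng.normal(size=n)               # dosage-like target variant, correlated with X[:, 0]
beta = np.zeros(p)
beta[:3] = [2.0, -1.0, 0.5]                          # confounders also affect the trait
true_theta = 0.5
y = true_theta * d + X @ beta + rng.normal(size=n)

# Naive association: slope of y on d, ignoring confounders.
naive = np.cov(d, y)[0, 1] / np.var(d, ddof=1)
theta_hat, se_hat = dml_partial_out(y, d, X)

print(f"naive slope: {naive:.3f}")
print(f"DML estimate: {theta_hat:.3f} (SE {se_hat:.3f}), true effect {true_theta}")
```

In this simulation the naive slope absorbs the effect of the correlated confounder, while the residual-on-residual estimate targets the effect of the variant itself. Cross-fitting keeps each sample's nuisance predictions out of sample, which is what allows flexible learners to be used in the first step without biasing the final estimate.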
