A Transparent and Generalizable Deep Learning Framework for Genomic Ancestry Prediction


Abstract

Accurately capturing genetic ancestry is critical for ensuring reproducibility and fairness in genomic studies and downstream health research. This study addresses the prediction of ancestry from genetic data using deep learning, with a focus on generalizability across datasets with diverse populations and on explainability to improve model transparency. We adapt the Diet Network, a deep learning architecture proven effective on high-dimensional data, to learn population ancestry from single nucleotide polymorphism (SNP) data using the Thousand Genomes Project population dataset. Our results show that the model generalizes to diverse populations in the CARTaGENE and Montreal Heart Institute biobanks and that its predictions remain robust to high levels of missing SNPs. We show that, despite the absence of North African populations in the training dataset, the model learns latent representations that reflect meaningful population structure for North African individuals in the biobanks. To improve model transparency, we apply the Saliency Maps, DeepLift, GradientShap and Integrated Gradients attribution techniques and evaluate their performance in identifying the SNPs leveraged by the model. Using DeepLift, we show that the model's predictions are driven by population-specific signals consistent with those identified by traditional population genetics metrics. This work presents a generalizable and interpretable deep learning framework for genetic ancestry inference in large-scale biobanks with genetic data. By enabling more widespread characterization of genomic ancestry in these cohorts, this study contributes practical tools for integrating genetic data into downstream biomedical applications, supporting more inclusive and equitable healthcare solutions.
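As a rough illustration of the Integrated Gradients attribution technique mentioned in the abstract, the sketch below applies it to a hypothetical linear SNP scorer (not the authors' Diet Network; all weights, data, and names are illustrative). For a linear model the attributions reduce exactly to (x − baseline) · w, which lets us sanity-check the path-integral approximation via the completeness axiom.

```python
import numpy as np

rng = np.random.default_rng(0)
n_snps = 50
w = rng.normal(size=n_snps)  # hypothetical per-SNP weights
b = 0.1

def model(x):
    # toy linear "ancestry" score over genotype dosages
    return x @ w + b

def grad(x):
    # gradient of the linear model is constant in x
    return w

def integrated_gradients(x, baseline, steps=64):
    # average the gradients along the straight-line path
    # from the baseline to the input (midpoint Riemann sum)
    alphas = (np.arange(steps) + 0.5) / steps
    avg_grad = np.mean(
        [grad(baseline + a * (x - baseline)) for a in alphas], axis=0
    )
    return (x - baseline) * avg_grad

x = rng.integers(0, 3, size=n_snps).astype(float)  # genotypes in {0, 1, 2}
baseline = np.zeros(n_snps)                        # all-reference baseline
attr = integrated_gradients(x, baseline)

# Completeness axiom: attributions sum to f(x) - f(baseline)
print(np.allclose(attr.sum(), model(x) - model(baseline)))  # True
```

In practice a library such as Captum computes these gradients for a trained neural network via backpropagation; the choice of baseline (here, an all-reference-allele genotype vector) is an assumption that materially affects which SNPs receive attribution.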

Article activity feed

  1. You observed that for ambiguous cases or high levels of missing data, the model tended to predict the PUR population, suggesting it acts as a "default". Since PUR is an admixed population, does this imply the model learns that a state of high uncertainty or mixed/missing signals is most characteristic of admixed genomes in the training set? Could this "default" behavior be mitigated by training with a null or "uncertain" class?