A Transparent and Generalizable Deep Learning Framework for Genomic Ancestry Prediction


Abstract

Accurately capturing genetic ancestry is critical for ensuring reproducibility and fairness in genomic studies and downstream health research. This study addresses the prediction of ancestry from genetic data using deep learning, with a focus on generalizability across datasets with diverse populations and on explainability to improve model transparency. We adapt the Diet Network, a deep learning architecture proven effective on high-dimensional data, to learn population ancestry from single nucleotide polymorphism (SNP) data using the Thousand Genomes Project population dataset. Our results show that the model generalizes to diverse populations in the CARTaGENE and Montreal Heart Institute biobanks and that its predictions remain robust to high levels of missing SNPs. We show that, despite the absence of North African populations in the training dataset, the model learns latent representations that reflect meaningful population structure for North African individuals in the biobanks. To improve model transparency, we apply the Saliency Maps, DeepLift, GradientShap and Integrated Gradients attribution techniques and evaluate their performance in identifying the SNPs leveraged by the model. Using DeepLift, we show that the model's predictions are driven by population-specific signals consistent with those identified by traditional population genetics metrics. This work presents a generalizable and interpretable deep learning framework for genetic ancestry inference in large-scale biobanks with genetic data. By enabling more widespread characterization of genomic ancestry in these cohorts, this study contributes practical tools for integrating genetic data into downstream biomedical applications, supporting more inclusive and equitable healthcare solutions.
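As a rough illustration of the Integrated Gradients attribution technique mentioned in the abstract, the sketch below applies it to a hypothetical linear SNP scorer (not the authors' Diet Network; all weights, data, and names are illustrative). For a linear model the attributions reduce exactly to (x − baseline) · w, which lets us sanity-check the path-integral approximation via the completeness axiom.

```python
import numpy as np

rng = np.random.default_rng(0)
n_snps = 50
w = rng.normal(size=n_snps)  # hypothetical per-SNP weights
b = 0.1

def model(x):
    # toy linear "ancestry" score over genotype dosages
    return x @ w + b

def grad(x):
    # gradient of the linear model is constant in x
    return w

def integrated_gradients(x, baseline, steps=64):
    # average the gradients along the straight-line path
    # from the baseline to the input (midpoint Riemann sum)
    alphas = (np.arange(steps) + 0.5) / steps
    avg_grad = np.mean(
        [grad(baseline + a * (x - baseline)) for a in alphas], axis=0
    )
    return (x - baseline) * avg_grad

x = rng.integers(0, 3, size=n_snps).astype(float)  # genotypes in {0, 1, 2}
baseline = np.zeros(n_snps)                        # all-reference baseline
attr = integrated_gradients(x, baseline)

# Completeness axiom: attributions sum to f(x) - f(baseline)
print(np.allclose(attr.sum(), model(x) - model(baseline)))  # True
```

In practice a library such as Captum computes these gradients for a trained neural network via backpropagation; the choice of baseline (here, an all-reference-allele genotype vector) is an assumption that materially affects which SNPs receive attribution.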

Article activity feed

  1. You observed that for ambiguous cases or high levels of missing data, the model tended to predict the PUR population, suggesting it acts as a "default". Since PUR is an admixed population, does this imply the model learns that a state of high uncertainty or mixed/missing signals is most characteristic of admixed genomes in the training set? Could this "default" behavior be mitigated by training with a null or "uncertain" class?