Multiple instance fine-mapping: predicting causal regulatory variants with a deep sequence model

Alexander Rakowski
Christoph Lippert

Abstract

Identifying causal genetic variants computationally remains an open problem. Training end-to-end prediction models is not possible without large ground-truth datasets, while the results of genome-wide association studies (GWAS) are entangled by linkage disequilibrium (LD), and gene expression datasets do not contain genetic variation at the individual level. Here, we propose Multiple Instance Fine-mapping (MIFM), a multiple instance learning (MIL) objective that overcomes the lack of strong labels by grouping putatively causal variants together based on their LD scores. Using MIFM, we trained a deep classifier on a dataset aggregating over 13,000 GWAS to predict causal variants from their underlying DNA sequences. We validated the variants prioritized by MIFM by constructing polygenic risk scores that transferred better to different target ancestries. Furthermore, we demonstrated how MIFM can be used to disentangle the effect sizes of highly correlated variants to better fine-map GWAS results.
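The idea of a multiple instance objective over LD-based groups can be illustrated with a minimal sketch. The abstract does not state how bags are built, how per-variant predictions are pooled, or what the negative bags are, so everything below is an assumption made for illustration: the function names, the r² threshold of 0.8, the noisy-OR pooling, and the toy labels (1 for a bag around an associated lead variant, 0 for a control bag) are hypothetical and are not taken from the MIFM paper.

```python
import math

def make_bags(lead_indices, ld_r2, threshold=0.8):
    """Assumed bag construction: each lead variant is grouped with every
    variant whose LD r^2 with it is at least `threshold`."""
    return [[j for j, r2 in enumerate(ld_r2[i]) if r2 >= threshold]
            for i in lead_indices]

def bag_probability(instance_probs):
    """Noisy-OR pooling (an assumed choice): probability that at least
    one variant in the bag is causal."""
    none_causal = 1.0
    for p in instance_probs:
        none_causal *= 1.0 - p
    return 1.0 - none_causal

def mil_loss(instance_probs, bags, bag_labels, eps=1e-7):
    """Mean binary cross-entropy computed on bag-level probabilities,
    so no per-variant (strong) label is needed."""
    total = 0.0
    for bag, y in zip(bags, bag_labels):
        p = bag_probability([instance_probs[j] for j in bag])
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(bags)

# Toy data: four variants, two LD blocks (values are made up).
ld_r2 = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.2, 0.0],
    [0.1, 0.2, 1.0, 0.85],
    [0.0, 0.0, 0.85, 1.0],
]
bags = make_bags(lead_indices=[0, 2], ld_r2=ld_r2)
# Per-variant scores that a sequence model would output (made up here).
variant_probs = [0.5, 0.5, 0.1, 0.2]
loss = mil_loss(variant_probs, bags, bag_labels=[1, 0])
```

In a real training loop the per-variant probabilities would come from a differentiable sequence model and the loss would be minimized with an autodiff framework; plain Python is used here only to keep the sketch self-contained.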

Author summary

Genome-wide association studies have identified tens of thousands of genetic variants associated with traits or diseases. However, the majority of identified variants are only spuriously correlated with the phenotype of interest and have no causal effect on it. Instead, these variants are often inherited together with nearby biologically causal variants, which creates the spurious associations. Fine-mapping, i.e., predicting which variants are causal, is crucial for downstream tasks such as uncovering the biological mechanisms affecting the phenotype or robustly identifying individuals with a high genetic risk of a disease. While most fine-mapping methods are based on the available association statistics or functional annotations of genetic regions, it should be possible to identify causal variants based on their neighboring DNA sequences. However, training a standard machine learning classifier for that task is obstructed by the scarcity of strong, ground-truth labels. Here, we propose a method to train sequence models that predict variant causality using weakly labeled data. We trained a model on a large set of associated variants and demonstrated its utility by improving cross-ancestry predictions of genetic risk and by disentangling the effect sizes of highly correlated variants.
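As a rough illustration of how prioritized variants could feed into a genetic risk prediction, the sketch below computes a standard polygenic risk score as the sum of allele dosages multiplied by GWAS effect sizes, restricted to variants whose predicted causality score exceeds a cutoff. The restriction step, the cutoff value, and all names and numbers are assumptions for illustration; the summary does not describe how the authors actually built their scores.

```python
def polygenic_risk_score(dosages, effect_sizes, causal_scores=None, cutoff=0.0):
    """Sum of dosage * effect size over the selected variants.

    dosages:       allele counts (0, 1 or 2) for one individual
    effect_sizes:  per-variant GWAS effect estimates
    causal_scores: optional per-variant model scores (hypothetical input);
                   when given, only variants scoring >= cutoff are used
    """
    if causal_scores is None:
        causal_scores = [1.0] * len(effect_sizes)  # keep every variant
    return sum(g * b
               for g, b, s in zip(dosages, effect_sizes, causal_scores)
               if s >= cutoff)

# Made-up values for a single individual and three variants.
dosages = [0, 1, 2]
effect_sizes = [0.2, -0.1, 0.05]
causal_scores = [0.9, 0.1, 0.5]

prs_all = polygenic_risk_score(dosages, effect_sizes)
prs_prioritized = polygenic_risk_score(dosages, effect_sizes,
                                       causal_scores, cutoff=0.4)
```

Here the second variant is dropped from the prioritized score because its hypothetical causality score falls below the cutoff.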

Version published to 10.1101/2025.06.13.25329551 on medRxiv
Jun 14, 2025

Related articles

Causal splicing variants revealed by deep-learning integration of single-cell sQTL mapping under influenza infection

This article has 8 authors:
1. Liuyang Wang
2. Guinevere Connelly
3. Trisha Dalapati
4. Angela Jones
5. Benjamin Schott
6. Joseph Trimarco
7. Nicholas Heaton
8. Dennis Ko
This article has no evaluations. Latest version: Jan 6, 2026

Path-Probability Models Outperform Point-Estimate Scores for Noncoding GWAS Gene Prioritization

This article has 1 author:
1. Abduxoliq Ashuraliyev
This article has no evaluations. Latest version: Dec 22, 2025

Decoding Complex Genotype-Phenotype Interactions by Discretizing the Genome

This article has 6 authors:
1. Jędrzej Kubica
2. Hetvi Jethwani
3. Krzysztof H. Banecki
4. Mauricio Moldes
5. Dariusz Plewczynski
6. Ben Busby
This article has no evaluations. Latest version: Dec 17, 2025
