PRSformer: Disease Prediction from Million-Scale Individual Genotypes

Abstract

Predicting disease risk from DNA presents an unprecedented challenge as biobanks approach population scale (N > 10^6 individuals) with ultra-high-dimensional features (L > 10^5 genotypes). Current methods, which are often linear and rely on summary statistics, fail to capture complex genetic interactions and discard valuable individual-level information. We introduce PRSformer, a scalable deep learning architecture designed for end-to-end, multitask disease prediction directly from million-scale individual genotypes. PRSformer employs neighborhood attention, which achieves linear O(L) complexity per layer and makes Transformers tractable for genome-scale inputs. Crucially, PRSformer stacks these efficient attention layers, progressively widening the effective receptive field so that the model first captures local dependencies (e.g., within linkage disequilibrium blocks) and then integrates information across wider genomic regions. This design, tailored for genomics, allows PRSformer to learn complex, potentially non-linear and long-range interactions directly from raw genotypes. We demonstrate PRSformer's effectiveness on a unique large private cohort (N≈5M), predicting 18 autoimmune and inflammatory conditions from L≈140k variants. PRSformer significantly outperforms both highly optimized linear models trained on the same individual-level data and a state-of-the-art summary-statistic-based method (LDPred2) derived from the same cohort, quantifying the benefits of non-linear modeling and multitask learning at scale. Furthermore, experiments reveal that the advantage of non-linearity emerges primarily at large sample sizes (N>1M) and that a model trained on multiple ancestries generalizes better, establishing PRSformer as a new framework for deep learning in population-scale genomics.
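To make the complexity and receptive-field claims concrete, here is a minimal NumPy sketch of one-dimensional windowed attention. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `radius` parameter, the single attention head, and the masking of out-of-range neighbors at the sequence edges are all hypothetical choices, and the neighborhood attention used in PRSformer may handle these details differently. Each position attends only to a fixed window of 2·radius + 1 neighbors, so the cost per layer grows linearly in L rather than quadratically.

```python
import numpy as np

def neighborhood_attention_1d(x, w_q, w_k, w_v, radius):
    """Single-head windowed attention over a sequence x of shape (L, d).

    Assumption: each position attends to neighbors within +/- radius, and
    neighbors that fall outside the sequence are masked out.
    """
    L, d = x.shape
    q, k, v = x @ w_q, x @ w_k, x @ w_v

    # Window indices for every position: shape (L, 2 * radius + 1).
    offsets = np.arange(-radius, radius + 1)
    idx = np.arange(L)[:, None] + offsets[None, :]
    valid = (idx >= 0) & (idx < L)
    idx = np.clip(idx, 0, L - 1)

    k_win = k[idx]  # (L, W, d): keys gathered per window
    v_win = v[idx]  # (L, W, d): values gathered per window

    # Scores cost O(L * W * d): linear in L for a fixed window size W.
    scores = np.einsum("ld,lwd->lw", q, k_win) / np.sqrt(d)
    scores = np.where(valid, scores, -np.inf)

    # Numerically stable softmax over each window (the center is always valid).
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return np.einsum("lw,lwd->ld", weights, v_win)

def receptive_field(num_layers, radius):
    """Number of input positions that can influence one output position
    after stacking num_layers windowed layers of the same radius."""
    return 1 + 2 * radius * num_layers
```

Under these assumptions the receptive field grows by 2·radius per layer, so three stacked layers with radius 2 see 13 input positions, which is the sense in which stacking local layers lets information spread across wider regions.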
