Towards whole-genome inference of polygenic scores with fast and memory-efficient algorithms
Abstract
With improved Whole Genome Sequencing (WGS) and variant imputation techniques, modern Genome-wide Association Studies (GWASs) have enriched our understanding of the landscape of genetic associations for thousands of disease phenotypes. However, translating the marginal associations for millions of genetic variants into integrated polygenic risk scores (PRS) that capture their joint effects on the phenotype remains a major challenge. Due to technical and statistical constraints, commonly used PRS methods in this setting either perform heuristic Pruning-and-Thresholding or overlook most genetic association signals by restricting inference to small variant sets, such as HapMap3. Here, we present a set of algorithmic improvements and compact data structures that enable scaling summary-statistics-based PRS inference to tens of millions of variants while avoiding the numerical instabilities common in such high-dimensional settings. These enhancements consist of a highly compressed Linkage-Disequilibrium (LD) matrix format that integrates with streamlined and parallel coordinate ascent updating schemes. When incorporated into our existing PRS method (VIPRS), the new algorithms yield over 50-fold reductions in storage requirements and lead to orders-of-magnitude improvements in runtime and memory efficiency. The updated VIPRS software can now perform Variational Bayesian regression over 1.1 million HapMap3 variants in under a minute. Using this new scalable implementation, we applied VIPRS to 75 of the most heritable continuous phenotypes in the UK Biobank, leveraging marginal associations for up to 18 million bi-allelic variants. Performing inference over this rich association data requires less than 20 minutes of wall-clock time and 15 GB of memory per phenotype. It also delivers consistent gains in cross-population transferability, with an average improvement of 10-15% in incremental R-squared.
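The abstract does not describe the compressed LD format in detail, so the following is only a rough illustration of the general idea, not the VIPRS implementation. It assumes two generic techniques: keeping only the upper-triangular entries within a fixed band of each variant, and quantizing correlations in [-1, 1] to 8-bit integers. The function names `compress_ld` and `decompress_row`, the toy LD matrix, and the bandwidth value are all hypothetical.

```python
import numpy as np

def compress_ld(ld, bandwidth):
    """Keep upper-triangular entries within `bandwidth` of the diagonal
    (the diagonal itself is always 1, so it is skipped) and quantize
    them to int8. Returns CSR-style `data` and `indptr` arrays."""
    m = ld.shape[0]
    chunks = []
    indptr = np.zeros(m + 1, dtype=np.int64)
    for i in range(m):
        stop = min(i + 1 + bandwidth, m)
        row = ld[i, i + 1:stop]
        # Map [-1, 1] onto the integer range [-127, 127].
        chunks.append(np.rint(row * 127).astype(np.int8))
        indptr[i + 1] = indptr[i] + (stop - i - 1)
    return np.concatenate(chunks), indptr

def decompress_row(data, indptr, i):
    """Recover the stored neighbours of variant i as float32 correlations."""
    return data[indptr[i]:indptr[i + 1]].astype(np.float32) / 127.0

# Toy LD matrix (assumption): correlation decays with distance between variants.
m, bandwidth = 200, 20
idx = np.arange(m)
ld = 0.9 ** np.abs(idx[:, None] - idx[None, :])

data, indptr = compress_ld(ld, bandwidth)
dense_bytes = ld.nbytes
compressed_bytes = data.nbytes + indptr.nbytes
print(f"dense: {dense_bytes} bytes, compressed: {compressed_bytes} bytes")
```

In this sketch the savings come from three sources: dropping the symmetric lower triangle, discarding long-range entries outside the band, and storing one byte per entry instead of eight. The price is a bounded quantization error of at most 0.5/127 per correlation, which a real implementation would need to weigh against numerical stability.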