SAFE-LD: A novel method for the estimation of linkage disequilibrium from summary statistics
Abstract
Genome-wide association studies (GWAS) have greatly advanced our understanding of the genetic architecture of complex traits. Downstream analyses of GWAS summary statistics require accurate in-sample linkage disequilibrium (LD), that is, the variant correlations in the same individuals used for the GWAS, as even small discrepancies can propagate into substantial error. In practice, privacy and consent restrictions prevent sharing of individual-level genotypes, forcing researchers either to rely on external reference panels, which reduce accuracy and power, or to store and distribute massive precomputed LD matrices that are inflexible and difficult to analyze. Here we introduce SAFE-LD (Shrinkage and Anonymisation Framework for LD Estimation), a novel method that produces pseudo-genotypes designed to reproduce the exact in-sample LD of a cohort while discarding all individual-level genetic content. SAFE-LD surrogates can be stored in VCF/PGEN formats and used seamlessly with standard pipelines, providing LD estimates indistinguishable from the originals but free from privacy concerns. Using extensive simulations on UK Biobank data, we show that SAFE-LD is robust across genomic regions and population sizes. Notably, SAFE-LD achieves fine-mapping accuracy on par with internal LD and significantly outperforms external LD, even under best-case conditions with cohort-matched reference panels. We further extend this framework to existing GWAS summary statistics through SAFE-LDss, which exploits published summary statistics for numerous traits analyzed on the same samples. SAFE-LD offers a scalable, privacy-preserving, and highly accurate alternative to traditional LD estimation, enabling easy sharing and seamless utilization with standard tools.
By storing compact pseudo-genotypes instead of massive precomputed LD matrices, it also provides a highly efficient solution in terms of disk space and data management, while safeguarding participant privacy and supporting precise fine-mapping.
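The abstract does not specify how the pseudo-genotypes are constructed. As a rough illustration of the underlying idea only, the sketch below shows one generic linear-algebra way to build a surrogate matrix whose sample correlation (LD) matrix matches that of the original genotypes while its rows are derived from random noise. This is not the SAFE-LD algorithm: the function name `surrogate_with_same_ld` is hypothetical, no shrinkage step is included, the toy genotypes are simulated, the surrogate values are real-valued rather than 0/1/2 dosages, and nothing here constitutes a privacy guarantee. It also assumes more samples than variants.

```python
import numpy as np


def surrogate_with_same_ld(genotypes, seed=0):
    """Return a random matrix with the same sample correlation as `genotypes`.

    Hypothetical illustration, not the SAFE-LD method. Assumes n_samples > n_variants.
    """
    rng = np.random.default_rng(seed)
    n, p = genotypes.shape

    # Center each variant; the reduced QR factor r satisfies
    # r.T @ r == centered.T @ centered (the unnormalised covariance).
    centered = genotypes - genotypes.mean(axis=0)
    _, r = np.linalg.qr(centered)

    # Orthonormal columns spanning centered Gaussian noise: each column has
    # zero mean, and the columns are mutually orthogonal with unit norm.
    noise = rng.standard_normal((n, p))
    noise -= noise.mean(axis=0)
    q, _ = np.linalg.qr(noise)

    # (q @ r).T @ (q @ r) == r.T @ r, so covariance and correlation match.
    return q @ r


# Toy data (simulated, assumption): 200 samples, 5 variants, with LD
# induced between the first two variants.
rng = np.random.default_rng(1)
genotypes = rng.binomial(2, 0.3, size=(200, 5)).astype(float)
copy_mask = rng.random(200) < 0.8
genotypes[:, 1] = np.where(copy_mask, genotypes[:, 0], genotypes[:, 1])

surrogate = surrogate_with_same_ld(genotypes)
ld_original = np.corrcoef(genotypes, rowvar=False)
ld_surrogate = np.corrcoef(surrogate, rowvar=False)
print(np.allclose(ld_original, ld_surrogate))
```

The two LD matrices agree up to floating-point error because the surrogate shares the same triangular factor as the centered genotypes, while the row-level content comes entirely from the random orthonormal basis.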