LD Matrix Approximations for Scalable Analysis of High-dimensional Genetic Data
Abstract
Linkage disequilibrium (LD) matrices are an essential part of many statistical genetics methods. However, their high dimensionality makes their computation and storage impractical for large genomic data. Common sparse approximations, such as banded matrices, lose the positive semi-definite (PSD) property, which many downstream analyses rely on for numerical stability. Conversely, methods that guarantee a PSD approximation, like block-diagonal approaches, require coarse approximations of the LD structure. In this work, we present a novel method to approximate an LD matrix with a sparse, banded matrix that is guaranteed to be PSD while preserving the correlation structure within the band. This is done via a reformulation of the nearest correlation matrix problem using the Cholesky decomposition, which implicitly imposes the PSD property and lends itself to a highly scalable parallel implementation. On whole-chromosome data from the 1000 Genomes Project and the UK Biobank, our method builds sparse PSD approximations of the LD matrix that are more accurate than either block-diagonal or shrinkage estimators.
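The two claims about banding and the Cholesky factor can be illustrated on a toy example. The sketch below is not the method described in the abstract: it only shows, on a 3 x 3 equicorrelation matrix chosen for illustration, that zeroing entries outside a band can produce a negative eigenvalue, whereas a matrix built as L Lᵀ from a banded, row-normalized lower-triangular factor L is banded, has unit diagonal, and is PSD by construction. The helper names `band_truncate` and `banded_cholesky_approx` are hypothetical, and truncating the exact Cholesky factor is a crude stand-in for the optimization the authors describe.

```python
import numpy as np

def band_truncate(M, bandwidth):
    """Zero out entries of M farther than `bandwidth` from the diagonal."""
    n = M.shape[0]
    i, j = np.indices((n, n))
    return np.where(np.abs(i - j) <= bandwidth, M, 0.0)

def banded_cholesky_approx(R, bandwidth):
    """Illustrative only: band the Cholesky factor, then rebuild L @ L.T.

    Assumes R is positive definite so np.linalg.cholesky succeeds.
    """
    L = np.linalg.cholesky(R)            # lower-triangular factor of R
    L = band_truncate(L, bandwidth)      # keep only entries inside the band
    # Rescale each row to unit norm so the product has a unit diagonal.
    L = L / np.linalg.norm(L, axis=1, keepdims=True)
    return L @ L.T                       # PSD for any real L

# Toy "LD matrix": every pair of variants has correlation 0.9.
R = np.full((3, 3), 0.9)
np.fill_diagonal(R, 1.0)

naive = band_truncate(R, bandwidth=1)
factored = banded_cholesky_approx(R, bandwidth=1)

min_eig_naive = np.linalg.eigvalsh(naive).min()
min_eig_factored = np.linalg.eigvalsh(factored).min()

print("smallest eigenvalue, naive banding:  ", min_eig_naive)
print("smallest eigenvalue, banded Cholesky:", min_eig_factored)
```

In this toy case the naively banded matrix has a negative smallest eigenvalue, while the factored version stays PSD and keeps the same band. How closely the in-band correlations are preserved depends on how L is chosen, which is the part this sketch does not attempt.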