LD Matrix Approximations for Scalable Analysis of High-dimensional Genetic Data


Abstract

Linkage disequilibrium (LD) matrices are an essential part of many statistical genetics methods. However, their high dimensionality makes their computation and storage impractical for large genomic data. Common sparse approximations, such as banded matrices, come at the expense of losing the positive semi-definite (PSD) property, a critical quality that ensures numerical stability of many downstream analyses. Conversely, methods that guarantee a PSD approximation, like block-diagonal approaches, require coarse approximations of the LD structure. In this work, we present a novel method to approximate an LD matrix with a sparse, banded matrix that is guaranteed to be PSD while preserving the correlation structure within the band. This is done via a reformulation of the nearest correlation matrix problem using the Cholesky decomposition, which implicitly imposes the PSD property in a highly scalable parallel approach. On whole-chromosome data from the 1000 Genomes Project and the UK Biobank, our method builds sparse PSD approximations that are more accurate than either block-diagonal or shrinkage estimators.
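
The two structural claims in the abstract can be checked on a toy example. The NumPy sketch below is not the authors' algorithm and does not solve the nearest correlation matrix problem. It only illustrates, under assumptions of our own (a 6-variant equicorrelation matrix, a bandwidth of 1, and the hypothetical helper names `band_truncate` and `banded_cholesky_correlation`), that zeroing entries outside a band can produce negative eigenvalues, whereas any matrix written as L Lᵀ with a lower-banded factor L and unit-norm rows is banded, PSD, and has a unit diagonal by construction.

```python
import numpy as np


def band_truncate(R, bandwidth):
    """Zero every entry of R more than `bandwidth` positions off the diagonal."""
    idx = np.arange(R.shape[0])
    mask = np.abs(idx[:, None] - idx[None, :]) <= bandwidth
    return np.where(mask, R, 0.0)


def banded_cholesky_correlation(L_raw, bandwidth):
    """Return L @ L.T for a lower-banded, row-normalized version of L_raw.

    Hypothetical helper for illustration only: it shows the parameterization,
    not how the factor would be fitted to data.
    """
    idx = np.arange(L_raw.shape[0])
    diff = idx[:, None] - idx[None, :]
    # Keep the diagonal and the `bandwidth` sub-diagonals; drop everything else.
    L = np.where((diff >= 0) & (diff <= bandwidth), L_raw, 0.0)
    # Unit-norm rows give a unit diagonal in L @ L.T.
    L = L / np.linalg.norm(L, axis=1, keepdims=True)
    return L @ L.T


# Toy "LD matrix": 6 variants, all pairwise correlations 0.9 (PSD by construction).
n, rho, bandwidth = 6, 0.9, 1
R = np.full((n, n), rho)
np.fill_diagonal(R, 1.0)

# Naive banding keeps the in-band correlations but loses the PSD property.
banded = band_truncate(R, bandwidth)
print("min eigenvalue, original:", np.linalg.eigvalsh(R).min())
print("min eigenvalue, banded:  ", np.linalg.eigvalsh(banded).min())

# Any banded factor with unit rows yields a banded PSD correlation matrix.
rng = np.random.default_rng(0)
C = banded_cholesky_correlation(rng.normal(size=(n, n)), bandwidth)
print("min eigenvalue, L @ L.T: ", np.linalg.eigvalsh(C).min())
```

Because the PSD property comes from the form L Lᵀ rather than from a constraint checked afterwards, an optimizer can search over the in-band entries of L freely, which is the sense in which the abstract describes the property as implicitly imposed.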
