Fast Probabilistic Whitening Transformation for Ultra-High Dimensional Genetic Data
Abstract
Statistical methods often make assumptions about independence between the samples or features of a dataset. Yet correlation structure is ubiquitous in real data, so these assumptions are often not met in practice. Whitening transformations are widely applied to remove this correlation structure. Existing approaches to whitening are based on standard linear algebra, rather than a probabilistic model, and application to high dimensional datasets with n samples and p features is problematic as p approaches or exceeds n. Moreover, the computational time becomes prohibitive since the naive transform is cubic in p. Here we propose a probabilistic model for data whitening and examine its properties based on first principles as p increases. We demonstrate the statistical properties of the probabilistic model and derive a remarkably efficient algorithm that is linear instead of cubic time in the number of features. We examine the out-of-sample performance of the probabilistic whitening model on simulated data, as well as real gene expression and genotype data. In an application to impute z-statistics from unobserved genetic variants from a genome-wide association study of schizophrenia, the probabilistic whitening transformation, implemented in our open source R package decorrelate, had the lowest mean square error while being up to an order of magnitude faster than other methods.
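To make the cubic-versus-linear contrast concrete, the following is a minimal NumPy sketch of a shrinkage-regularized whitening transform that never forms the p × p covariance matrix. It is an illustration under stated assumptions, not the algorithm implemented in decorrelate: the function names `fit_whitener` and `whiten` are hypothetical, and the shrinkage weight `lam` is a hand-picked value, since the abstract does not describe how the probabilistic model sets its regularization. The covariance is assumed to be Σ = (1 − λ)S + λI, where S is the sample covariance; with a thin SVD of the centered data, Σ^(−1/2) can be applied in time linear in p.

```python
import numpy as np

def fit_whitener(X, lam=0.1):
    """Learn a whitening model from X (n samples x p features).

    Assumes Sigma = (1 - lam) * S + lam * I, with S the sample covariance.
    lam is a hypothetical, hand-picked shrinkage weight in (0, 1].
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    mu = X.mean(axis=0)
    # Thin SVD of the scaled, centered data: S = V diag(d**2) V^T.
    # Cost is O(n^2 p) when n < p, i.e. linear in the number of features.
    _, d, Vt = np.linalg.svd((X - mu) / np.sqrt(n - 1), full_matrices=False)
    # Eigenvalues of Sigma^(-1/2) inside the span of V.
    scale = 1.0 / np.sqrt((1.0 - lam) * d**2 + lam)
    return mu, Vt, scale, lam

def whiten(Y, model):
    """Apply Sigma^(-1/2) to the rows of Y without building a p x p matrix."""
    mu, Vt, scale, lam = model
    Yc = np.asarray(Y, dtype=float) - mu
    Z = Yc @ Vt.T                            # coordinates in the span of V
    inside = (Z * scale) @ Vt                # rescaled component inside the span
    outside = (Yc - Z @ Vt) / np.sqrt(lam)   # orthogonal complement has eigenvalue lam
    return inside + outside

# Usage on simulated data with more features than samples (p > n).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(20, 50))
X_new = rng.normal(size=(5, 50))
model = fit_whitener(X_train, lam=0.2)
X_new_white = whiten(X_new, model)
```

The split into a component inside the span of the right singular vectors and a component in its orthogonal complement is what avoids the p × p eigendecomposition: outside that span the regularized covariance acts as λI, so a single scalar rescaling suffices there.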