Fast Probabilistic Whitening Transformation for Ultra-High Dimensional Genetic Data

Gabriel E. Hoffman
Christian P. Dillard
Kiran Girdhar
Panos Roussos

Abstract

Statistical methods often assume independence between the samples or features of a dataset. Yet correlation structure is ubiquitous in real data, so these assumptions are often not met in practice. Whitening transformations are widely applied to remove this correlation structure. Existing approaches to whitening are based on standard linear algebra rather than a probabilistic model, and application to high-dimensional datasets with n samples and p features is problematic as p approaches or exceeds n. Moreover, the computational time becomes prohibitive since the naive transform is cubic in p. Here we propose a probabilistic model for data whitening and examine its properties based on first principles as p increases. We demonstrate the statistical properties of the probabilistic model and derive a remarkably efficient algorithm that is linear instead of cubic time in the number of features. We examine the out-of-sample performance of the probabilistic whitening model on simulated data and real genotype data. In an application to impute z-statistics for unobserved genetic variants in a genome-wide association study of schizophrenia, the probabilistic whitening transformation had the lowest mean square error while being up to an order of magnitude faster than other methods. Using this approach, we also identify tandem repeats that explain genetic regulatory signals for disease-relevant genes. Analyses are implemented in our novel open-source R packages decorrelate and imputez.
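
To make the cubic-versus-linear contrast concrete, the following is a minimal sketch of one generic way to whiten with a shrunk covariance estimate without ever forming a p x p matrix. It is not taken from the article or from the decorrelate package: the function names `fit_lowrank_whitener` and `whiten`, the shrinkage target (the identity matrix), and the hand-picked shrinkage weight `lam` are all assumptions made for illustration. The idea is that a thin SVD of the n x p training matrix costs O(n p min(n, p)), which is linear in p for fixed n, and that the inverse square root of the shrunk covariance can then be applied through the singular vectors.

```python
import numpy as np


def fit_lowrank_whitener(X, lam):
    """Fit a whitening model from an n x p training matrix X.

    Assumed model (illustrative only): Sigma = (1 - lam) * S + lam * I,
    where S is the sample covariance and lam in (0, 1] is a
    hand-picked shrinkage weight.
    """
    n, _ = X.shape
    mu = X.mean(axis=0)
    Xc = (X - mu) / np.sqrt(n - 1)  # so that Xc.T @ Xc equals S
    # Thin SVD: no p x p matrix is formed.
    _, d, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Eigenvalues of Sigma inside the span of the right singular vectors;
    # outside that span every eigenvalue equals lam.
    ev = (1.0 - lam) * d**2 + lam
    return mu, Vt, ev, lam


def whiten(Y, model):
    """Apply Sigma^(-1/2) to the rows of Y (m x p) in O(m * p * k) time."""
    mu, Vt, ev, lam = model
    Yc = Y - mu
    proj = Yc @ Vt.T  # coordinates in the low-rank subspace, m x k
    # Correct the subspace part, then scale everything by lam^(-1/2).
    scale = ev**-0.5 - lam**-0.5
    return (proj * scale) @ Vt + Yc * lam**-0.5


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 500))   # n = 20 samples, p = 500 features
    model = fit_lowrank_whitener(X, lam=0.1)
    Z = whiten(rng.normal(size=(5, 500)), model)
    print(Z.shape)
```

The result matches what a dense eigendecomposition of the p x p shrunk covariance would give, but the work scales with p rather than p cubed. How the shrinkage weight should be estimated is left open in this sketch.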

Version published to 10.1101/2025.09.01.673591 on bioRxiv
Sep 4, 2025

Related articles

Reframing Population Genetic Structure as a Quantum Optimization Problem

This article has 1 author:
1. Andrew Davinack
This article has no evaluations. Latest version: Dec 24, 2025

Nonparametric Learning of Covariate-based Markov Jump Processes Using RKHS Techniques

This article has 3 authors:
1. Yuchen Han
2. Riten Mitra
3. Arnab Ganguly
This article has no evaluations. Latest version: Dec 17, 2025

Fast uncertainty quantification in EZ cognitive models

This article has 2 authors:
1. Joachim Vandekerckhove
2. Elizabeth Fox
This article has no evaluations. Latest version: Jan 7, 2026
