Unsupervised learning of multi-omics data enables disease risk prediction in the UK Biobank
Abstract
The size and complexity of biomedical datasets continue to grow, driving the development of methods that reduce dimensionality while preserving biological signals. Yet, when deep learning is applied to such data, the impact of preprocessing choices and dataset properties on model behavior is often overlooked. Here, we applied our framework, Multi-Omics Variational autoEncoder (MOVE), to multi-omics data from 452,026 UK Biobank participants, aiming both to evaluate the power of the learned representations for disease risk prediction and to critically analyze how non-biological factors, such as dataset properties and preprocessing decisions, can shape the results. We show that reducing the dimensionality of the data by a factor of 80 still yields comparable prediction performance across 15 different diseases. We further demonstrate how dataset properties and preprocessing choices affect model performance, the latent representation, and downstream results. Our findings strongly underline the need for thorough analysis and understanding of a model’s behavior before drawing conclusions from its results.