Dimensionality reduction of genetic data using contrastive learning
This article has been Reviewed by the following groups
Listed in
- Evaluated articles (Arcadia Science)
Abstract
We introduce a framework for using contrastive learning for dimensionality reduction on genetic datasets to create principal component analysis (PCA)-like population visualizations. Contrastive learning is a self-supervised deep learning method that uses similarities between samples to train the neural network to discriminate between samples. Many of the advances in these types of models have been made for computer vision, but some common methodology does not translate well from image to genetic data. We define a loss function that outperforms loss functions commonly used in contrastive learning, and a data augmentation scheme tailored specifically towards SNP genotype datasets. We compare the performance of our method to PCA and contemporary nonlinear methods with respect to how well they preserve local and global structure, and how well they generalize to new data. Our method displays good preservation of global structure and has improved generalization properties over t-distributed stochastic neighbor embedding, Uniform Manifold Approximation and Projection, and popvae, while preserving relative distances between individuals to a high extent. A strength of the deep learning framework is the possibility of projecting new samples and fine-tuning to new datasets using a pretrained model without access to the original training data, and the ability to incorporate more domain-specific information in the model. We show examples of population classification on two datasets of dog and human genotypes.
Article activity feed
Discussion
You have chosen to validate this approach against t-SNE and PCA, both of which, as you point out, consider loci independently and shouldn't capture linkage or higher-order relationships between alleles during the compression. However, the autoencoder frameworks you mention should capture these relationships. Have you directly compared your approach to an autoencoder framework using the same metrics you use for comparison to PCA and t-SNE?
convolution
It seems entirely reasonable to reduce the complexity and number of parameters in these early layers using convolution, since you expect correlated structure between proximal polymorphisms (linkage). However, unlike images, where immediate physical proximity (the number of pixels away) is proportional to shared information, genetic distance is usually not directly proportional to physical distance and can vary dramatically across the genome. For many species, though, we know the genetic map and thus the relationship between genetic distance and physical distance. So I wonder if you have considered an architecture (such as a graph neural network) that could capture genetic distance and build convolutions based on it?
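To make the suggestion concrete: a minimal sketch of the idea, with hypothetical genetic-map positions and random stand-in weights (nothing here comes from the article's actual architecture). Instead of a fixed-width convolution over physically adjacent SNPs, edges are weighted by genetic distance in centimorgans, so tightly linked SNPs aggregate more strongly regardless of their physical spacing.

```python
import numpy as np

# Hypothetical genetic-map positions (centimorgans) for 5 SNPs.
# In practice these would come from a species' recombination map.
cm_pos = np.array([0.0, 0.1, 0.5, 2.0, 2.1])

# Soft adjacency: edge weight decays with genetic distance, so SNPs in
# tight linkage (small cM gap) share more information than distant ones.
dist = np.abs(cm_pos[:, None] - cm_pos[None, :])
adj = np.exp(-dist)
adj /= adj.sum(axis=1, keepdims=True)  # row-normalize the aggregation

# One graph-convolution-style layer: each SNP mixes genotype signal from
# its genetically close neighbors, then a learned linear map lifts the
# per-SNP value to hidden channels (random weights stand in for training).
rng = np.random.default_rng(0)
genotypes = rng.integers(0, 3, size=(4, 5)).astype(float)  # 4 samples x 5 SNPs
W = rng.normal(size=(1, 8))  # per-SNP scalar -> 8 hidden channels

h = genotypes @ adj.T          # neighborhood aggregation over SNPs
features = h[..., None] @ W    # shape (4 samples, 5 SNPs, 8 channels)
```

The same adjacency could also be thresholded into a sparse graph and fed to a standard GNN library; the key design choice is only that neighborhoods are defined in genetic-map units rather than base pairs or pixel offsets.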