Dimensionality Reduction of Genetic Data using Contrastive Learning


Abstract

We introduce a framework that uses contrastive learning for dimensionality reduction on genetic datasets to create PCA-like population visualizations. Contrastive learning is a self-supervised deep learning method that uses similarities between samples to train a neural network to discriminate between them. Many of the advances in these models have been made in computer vision, but many of the heuristics developed there do not translate well from image data to genetic data. We define a loss function that, in our experiments, outperforms other basic contrastive-learning loss functions, and a data augmentation scheme tailored specifically to SNP genotype datasets.
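The paper's own loss function and augmentation scheme are not reproduced here; as a point of reference, the general recipe can be sketched with a standard NT-Xent contrastive loss and a simple random-masking augmentation for 0/1/2-coded SNP genotypes. The function names, the sentinel value `-1` for masked genotypes, and the `mask_rate`/`temperature` defaults are illustrative assumptions, not the paper's choices.

```python
import numpy as np

def augment_genotypes(g, mask_rate=0.1, rng=None):
    """Simple SNP augmentation (assumption, not the paper's scheme):
    randomly mask 0/1/2 genotypes, setting masked entries to -1."""
    rng = rng or np.random.default_rng()
    g = g.copy()
    mask = rng.random(g.shape) < mask_rate
    g[mask] = -1
    return g

def nt_xent_loss(z1, z2, temperature=0.5):
    """Standard NT-Xent loss over two batches of embeddings, where
    row i of z1 and row i of z2 are two augmented views of one sample."""
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)                     # exclude self-pairs
    n = z1.shape[0]
    # positive pair of row i is row i+n (and vice versa)
    targets = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    pos = sim[np.arange(2 * n), targets]
    return float(np.mean(logsumexp - pos))
```

In practice the loss is computed on the outputs of the encoder network for two augmented views of each genotype batch; the loss is non-negative and approaches zero as positive pairs become more similar than all negatives.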

In our experiments, our methods outperform PCA in terms of population classification and are on par with t-SNE, while generalizing better to unseen and missing data. Strengths of the deep learning framework are the ability to project new samples using a trained model and the ability to incorporate more domain-specific information into the model. We show examples of population classification on two datasets of dog and human genotypes.

Article activity feed

  1. Discussion

    You have chosen to validate this approach against t-SNE and PCA, both of which, as you point out, consider loci independently and should not capture linkage or higher-order relationships between alleles during compression. However, the autoencoder frameworks you have mentioned should capture these relationships. Have you directly compared your approach to an autoencoder framework using the same metrics you use for comparison to PCA and t-SNE?

  2. convolution

    It seems totally reasonable to reduce the complexiy and parameter number in these early layers using convolution since you expect correlated structure between proximal polymorphisms (linkage). However, unlike images where the immediate physical proximity (number of pixels away you are) is proportional to the shared information, genetic distance is usually not directly proportional to physical distance and can vary dramatically across the genome. However, for many speceis, we know the genetic map and thus the relationship between genetic distance and physical distance. So, I wonder if you have considered an architecture (such as a graph neural network) that could capture the genetic distance and create convolutions based on this ?