Dimensionality reduction of genetic data using contrastive learning
This article has been Reviewed by the following groups
Listed in
- Evaluated articles (Arcadia Science)
Abstract
We introduce a framework for using contrastive learning for dimensionality reduction on genetic datasets to create principal component analysis (PCA)-like population visualizations. Contrastive learning is a self-supervised deep learning method that uses similarities between samples to train the neural network to discriminate between samples. Many of the advances in these types of models have been made for computer vision, but some common methodology does not translate well from image to genetic data. We define a loss function that outperforms loss functions commonly used in contrastive learning, and a data augmentation scheme tailored specifically towards SNP genotype datasets. We compare the performance of our method to PCA and contemporary nonlinear methods with respect to how well they preserve local and global structure, and how well they generalize to new data. Our method displays good preservation of global structure and has improved generalization properties over t-distributed stochastic neighbor embedding, Uniform Manifold Approximation and Projection, and popvae, while preserving relative distances between individuals to a high extent. A strength of the deep learning framework is the possibility of projecting new samples and fine-tuning to new datasets using a pretrained model without access to the original training data, and the ability to incorporate more domain-specific information in the model. We show examples of population classification on two datasets of dog and human genotypes.
Article activity feed
Discussion
You have chosen to validate this approach against t-SNE and PCA, both of which, as you point out, consider loci independently and shouldn't capture linkage or higher-order relationships between alleles during the compression. However, the autoencoder frameworks you mention should capture these relationships. Have you directly compared your approach to an autoencoder framework using the same metrics you use for comparison to PCA and t-SNE?
convolution
It seems entirely reasonable to reduce the complexity and number of parameters in these early layers using convolution, since you expect correlated structure between proximal polymorphisms (linkage). However, unlike images, where immediate physical proximity (the number of pixels away) is proportional to shared information, genetic distance is usually not directly proportional to physical distance and can vary dramatically across the genome. For many species, though, we know the genetic map and thus the relationship between genetic distance and physical distance. So I wonder if you have considered an architecture (such as a graph neural network) that could capture genetic distance and build convolutions based on it?
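To make the suggestion concrete: a minimal sketch of the idea, with hypothetical genetic-map positions and random stand-in weights (nothing here comes from the article's actual architecture). Instead of a fixed-width convolution over physically adjacent SNPs, edges are weighted by genetic distance in centimorgans, so tightly linked SNPs aggregate more strongly regardless of their physical spacing.

```python
import numpy as np

# Hypothetical genetic-map positions (centimorgans) for 5 SNPs.
# In practice these would come from a species' recombination map.
cm_pos = np.array([0.0, 0.1, 0.5, 2.0, 2.1])

# Soft adjacency: edge weight decays with genetic distance, so SNPs in
# tight linkage (small cM gap) share more information than distant ones.
dist = np.abs(cm_pos[:, None] - cm_pos[None, :])
adj = np.exp(-dist)
adj /= adj.sum(axis=1, keepdims=True)  # row-normalize the aggregation

# One graph-convolution-style layer: each SNP mixes genotype signal from
# its genetically close neighbors, then a learned linear map lifts the
# per-SNP value to hidden channels (random weights stand in for training).
rng = np.random.default_rng(0)
genotypes = rng.integers(0, 3, size=(4, 5)).astype(float)  # 4 samples x 5 SNPs
W = rng.normal(size=(1, 8))  # per-SNP scalar -> 8 hidden channels

h = genotypes @ adj.T          # neighborhood aggregation over SNPs
features = h[..., None] @ W    # shape (4 samples, 5 SNPs, 8 channels)
```

The same adjacency could also be thresholded into a sparse graph and fed to a standard GNN library; the key design choice is only that neighborhoods are defined in genetic-map units rather than base pairs or pixel offsets.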