Biologically informed variational autoencoders allow predictive modeling of genetic and drug-induced perturbations

Abstract

Motivation

Variational autoencoders (VAEs) have rapidly gained popularity in biological applications and have already been applied successfully to many omic datasets. Their latent space provides a low-dimensional representation of the input data, and VAEs have been used, for example, to cluster single-cell transcriptomic data. However, because VAEs are non-linear, the patterns they learn in the latent space remain obscure, and the lower-dimensional embedding cannot be related directly to the input features.

Results

To shed light on the inner workings of VAEs and enable direct interpretability of the model through its structure, we designed a novel VAE, OntoVAE (Ontology guided VAE), that can incorporate any ontology in its latent space and decoder and thus provide pathway or phenotype activities for the ontology terms. In this work, we demonstrate that OntoVAE can be applied to predictive modeling and show that it can predict the effects of genetic or drug-induced perturbations using different ontologies and both bulk and single-cell transcriptomic datasets. Finally, we provide a flexible framework that can easily be adapted to any ontology and dataset.
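
The idea of wiring an ontology into a decoder can be illustrated with a minimal sketch. It is not the OntoVAE implementation or its API. The toy ontology, the gene and term names, and the helper functions `build_mask` and `masked_decode` are all hypothetical, and silencing a term activity stands in for a perturbation only to show how a binary mask restricts which genes a term can influence.

```python
import random

# Hypothetical toy ontology: term -> genes annotated to it (not from the paper).
TOY_ONTOLOGY = {
    "term_A": ["gene1", "gene2"],
    "term_B": ["gene2", "gene3"],
}
GENES = ["gene1", "gene2", "gene3", "gene4"]
TERMS = list(TOY_ONTOLOGY)


def build_mask(ontology, terms, genes):
    """Return a genes x terms binary matrix: 1.0 where the gene is annotated to the term."""
    return [[1.0 if g in ontology[t] else 0.0 for t in terms] for g in genes]


def masked_decode(term_activities, weights, mask):
    """One linear decoder step in which masked-out weights contribute nothing."""
    return [
        sum(w * m * a for w, m, a in zip(w_row, m_row, term_activities))
        for w_row, m_row in zip(weights, mask)
    ]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
mask = build_mask(TOY_ONTOLOGY, TERMS, GENES)
# Random positive weights stand in for trained decoder parameters (assumption).
weights = [[rng.uniform(0.1, 1.0) for _ in TERMS] for _ in GENES]

baseline = masked_decode([1.0, 1.0], weights, mask)   # both terms active
perturbed = masked_decode([0.0, 1.0], weights, mask)  # term_A silenced
```

Because every connection follows an annotation, a change in one term's activity propagates only to the genes annotated to that term, which is what makes each node readable as a pathway or phenotype activity. In this toy setup, `gene4` has no annotation and is never reconstructed, and silencing `term_A` leaves `gene3` unchanged.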

Availability and implementation

OntoVAE is available as a Python package at https://github.com/hdsu-bioquant/onto-vae.

Article activity feed

  1. The resulting top terms after this trimming define the latent space.

    What is the expected distribution of weights in the latent space? Would a discriminator network that imposes different distributions be useful here?

  2. To verify the validity of these predictions, we performed a gene-set enrichment analysis (GSEA) using as a ground truth the differentially expressed genes in a recently published dataset of bulk RNA-seq carried out on muscle samples from LGMD patients (n=16) and healthy individuals (n=15) [25], where we had determined the genes that were significantly up- (LGMD_up) or downregulated (LGMD_dn) in patients compared to age-matched controls (Supplementary Table 3).

    The significance of a difference in gene expression is related to the size of the effect on expression, but many genes that influence a phenotype may show only small changes in expression level. How well does this model deal with such genes? Would it miss genes that show small changes in expression but are nevertheless important?
