Ancestral Sequences Cannot be Accurately Reconstructed via Interpolation in a Variational Autoencoder’s Latent Space

Abstract

Standard methods for ancestral sequence reconstruction (ASR) rely on substitution models for the residues in a biological sequence and assume independent evolution across these sites, ignoring the epistatic interactions that shape molecular evolution. In contrast, deep learning models like variational autoencoders (VAEs) can learn low-dimensional representations (“embeddings”) of sequences in a protein family that may implicitly handle these dependencies, raising the possibility of performing more accurate ASR by interpolating between extant sequence embeddings within the VAE’s latent space. In this study, we test this hypothesis by developing and evaluating a VAE-based ASR pipeline. Benchmarking this approach against established likelihood-based and parsimony methods using various simulations of protein evolution, including scenarios with and without epistasis, we find that the VAE-based approach is consistently and significantly outperformed by standard methods, even in epistatic regimes where it was hypothesized to have an advantage. We further show that this failure is not due to a lack of phylogenetic signal in the latent space, which does recapitulate evolutionary structure. Rather, the primary limitation is the information loss inherent to the autoencoding process: the VAE’s decoder cannot reconstruct sequences with sufficient fidelity for the precise demands of ASR.
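The latent-space interpolation idea tested in this study can be sketched as follows: encode two extant sequences, take a point between their embeddings, and decode it back to a sequence. The decoder below is a toy linear stand-in with random weights, purely illustrative; it is not the paper's trained VAE, and the names (`decode`, `interpolate_ancestor`) are hypothetical.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

rng = np.random.default_rng(0)
SEQ_LEN, LATENT_DIM = 8, 4

# Toy stand-in for a trained VAE decoder: a fixed linear map from the
# latent space to per-site amino-acid logits (hypothetical weights).
W = rng.normal(size=(LATENT_DIM, SEQ_LEN, len(AMINO_ACIDS)))

def decode(z):
    """Decode a latent vector to a sequence via per-site argmax over logits."""
    logits = np.einsum("d,dsa->sa", z, W)  # shape: (SEQ_LEN, 20)
    return "".join(AMINO_ACIDS[i] for i in logits.argmax(axis=1))

def interpolate_ancestor(z_a, z_b, alpha=0.5):
    """Propose an 'ancestor' by decoding a point on the line between
    two extant-sequence embeddings (alpha=0.5 is the midpoint)."""
    return decode((1 - alpha) * z_a + alpha * z_b)

# Embeddings of two extant sequences (here drawn at random for illustration;
# in the pipeline they would come from the VAE encoder).
z_a, z_b = rng.normal(size=LATENT_DIM), rng.normal(size=LATENT_DIM)
ancestor = interpolate_ancestor(z_a, z_b)
```

The study's finding is that even when the latent geometry reflects the phylogeny, the decoding step in a sketch like this loses too much sequence-level information for site-exact reconstruction.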
