Learning the Language of Phylogeny with MSA Transformer

Abstract

Classical phylogenetics assumes site independence, potentially overlooking epistasis. Protein language models capture dependencies in conserved structural and functional domains across the protein universe. Here, we ask whether MSA Transformer, which takes a multiple sequence alignment (MSA) as input, captures evolutionary distance, and to what extent its representations reflect epistasis in protein sequence evolution; neither is explicitly available during training. Systematic shuffling of natural and simulated MSAs demonstrates that the model exploits column-wise conservation to distinguish phylogenetic relationships. Using internal embeddings, we reconstruct trees that are markedly consistent with trees generated by maximum likelihood inference. Applying this approach to both the RNA-dependent RNA polymerase of RNA viruses and the nucleo-cytoplasmic large DNA virus domain, we recover both established and novel evolutionary relationships. We conclude that MSA Transformer complements, rather than replaces, classical inference for more accurate histories of protein families.
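
The abstract does not say which shuffling schemes were used, so the following is only a minimal sketch of two controls that such an experiment might contrast. The function names and the toy alignment are hypothetical and are not taken from the study. Shuffling residues within each column keeps per-column composition (and therefore conservation) intact while scrambling which sequence carries which residue. Shuffling residues within each row keeps each sequence's composition but destroys column-wise conservation.

```python
import random


def shuffle_within_columns(msa, seed=0):
    """Permute residues independently inside each alignment column.

    Per-column residue counts are preserved; the association between
    a residue and the sequence (row) it came from is scrambled.
    """
    if len({len(seq) for seq in msa}) > 1:
        raise ValueError("all aligned sequences must have the same length")
    rng = random.Random(seed)
    columns = [list(col) for col in zip(*msa)]
    for col in columns:
        rng.shuffle(col)
    # Transpose back from columns to rows.
    return ["".join(row) for row in zip(*columns)]


def shuffle_within_rows(msa, seed=0):
    """Permute residues independently inside each sequence.

    Per-sequence composition is preserved; column-wise conservation is lost.
    """
    rng = random.Random(seed)
    shuffled = []
    for seq in msa:
        chars = list(seq)
        rng.shuffle(chars)
        shuffled.append("".join(chars))
    return shuffled


def column_composition(msa):
    """Return the sorted residues of every column, for comparing alignments."""
    return [sorted(col) for col in zip(*msa)]


# Toy alignment (hypothetical, for illustration only).
msa = ["MKTAYIA", "MKTAYLA", "MRTGYIA", "MRSGYLV"]
col_shuffled = shuffle_within_columns(msa)
row_shuffled = shuffle_within_rows(msa)
```

Comparing `column_composition(col_shuffled)` with `column_composition(msa)` confirms that the first control leaves conservation untouched, which is the property the abstract reports the model relying on.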
