Learning the Language of Phylogeny with MSA Transformer
Abstract
Classical phylogenetics assumes site independence, potentially overlooking epistasis. Protein language models capture dependencies in conserved structural and functional domains across the protein universe. Here, we ask whether MSA Transformer, which takes a multiple sequence alignment (MSA) as input, captures evolutionary distance, and to what extent its representations reflect epistasis in protein sequence evolution. Neither signal is explicitly available to the model during training. Systematic shuffling of natural and simulated MSAs demonstrates that the model exploits column-wise conservation to distinguish phylogenetic relationships. Using internal embeddings, we reconstruct trees that are markedly consistent with trees generated by maximum likelihood inference. Applying this approach to both the RNA-dependent RNA polymerase of RNA viruses and the nucleo-cytoplasmic large DNA virus domain, we recover both established and novel evolutionary relationships. We conclude that MSA Transformer complements, rather than replaces, classical inference for more accurate histories of protein families.
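The abstract does not specify which shuffling schemes were applied to the MSAs. As an illustration only, the sketch below shows two shuffles one might plausibly use to probe what an alignment-based model relies on; both function names and the toy alignment are hypothetical and are not taken from the preprint. Shuffling residues within each column keeps every column's composition (and hence its conservation) but scrambles which sequence carries which residue. Shuffling the order of columns keeps every pairwise Hamming distance between sequences but destroys positional context.

```python
import random
from collections import Counter


def shuffle_within_columns(msa, seed=0):
    """Permute residues independently within each column (hypothetical helper).

    Column composition is preserved; row identity is scrambled.
    """
    rng = random.Random(seed)
    n_cols = len(msa[0])
    if any(len(row) != n_cols for row in msa):
        raise ValueError("all aligned sequences must have the same length")
    columns = [[row[j] for row in msa] for j in range(n_cols)]
    for col in columns:
        rng.shuffle(col)
    return ["".join(col[i] for col in columns) for i in range(len(msa))]


def shuffle_column_order(msa, seed=0):
    """Apply one random permutation of column positions to every row (hypothetical helper).

    Pairwise Hamming distances are preserved; positional order is scrambled.
    """
    rng = random.Random(seed)
    order = list(range(len(msa[0])))
    rng.shuffle(order)
    return ["".join(row[j] for j in order) for row in msa]


def column_counts(msa):
    """Residue counts for each alignment column."""
    return [Counter(row[j] for row in msa) for j in range(len(msa[0]))]


def hamming(a, b):
    """Number of aligned positions at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b))


# Toy alignment, invented for illustration (not data from the preprint).
toy_msa = ["MKV-LA", "MKI-LA", "MRVQLS", "MRIQLS"]
within = shuffle_within_columns(toy_msa, seed=1)
reordered = shuffle_column_order(toy_msa, seed=1)
```

Comparing a model's output on `toy_msa`, `within`, and `reordered` would indicate, under these assumptions, whether it depends on per-column composition, on which residues co-occur in a sequence, or on column order.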