Do Protein Language Models Learn Phylogeny?
Abstract
Deep machine learning can uncover evolutionary relationships directly from protein sequences, in effect internalising notions inherent to classical phylogenetic tree inference. We connect these two paradigms by assessing the capacity of protein language models (pLMs) to discern phylogenetic relationships without being explicitly trained to do so. We evaluate ESM2, ProtTrans and MSA-Transformer against classical phylogenetic methods across 114 Pfam datasets, while also considering sequence insertions and deletions (indels). The largest ESM2 model tends to outperform other pLMs (including the multimodal ESM3) in recovering phylogenetic relationships among homologous protein sequences in both low- and high-gap settings. pLMs generally agree with conventional phylogenetic methods, and more so for protein families with fewer implied indels, highlighting indels as a key factor differentiating classical phylogenetics from pLMs. We find that pLMs preferentially capture broader rather than finer evolutionary relationships within a specific protein family, and that ESM2 has a sweet spot for highly divergent sequences at remote evolutionary distances. Fewer than 10% of neurons are sufficient to broadly recapitulate classical phylogenetic distances; when these neurons are used in isolation, the difference between the paradigms diminishes further. We show that these neurons are polysemantic: they are shared among different homologous families but never fully overlap. We highlight the potential of ESM2 as a complementary tool for phylogenetic analysis, especially when extending to remote homologs that are difficult to align and imply complex histories of insertions and deletions.
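As a rough illustration of what agreement between the two paradigms could mean in practice, the sketch below compares pairwise distances between per-sequence embedding vectors with pairwise distances read from a phylogenetic tree, using a rank correlation. The abstract does not state which distance or agreement measure the authors use, so the choice of Euclidean distance and Spearman correlation is an assumption, as are the function names and the toy numbers, which are invented for illustration only.

```python
import math
from itertools import combinations


def pairwise_euclidean(embeddings):
    """Euclidean distance for every unordered pair of embedding vectors."""
    return [math.dist(a, b) for a, b in combinations(embeddings, 2)]


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical toy data: four sequences with 2-dimensional "embeddings".
embeddings = [[0.0, 0.0], [1.0, 0.0], [4.0, 0.0], [9.0, 0.0]]

# Hypothetical tree (patristic) distances, in the same pair order as
# itertools.combinations: (0,1), (0,2), (0,3), (1,2), (1,3), (2,3).
tree_distances = [0.1, 0.5, 1.2, 0.4, 1.0, 0.7]

embedding_distances = pairwise_euclidean(embeddings)
rho = spearman(embedding_distances, tree_distances)
print(f"Spearman correlation between embedding and tree distances: {rho:.3f}")
```

A rank correlation is used here because embedding distances and tree branch lengths are on different scales, so only their ordering is compared. In a real analysis the embeddings would come from a pLM and the reference distances from an inferred tree.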