Evolutionary Reasoning Does Not Arise in Standard Usage of Protein Language Models


Abstract

Protein language models (PLMs) are often assumed to capture evolutionary information by training on large protein sequence datasets. Yet it remains unclear whether PLMs can reason about evolution—that is, infer evolutionary relationships between sequences. We test this capability by evaluating whether standard PLM usage, frozen or fine-tuned embeddings with distance-based comparison, supports evolutionary reasoning. Existing PLMs consistently fail to recover phylogenetic structure, despite strong performance on sequence-level tasks such as masked-token and contact prediction. We present Phyla, a hybrid state-space and transformer model that jointly processes multiple sequences and is trained using a tree-based objective across 3,000 phylogenies spanning diverse protein families. Phyla outperforms the next-best PLM by 9% on tree reconstruction and 23% on taxonomic clustering while remaining alignment- and guide-tree-free. Although classical alignment pipelines achieve higher absolute accuracy, Phyla narrows the gap and achieves markedly lower end-to-end runtime. Applied to real data, Phyla reconstructs biologically accurate clades in the tree of life and resolves genome-scale relationships among Mycobacterium tuberculosis isolates. These findings suggest that, under standard usage, evolutionary reasoning does not reliably emerge from large-scale sequence modeling. Instead, Phyla shows that models trained with phylogenetic supervision can reason about evolution more effectively, offering a biologically grounded path toward evolutionary foundation models.

Article activity feed

  1. For a set of n sequences with predicted pairwise distances D_pred ∈ ℝ^(n×n) and true distances D_true ∈ ℝ^(n×n), we sample a set of quartets 𝒬 = {(i, j, k, ℓ)}. For each quartet, we compute three possible pairwise distance sums:

    How many quartet subsets are sampled per observation (tree) during training? Does this depend on the size of the tree?
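    For reference, the three sums correspond to the three possible unrooted quartet topologies, and by the four-point condition the smallest sum identifies the topology supported by the distances. A minimal sketch of that computation (function names and the uniform sampling scheme are illustrative assumptions, not necessarily the paper's exact training procedure):

    ```python
    import random

    def quartet_sums(D, i, j, k, l):
        """The three pairwise distance sums for a quartet (i, j, k, l),
        one per possible unrooted topology: ij|kl, ik|jl, il|jk."""
        return (D[i][j] + D[k][l],
                D[i][k] + D[j][l],
                D[i][l] + D[j][k])

    def quartet_topology(D, i, j, k, l):
        """Index (0, 1, or 2) of the topology favored by the distances:
        by the four-point condition, the one with the smallest sum."""
        sums = quartet_sums(D, i, j, k, l)
        return min(range(3), key=lambda t: sums[t])

    def sample_quartets(n, m, rng=random):
        """Uniformly sample m quartets of distinct indices from n taxa
        (an assumption; the paper may sample differently)."""
        return [tuple(rng.sample(range(n), 4)) for _ in range(m)]

    # Additive distances on the quartet tree ((a,b),(c,d)), unit branches:
    # the ab|cd sum (2 + 2) is smallest, so topology index 0 is recovered.
    D = [[0, 2, 3, 3],
         [2, 0, 3, 3],
         [3, 3, 0, 2],
         [3, 3, 2, 0]]
    ```

    A training loss could then compare `quartet_topology` under D_pred against D_true over the sampled set 𝒬, which is why the sampling budget per tree (fixed, or scaled with tree size?) matters for how much of the topology the objective actually constrains.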

  2. The input to Phyla is S with a [CLS] token concatenated in front of each tokenized sequence, s ∈ S: {[CLS] s1 ∥ [CLS] s2 ∥ [CLS] s3 ∥ … ∥ [CLS] sn}.

    So the sequences of a tree are concatenated together, and this concatenated token sequence is what the model operates on. I have two questions about this:

    How are the sequence embeddings calculated from the model output? Mean-pooling over the sequence's token positions, taking the [CLS] token position, or something else? It would be nice to have this information in the text.

    Are the sequence embeddings invariant with respect to concatenation order? Since the concatenation order has no biological meaning, this seems important to demonstrate.
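    Both questions could be probed empirically with a small amount of code. The sketch below assumes the model returns a (total_tokens × d) embedding matrix for the concatenated input; the two pooling options and the permutation probe are hypothetical illustrations, not Phyla's documented behavior:

    ```python
    import numpy as np

    def sequence_embeddings(token_embeddings, seq_lengths, pooling="cls"):
        """Split the embedding matrix of a concatenated
        [CLS]s1 ∥ [CLS]s2 ∥ ... input back into one embedding per sequence.

        token_embeddings: (total_tokens, d) model output.
        seq_lengths: tokenized lengths of s1..sn, excluding [CLS] tokens.
        pooling: "cls" takes each sequence's [CLS] position;
                 "mean" averages over its residue positions.
        """
        embs, pos = [], 0
        for length in seq_lengths:
            block = token_embeddings[pos:pos + 1 + length]  # [CLS] + residues
            embs.append(block[0] if pooling == "cls" else block[1:].mean(axis=0))
            pos += 1 + length
        return np.stack(embs)

    def order_invariance_gap(embed_fn, seqs, rng):
        """Largest absolute difference between embeddings computed from the
        original and a permuted concatenation order (0 iff invariant)."""
        perm = rng.permutation(len(seqs))
        original = embed_fn(seqs)
        permuted = embed_fn([seqs[i] for i in perm])
        return np.abs(original - permuted[np.argsort(perm)]).max()
    ```

    Reporting `order_invariance_gap` (with `embed_fn` wrapping the actual model) over a few random permutations would directly answer the second question.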

  3. highlighting the challenge of this dataset

    It's also notable that Phyla just barely outperforms the Hamming-distance baseline on the TreeBASE dataset but substantially outperforms it on TreeFam. Might this be related to the inherent differences between reconstructing species vs. gene family trees (species trees are often estimated over sets of gene trees)? It could be worthwhile to consider species and gene tree reconstruction as different classes of tasks, one likely much harder than the other.

  4. Robinson-Foulds metric

    Since Robinson-Foulds has some well-known limitations, it would be useful to compare it to other distance metrics (e.g. quartet distance).
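    For small trees, both metrics are cheap to compute directly, so adding the comparison would be little work. A self-contained sketch (trees as nested tuples of leaf names; the quartet distance is brute-force O(n⁴), not the efficient algorithms used in practice):

    ```python
    import itertools

    def splits(tree, taxa):
        """Nontrivial bipartitions of a tree given as a nested tuple of
        leaf names; each split is stored as its lexicographically smaller side."""
        found = set()
        def walk(node):
            if isinstance(node, tuple):
                clade = frozenset().union(*map(walk, node))
            else:
                clade = frozenset([node])
            if 1 < len(clade) < len(taxa) - 1:  # both sides have >= 2 leaves
                found.add(min(clade, taxa - clade, key=sorted))
            return clade
        walk(tree)
        return found

    def rf_distance(t1, t2, taxa):
        """Robinson-Foulds: bipartitions present in exactly one of the trees."""
        return len(splits(t1, taxa) ^ splits(t2, taxa))

    def induced_quartet(splts, taxa, q):
        """Topology (0: ij|kl, 1: ik|jl, 2: il|jk) the tree induces on
        quartet q, or None if unresolved."""
        i, j, k, l = q
        for a, b, c, d, topo in [(i, j, k, l, 0), (i, k, j, l, 1), (i, l, j, k, 2)]:
            for s in splts:
                side = s if a in s else taxa - s
                if {a, b} <= side and not ({c, d} & side):
                    return topo
        return None

    def quartet_distance(t1, t2, taxa):
        """Quartets on which the two trees induce different topologies."""
        s1, s2 = splits(t1, taxa), splits(t2, taxa)
        return sum(induced_quartet(s1, taxa, q) != induced_quartet(s2, taxa, q)
                   for q in itertools.combinations(sorted(taxa), 4))
    ```

    For large trees one would instead reach for dedicated tooling (e.g. tqDist for quartet distances), but a sketch like this suffices to check whether the two metrics rank methods differently on the benchmark trees.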

  5. The primary limitation of tree reconstruction is runtime inefficiency as tree sizes grow

    It might be worth noting that another limitation of tree reconstruction is that it's inferential and probabilistic; actual "ground truth" relationships are never available. Trees are estimates of relationships given a set of parameters and assumptions. Different methods will estimate relationships differently.