Protein language models are biased by unequal sequence sampling across the tree of life

Abstract

Protein language models (pLMs) trained on large protein sequence databases have been used to understand disease and design novel proteins. In design tasks, the likelihood of a protein sequence under a pLM is often used as a proxy for protein fitness, so it is critical to understand what signals likelihoods capture. In this work we find that pLM likelihoods unintentionally encode a species bias: likelihoods of protein sequences from certain species are systematically higher, independent of the protein in question. We quantify this bias and show that it arises in large part because of unequal species representation in popular protein sequence databases. We further show that the bias can be detrimental for some protein design applications, such as enhancing thermostability. These results highlight the importance of understanding and curating pLM training data to mitigate biases and improve protein design capabilities in under-explored parts of sequence space.

Article activity feed

  1. where n_j is the raw sequence count for species j, d(i, j) is the time to last common ancestor between species i and j collected from the TimeTree of Life resource (Kumar et al., 2022), and α ∈ ℝ≥0 is a hyperparameter used to scale d appropriately. Under the assumption that mutations occur at a fixed rate, [inline expression] gives the expected overlap in sequence between two species’ orthologs, to approximate the effective sequence counts they contribute to each other (footnote 4).

    It's great that even with the use of fixed rates you see a substantial increase in the fraction of bias explained. Since mutation rates obviously do vary, I wonder just how much better you might do using a model that doesn't explicitly fix them...

  2. Under the assumption that mutations occur at a fixed rate, [inline expression] gives the expected overlap in sequence between two species’ orthologs, to approximate the effective sequence counts they contribute to each other (footnote 4).

    What does it look like if you just use the branch lengths from the phylogeny to do this weighting? I would guess you get at least some increase in the Spearman correlations and it's a straightforward approach.
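
For readers following the two annotations above, here is a minimal sketch of a phylogenetically weighted ("effective") sequence count. The weighting expression itself is not reproduced in this feed, so the sketch assumes an exponential kernel, exp(-α · d(i, j)), which is one form consistent with a fixed mutation rate; it should not be read as the authors' exact definition. The function name `effective_counts`, the species counts, and the divergence times are all hypothetical illustration values.

```python
import numpy as np


def effective_counts(raw_counts, divergence_times, alpha):
    """Weight each species' raw count by its closeness to every other species.

    ASSUMPTION: the kernel exp(-alpha * d(i, j)) is an illustrative choice,
    not taken from the paper. With this kernel, the effective count for
    species i is sum_j n_j * exp(-alpha * d(i, j)).
    """
    n = np.asarray(raw_counts, dtype=float)
    d = np.asarray(divergence_times, dtype=float)
    if d.shape != (n.size, n.size):
        raise ValueError("divergence_times must be a square matrix matching raw_counts")
    if alpha < 0:
        raise ValueError("alpha must be non-negative")
    weights = np.exp(-alpha * d)  # 1 on the diagonal, decaying with divergence time
    return weights @ n


# Hypothetical toy data: three species, pairwise times to last common ancestor.
raw = [1000, 10, 5]
times = [
    [0, 90, 1500],
    [90, 0, 1500],
    [1500, 1500, 0],
]
eff = effective_counts(raw, times, alpha=0.01)
```

Under this assumed kernel, a sparsely sampled species that is closely related to a heavily sampled one (the second species here) receives a much larger effective count than its raw count, while a distant lineage (the third) barely changes. Setting α = 0 makes every species share the total count, and a very large α recovers the raw counts. Swapping `times` for a matrix of branch-length distances from a phylogeny, as the second annotation suggests, requires no change to the function.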