Which pLM to choose?

Abstract

Protein-language models (pLMs) provide a novel means of mapping the protein space. Which of these new maps best advances a given biological analysis, however, is not obvious. To elucidate the principles of model selection, we benchmarked fourteen pLMs, spanning several orders of magnitude in parameter count, across a hundred million protein pairs, assessing how well they capture sequence, structure, and function similarity. For each model, we distinguish inherent information, i.e. signal recoverable directly from raw-embedding distances, from extractable information, i.e. signal revealed only through additional supervised training. Three key results emerge. First, the pLM protein representation space is inherently different from the space of biological protein representations, i.e. sequences or structures. Here, a size-performance paradox is salient: mid-scale foundation models reflect all tested biological properties as well as much larger ones. Second, pLM representations compress and store biological information in proportion to model size; that is, a lightweight feed-forward network trained on embedding pairs predicts these biological properties well, a capacity dividend. Finally, we observe that task-specific learning radically reshapes the embedding space, gaining inherent understanding of the target task but degrading further extraction of other biological signals. In other words, smaller pLMs can provide efficient, compute-light general insight, while larger models are advantageous only when fine-tuning for a specific task is planned. Furthermore, representations generated by "specialist" models do not immediately generalize across protein biology. Thus, for pLMs, bigger isn't always better.
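
To make the distinction between inherent and extractable information concrete, the sketch below contrasts the two evaluation regimes described in the abstract: similarity read directly off raw-embedding distances versus a lightweight feed-forward head trained on embedding pairs. This is a minimal illustration, not the authors' code; the embedding dimension, the PairHead module, and the random placeholder embeddings and labels are all hypothetical stand-ins for real per-protein pLM representations and similarity annotations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

embed_dim = 1024   # hypothetical pLM embedding size
n_pairs = 4096     # hypothetical number of protein pairs

# Placeholder embeddings standing in for pooled pLM representations of each protein.
emb_a = torch.randn(n_pairs, embed_dim)
emb_b = torch.randn(n_pairs, embed_dim)
# Placeholder similarity labels (e.g. sequence/structure/function similarity in [0, 1]).
labels = torch.rand(n_pairs)

# Inherent information: raw-embedding distance, no training involved.
inherent_score = F.cosine_similarity(emb_a, emb_b, dim=-1)   # shape: (n_pairs,)

# Extractable information: a small supervised head trained on embedding pairs.
class PairHead(nn.Module):
    """Lightweight feed-forward network over a concatenated embedding pair."""
    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([a, b], dim=-1)).squeeze(-1)

head = PairHead(embed_dim)
opt = torch.optim.Adam(head.parameters(), lr=1e-3)

for _ in range(10):   # a few illustrative training steps
    opt.zero_grad()
    loss = F.mse_loss(head(emb_a, emb_b), labels)
    loss.backward()
    opt.step()

extractable_score = head(emb_a, emb_b)   # learned similarity after supervision
```

In the benchmark's terms, how well inherent_score correlates with the labels reflects inherent information, while the performance of the trained head reflects extractable information.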
