Scaling down protein language modeling with MSA Pairformer
This article has been reviewed by the following groups
Listed in
- Evaluated articles (Arcadia Science)
Abstract
Recent efforts in protein language modeling have focused on scaling single-sequence models and their training data, requiring vast compute resources that limit accessibility. Although models that use multiple sequence alignments (MSA), such as MSA Transformer, offer parameter-efficient alternatives by extracting evolutionary information directly from homologous sequences rather than storing it in parameters, they generally underperform compared to single-sequence language models due to memory inefficiencies that limit the number of input sequences and because they average evolutionary signals across the MSA. We address these challenges with MSA Pairformer, a 111M-parameter, memory-efficient MSA-based protein language model that extracts the evolutionary signals most relevant to a query sequence through bi-directional updates of sequence and pairwise representations. MSA Pairformer achieves state-of-the-art performance in unsupervised contact prediction, outperforming ESM2-15B by 6 percentage points while using two orders of magnitude fewer parameters. In predicting contacts at protein-protein interfaces, MSA Pairformer substantially outperforms all methods, with a 24-percentage-point increase over MSA Transformer. Unlike single-sequence models, whose variant effect prediction deteriorates as they scale, MSA Pairformer maintains strong performance in both tasks. Ablation studies reveal that triangle operations remove indirect correlations, and unlike MSA Transformer, MSA Pairformer does not hallucinate contacts after covariance is removed, enabling reliable screening of interacting sequence pairs. Overall, our work presents an alternative to the current scaling paradigm in protein language modeling, enabling efficient adaptation to rapidly expanding sequence databases and opening new directions for biological discovery.
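For readers unfamiliar with the sequence/pair update scheme described in the abstract, the sketch below illustrates how a Pairformer-style block might couple an MSA representation with a pairwise representation: row attention over the MSA is biased by the pair track, an outer-product mean feeds sequence information back into the pair track, and a triangle multiplicative update propagates information through shared residues. This is a minimal, illustrative sketch based on publicly documented Evoformer/Pairformer-style operations, not the authors' implementation; the module names, dimensions, single-head attention, and omission of other components (column attention, transition layers, the masked-language-model head) are simplifying assumptions.

```python
# Illustrative sketch only (assumptions noted above) -- not the MSA Pairformer codebase.
import torch
import torch.nn as nn


class TriangleMultiplicationOutgoing(nn.Module):
    """Update pair(i, j) from edges (i, k) and (j, k), the kind of triangle operation
    credited with suppressing indirect (transitive) correlations."""

    def __init__(self, d_pair: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_pair)
        self.left = nn.Linear(d_pair, d_pair)
        self.right = nn.Linear(d_pair, d_pair)
        self.gate = nn.Linear(d_pair, d_pair)
        self.out = nn.Linear(d_pair, d_pair)

    def forward(self, pair: torch.Tensor) -> torch.Tensor:   # pair: [L, L, d_pair]
        z = self.norm(pair)
        a, b = self.left(z), self.right(z)                    # edges (i, k) and (j, k)
        update = torch.einsum("ikc,jkc->ijc", a, b)           # combine over the shared residue k
        return pair + torch.sigmoid(self.gate(z)) * self.out(update)


class PairBiasedRowAttention(nn.Module):
    """Row-wise attention over MSA columns, biased by the pair representation."""

    def __init__(self, d_msa: int, d_pair: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_msa)
        self.qkv = nn.Linear(d_msa, 3 * d_msa, bias=False)
        self.pair_bias = nn.Linear(d_pair, 1, bias=False)
        self.scale = d_msa ** -0.5

    def forward(self, msa: torch.Tensor, pair: torch.Tensor) -> torch.Tensor:
        # msa: [N_seq, L, d_msa], pair: [L, L, d_pair]
        x = self.norm(msa)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        logits = torch.einsum("sic,sjc->sij", q, k) * self.scale
        logits = logits + self.pair_bias(pair).squeeze(-1)    # pair track biases the sequence track
        attn = logits.softmax(dim=-1)
        return msa + torch.einsum("sij,sjc->sic", attn, v)


class OuterProductMean(nn.Module):
    """Average outer products of projected MSA columns to update the pair representation."""

    def __init__(self, d_msa: int, d_pair: int, d_hidden: int = 32):
        super().__init__()
        self.norm = nn.LayerNorm(d_msa)
        self.proj_a = nn.Linear(d_msa, d_hidden)
        self.proj_b = nn.Linear(d_msa, d_hidden)
        self.out = nn.Linear(d_hidden * d_hidden, d_pair)

    def forward(self, msa: torch.Tensor) -> torch.Tensor:     # msa: [N_seq, L, d_msa]
        x = self.norm(msa)
        a, b = self.proj_a(x), self.proj_b(x)
        outer = torch.einsum("sic,sjd->ijcd", a, b) / msa.shape[0]
        return self.out(outer.flatten(-2))                    # [L, L, d_pair]


class PairformerBlock(nn.Module):
    """One bi-directional update: pair -> MSA (attention bias) and MSA -> pair (outer product + triangle)."""

    def __init__(self, d_msa: int = 64, d_pair: int = 32):
        super().__init__()
        self.row_attn = PairBiasedRowAttention(d_msa, d_pair)
        self.opm = OuterProductMean(d_msa, d_pair)
        self.tri = TriangleMultiplicationOutgoing(d_pair)

    def forward(self, msa: torch.Tensor, pair: torch.Tensor):
        msa = self.row_attn(msa, pair)
        pair = pair + self.opm(msa)
        pair = self.tri(pair)
        return msa, pair


# Toy usage: 8 aligned sequences of length 50.
msa = torch.randn(8, 50, 64)
pair = torch.zeros(50, 50, 32)
msa, pair = PairformerBlock()(msa, pair)
print(msa.shape, pair.shape)  # torch.Size([8, 50, 64]) torch.Size([50, 50, 32])
```

The triangle update is the step most relevant to the abstract's ablation claim: because pair(i, j) is recomputed from edges that share a third residue k, transitive couplings (i contacts k, k contacts j) can be explained away rather than mistaken for a direct i–j contact.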
Article activity feed
-
MSAs can now be constructed in milliseconds [58]. As MSA generation methods continue to improve, models that efficiently leverage the rapidly growing set of available sequences, and thus richer evolutionary context, are well-positioned to advance protein language modeling toward a more sustainable future
Totally agree, and it's great to see this properly leveraged in the model. At the same time, this got me thinking that not all MSAs are created equally. Scalable methods (e.g., HMM-based or k-mer–based approaches) produce alignments at the scale required for these models, but these are quite different from the phylogenetics-grade MSAs carefully curated for evolutionary inference, which often incorporate clade-specific substitution models, manual curation, etc.
To me, this raises a question that I think deserves investigation: since the model was trained on cheap-to-make MSAs, would inference on the highest-quality MSAs improve the model's performance? Or, because such an MSA would represent a slight departure from the model's training distribution, would we expect the model to perform worse on this "superior" input?