Scaling down protein language modeling with MSA Pairformer
This article has been reviewed by the following groups
Listed in
- Evaluated articles (Arcadia Science)
Abstract
Recent efforts in protein language modeling have focused on scaling single-sequence models and their training data, requiring vast compute resources that limit accessibility. Although models that use multiple sequence alignments (MSA), such as MSA Transformer, offer parameter-efficient alternatives by extracting evolutionary information directly from homologous sequences rather than storing it in parameters, they generally underperform compared to single-sequence language models due to memory inefficiencies that limit the number of input sequences and because they average evolutionary signals across the MSA. We address these challenges with MSA Pairformer, a 111M-parameter, memory-efficient MSA-based protein language model that extracts the evolutionary signals most relevant to a query sequence through bi-directional updates of sequence and pairwise representations. MSA Pairformer achieves state-of-the-art performance in unsupervised contact prediction, outperforming ESM2-15B by 6 percentage points while using two orders of magnitude fewer parameters. In predicting contacts at protein-protein interfaces, MSA Pairformer substantially outperforms all methods, with a 24-percentage-point increase over MSA Transformer. Unlike single-sequence models, whose variant effect prediction deteriorates as they scale, MSA Pairformer maintains strong performance in both tasks. Ablation studies reveal that triangle operations remove indirect correlations, and unlike MSA Transformer, MSA Pairformer does not hallucinate contacts after covariance is removed, enabling reliable screening of interacting sequence pairs. Overall, our work presents an alternative to the current scaling paradigm in protein language modeling, enabling efficient adaptation to rapidly expanding sequence databases and opening new directions for biological discovery.
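For readers unfamiliar with the sequence/pair update scheme described in the abstract, the sketch below illustrates how a Pairformer-style block might couple an MSA representation with a pairwise representation: row attention over the MSA is biased by the pair track, an outer-product mean feeds sequence information back into the pair track, and a triangle multiplicative update propagates information through shared residues. This is a minimal, illustrative sketch based on publicly documented Evoformer/Pairformer-style operations, not the authors' implementation; the module names, dimensions, single-head attention, and omission of other components (column attention, transition layers, the masked-language-model head) are simplifying assumptions.

```python
# Illustrative sketch only (assumptions noted above) -- not the MSA Pairformer codebase.
import torch
import torch.nn as nn


class TriangleMultiplicationOutgoing(nn.Module):
    """Update pair(i, j) from edges (i, k) and (j, k), the kind of triangle operation
    credited with suppressing indirect (transitive) correlations."""

    def __init__(self, d_pair: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_pair)
        self.left = nn.Linear(d_pair, d_pair)
        self.right = nn.Linear(d_pair, d_pair)
        self.gate = nn.Linear(d_pair, d_pair)
        self.out = nn.Linear(d_pair, d_pair)

    def forward(self, pair: torch.Tensor) -> torch.Tensor:   # pair: [L, L, d_pair]
        z = self.norm(pair)
        a, b = self.left(z), self.right(z)                    # edges (i, k) and (j, k)
        update = torch.einsum("ikc,jkc->ijc", a, b)           # combine over the shared residue k
        return pair + torch.sigmoid(self.gate(z)) * self.out(update)


class PairBiasedRowAttention(nn.Module):
    """Row-wise attention over MSA columns, biased by the pair representation."""

    def __init__(self, d_msa: int, d_pair: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_msa)
        self.qkv = nn.Linear(d_msa, 3 * d_msa, bias=False)
        self.pair_bias = nn.Linear(d_pair, 1, bias=False)
        self.scale = d_msa ** -0.5

    def forward(self, msa: torch.Tensor, pair: torch.Tensor) -> torch.Tensor:
        # msa: [N_seq, L, d_msa], pair: [L, L, d_pair]
        x = self.norm(msa)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        logits = torch.einsum("sic,sjc->sij", q, k) * self.scale
        logits = logits + self.pair_bias(pair).squeeze(-1)    # pair track biases the sequence track
        attn = logits.softmax(dim=-1)
        return msa + torch.einsum("sij,sjc->sic", attn, v)


class OuterProductMean(nn.Module):
    """Average outer products of projected MSA columns to update the pair representation."""

    def __init__(self, d_msa: int, d_pair: int, d_hidden: int = 32):
        super().__init__()
        self.norm = nn.LayerNorm(d_msa)
        self.proj_a = nn.Linear(d_msa, d_hidden)
        self.proj_b = nn.Linear(d_msa, d_hidden)
        self.out = nn.Linear(d_hidden * d_hidden, d_pair)

    def forward(self, msa: torch.Tensor) -> torch.Tensor:     # msa: [N_seq, L, d_msa]
        x = self.norm(msa)
        a, b = self.proj_a(x), self.proj_b(x)
        outer = torch.einsum("sic,sjd->ijcd", a, b) / msa.shape[0]
        return self.out(outer.flatten(-2))                    # [L, L, d_pair]


class PairformerBlock(nn.Module):
    """One bi-directional update: pair -> MSA (attention bias) and MSA -> pair (outer product + triangle)."""

    def __init__(self, d_msa: int = 64, d_pair: int = 32):
        super().__init__()
        self.row_attn = PairBiasedRowAttention(d_msa, d_pair)
        self.opm = OuterProductMean(d_msa, d_pair)
        self.tri = TriangleMultiplicationOutgoing(d_pair)

    def forward(self, msa: torch.Tensor, pair: torch.Tensor):
        msa = self.row_attn(msa, pair)
        pair = pair + self.opm(msa)
        pair = self.tri(pair)
        return msa, pair


# Toy usage: 8 aligned sequences of length 50.
msa = torch.randn(8, 50, 64)
pair = torch.zeros(50, 50, 32)
msa, pair = PairformerBlock()(msa, pair)
print(msa.shape, pair.shape)  # torch.Size([8, 50, 64]) torch.Size([50, 50, 32])
```

The triangle update is the step most relevant to the abstract's ablation claim: because pair(i, j) is recomputed from edges that share a third residue k, transitive couplings (i contacts k, k contacts j) can be explained away rather than mistaken for a direct i–j contact.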
Article activity feed
-
MSAs can now be constructed in milliseconds [58]. As MSA generation methods continue to improve, models that efficiently leverage the rapidly growing set of available sequences, and thus richer evolutionary context, are well-positioned to advance protein language modeling toward a more sustainable future
Totally agree, and it's great to see this properly leveraged in the model. At the same time, this got me thinking that not all MSAs are created equally. Scalable methods (e.g., HMM-based or k-mer–based approaches) produce alignments at the scale required for these models, but these are quite different from the phylogenetics-grade MSAs carefully curated for evolutionary inference, which often incorporate clade-specific substitution models, manual curation, etc.
To me, this raises a question that I think deserves investigation: since the model was trained on cheap-to-make MSAs, would inference on the highest-quality MSAs improve the model's performance? Or, because such an MSA would represent a slight departure from the model's training distribution, would we expect the model to perform worse on this "superior" input?