E1: Retrieval-Augmented Protein Encoder Models
Abstract
Large language models trained on natural proteins learn powerful representations of protein sequences that are useful for downstream understanding and prediction tasks. Because they are exposed only to individual protein sequences during pretraining, without any additional contextual information, conventional protein language models suffer from parameter-inefficient learning, baked-in phylogenetic biases, and degraded functional performance at larger scales. To address these challenges, we have built Profluent-E1, a family of retrieval-augmented protein language models that explicitly condition on homologous sequences. By integrating retrieved evolutionary context through block-causal multi-sequence attention, E1 captures both general and family-specific constraints without fine-tuning. We train E1 models on four trillion tokens from the Profluent Protein Atlas and achieve state-of-the-art performance across zero-shot fitness and unsupervised contact-map prediction benchmarks, surpassing alternative sequence-only models. Performance scales with model size from 150M to 600M parameters, and E1 can be used flexibly in single-sequence or retrieval-augmented inference modes for fitness prediction, variant ranking, and embedding extraction for structural tasks. To encourage open science and further development of retrieval-augmented protein language models, we release three models for free research and commercial use at https://github.com/Profluent-AI/E1.
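To make the attention scheme concrete, below is a minimal sketch of a block-causal multi-sequence attention mask. It assumes (this interpretation is ours, not taken from the released E1 code) that retrieved homologs and the query are concatenated into ordered blocks, that tokens attend bidirectionally within their own block, and that each block may also attend to all earlier blocks but not later ones. The function name `block_causal_mask` and the example block lengths are illustrative only.

```python
# Sketch of a block-causal mask over concatenated homolog + query blocks.
# Assumption: full attention within a block, causal attention across blocks.
import torch

def block_causal_mask(block_lengths: list[int]) -> torch.Tensor:
    """Return a boolean mask of shape (T, T); True means attention is allowed."""
    # Assign each token position the index of the block it belongs to.
    block_ids = torch.repeat_interleave(
        torch.arange(len(block_lengths)), torch.tensor(block_lengths)
    )
    # Token i may attend to token j iff j's block does not come after i's block.
    return block_ids.unsqueeze(1) >= block_ids.unsqueeze(0)

# Example: two retrieved homologs (lengths 5 and 4) followed by a query of length 6.
mask = block_causal_mask([5, 4, 6])
print(mask.shape)  # torch.Size([15, 15])
# A boolean mask like this can be passed as attn_mask to
# torch.nn.functional.scaled_dot_product_attention (True = attend).
```

In this sketch, dropping the homolog blocks reduces the mask to plain full self-attention over the query, which mirrors the single-sequence inference mode described in the abstract; the actual E1 implementation should be consulted in the linked repository.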