EXTENDING PROTEIN LANGUAGE MODELS TO A VIRAL GENOMIC SCALE USING BIOLOGICALLY INDUCED SPARSE ATTENTION
Abstract
The transformer architecture in deep learning has revolutionized protein sequence analysis. Recent advances in protein language models have enabled significant progress across various domains, including protein function and structure prediction, multiple sequence alignment, and mutation effect prediction. Protein language models are commonly trained on individual proteins, ignoring the interdependencies between sequences within a genome. Biologically, however, protein–protein interactions span entire genomic regions, underscoring the limitations of focusing solely on individual proteins. To address these limitations, we propose a novel approach that extends the context size of transformer models across the entire viral genome. By training on large genomic fragments, our method captures long-range interprotein interactions and encodes each protein sequence with integrated information from distant proteins in the same genome, benefiting a variety of downstream tasks. Viruses, with their densely packed genomes, minimal intergenic regions, and protein annotation challenges, are ideal candidates for genome-wide learning. We introduce a long-context protein language model, trained on entire viral genomes, that leverages a sparse attention mechanism based on protein–protein interactions. Our semi-supervised approach supports sequences of up to 61,000 amino acids (aa). Our evaluations demonstrate that the resulting embeddings significantly surpass those generated by single-protein models and outperform alternative large-context architectures that rely on static masking or non-transformer frameworks.
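To make the idea of interaction-induced sparse attention concrete, the following is a minimal illustrative sketch, not the paper's implementation: given per-protein segment lengths and a list of interacting protein pairs (both invented here for illustration), it builds a boolean attention mask that allows residues to attend within their own protein and across proteins known to interact, while blocking all other cross-protein attention.

```python
import numpy as np

def sparse_attention_mask(protein_lengths, interactions):
    """Build a boolean (L, L) attention mask over a concatenated genome.

    protein_lengths : lengths (in residues) of each protein segment, in genome order
    interactions    : list of (i, j) pairs of protein indices that interact
    Returns a mask where True means attention between the two residues is allowed.
    """
    starts = np.cumsum([0] + list(protein_lengths[:-1]))
    spans = [(s, s + n) for s, n in zip(starts, protein_lengths)]
    L = sum(protein_lengths)
    mask = np.zeros((L, L), dtype=bool)

    # Every protein attends to itself (intra-protein attention).
    allowed = {(i, i) for i in range(len(spans))}
    # Interacting protein pairs attend to each other (symmetric).
    for a, b in interactions:
        allowed.add((a, b))
        allowed.add((b, a))

    # Fill in the permitted rectangular blocks of the mask.
    for i, (si, ei) in enumerate(spans):
        for j, (sj, ej) in enumerate(spans):
            if (i, j) in allowed:
                mask[si:ei, sj:ej] = True
    return mask

# Toy genome: three proteins of lengths 4, 3, 5; proteins 0 and 2 interact.
m = sparse_attention_mask([4, 3, 5], [(0, 2)])
print(m.shape)                     # (12, 12)
print(m[0, 0], m[0, 5], m[0, 8])   # True False True
```

Such a mask can be passed to a standard attention implementation to zero out disallowed residue pairs; the quadratic cost then concentrates on biologically plausible protein pairs rather than the full genome-length context.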