Integration of protein and coding sequences enables mutual augmentation of the language model
Abstract
Recent language models have significantly accelerated our understanding of massive biological data by treating protein or DNA/RNA sequences as a single-language modality. Here we present a dual-language foundation model that integrates both protein and coding sequences (CDS) for pre-training. Compared to benchmark models, it shows superior performance, with gains of up to ∼20% on both protein- and mRNA-related discriminative tasks, and acquires the capacity to de novo generate coding sequences that increase protein yield by ∼50%. Moreover, the model transfers knowledge from the pre-training data to the upstream 5' untranslated regions. These findings indicate intrinsic correlations between a protein and its CDS, as well as between the coding region and the sequences beyond it. This work provides a new paradigm that leverages a multiple-language foundation model to interpret the hidden context of distinct corpora/biological languages, and it could be further applied to mine yet-unknown biological information and correlations beyond the Central Dogma.
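The abstract does not describe the model's data format, so the following is only a minimal sketch of what a paired protein/CDS pre-training example could look like: a CDS is translated with the standard genetic code and the codon tokens and amino-acid tokens are concatenated into one sequence. The marker tokens [CDS], [SEP], and [PROT] are hypothetical placeholders, not the authors' actual vocabulary.

```python
# Minimal sketch (not the authors' implementation): pairing a coding sequence (CDS)
# with its translated protein to form a single "dual-language" training example.

# Standard genetic code; '*' marks a stop codon.
CODON_TABLE = {
    'TTT': 'F', 'TTC': 'F', 'TTA': 'L', 'TTG': 'L',
    'CTT': 'L', 'CTC': 'L', 'CTA': 'L', 'CTG': 'L',
    'ATT': 'I', 'ATC': 'I', 'ATA': 'I', 'ATG': 'M',
    'GTT': 'V', 'GTC': 'V', 'GTA': 'V', 'GTG': 'V',
    'TCT': 'S', 'TCC': 'S', 'TCA': 'S', 'TCG': 'S',
    'CCT': 'P', 'CCC': 'P', 'CCA': 'P', 'CCG': 'P',
    'ACT': 'T', 'ACC': 'T', 'ACA': 'T', 'ACG': 'T',
    'GCT': 'A', 'GCC': 'A', 'GCA': 'A', 'GCG': 'A',
    'TAT': 'Y', 'TAC': 'Y', 'TAA': '*', 'TAG': '*',
    'CAT': 'H', 'CAC': 'H', 'CAA': 'Q', 'CAG': 'Q',
    'AAT': 'N', 'AAC': 'N', 'AAA': 'K', 'AAG': 'K',
    'GAT': 'D', 'GAC': 'D', 'GAA': 'E', 'GAG': 'E',
    'TGT': 'C', 'TGC': 'C', 'TGA': '*', 'TGG': 'W',
    'CGT': 'R', 'CGC': 'R', 'CGA': 'R', 'CGG': 'R',
    'AGT': 'S', 'AGC': 'S', 'AGA': 'R', 'AGG': 'R',
    'GGT': 'G', 'GGC': 'G', 'GGA': 'G', 'GGG': 'G',
}

def translate(cds: str) -> str:
    """Translate an in-frame CDS into its protein sequence (trailing stop dropped)."""
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    return ''.join(CODON_TABLE[c] for c in codons).rstrip('*')

def make_dual_language_example(cds: str) -> str:
    """Concatenate codon tokens and amino-acid tokens into one training sequence."""
    protein = translate(cds)
    codon_tokens = ' '.join(cds[i:i + 3] for i in range(0, len(protein) * 3, 3))
    protein_tokens = ' '.join(protein)
    return f"[CDS] {codon_tokens} [SEP] [PROT] {protein_tokens}"

if __name__ == "__main__":
    # Short example CDS (Met-Lys-Trp-Gly followed by a stop codon).
    print(make_dual_language_example("ATGAAATGGGGTTAA"))
    # -> [CDS] ATG AAA TGG GGT [SEP] [PROT] M K W G
```

A joint example of this kind would let a single language model attend across both modalities of the same gene, which is the intuition behind the "mutual augmentation" claimed in the title; the actual pre-training objective and tokenization are described in the paper, not here.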