Integration of protein and coding sequences enables mutual augmentation of the language model
Abstract
Recent language models have significantly accelerated our understanding of massive biological data by treating protein or DNA/RNA sequences as a single-language modality. Here we present a dual-language foundation model that integrates both protein and coding sequences (CDS) for pre-training. Compared to benchmark models, it shows superior performance, with gains of up to ∼20% on both protein- and mRNA-related discriminative tasks, and acquires the capacity to de novo generate coding sequences that increase protein yield by ∼50%. Moreover, the model transfers knowledge from the pre-training data to the upstream 5' untranslated regions. These findings indicate intrinsic correlations between a protein and its CDS, as well as between the coding region and the sequences beyond it. This work provides a new paradigm that leverages a multiple-language foundation model to interpret the hidden context of distinct corpora/biological languages, and it could be further applied to mine yet-unknown biological information and correlations beyond the Central Dogma.
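The abstract does not describe the model's data format, so the following is only a minimal sketch of what a paired protein/CDS pre-training example could look like: a CDS is translated with the standard genetic code and the codon tokens and amino-acid tokens are concatenated into one sequence. The marker tokens [CDS], [SEP], and [PROT] are hypothetical placeholders, not the authors' actual vocabulary.

```python
# Minimal sketch (not the authors' implementation): pairing a coding sequence (CDS)
# with its translated protein to form a single "dual-language" training example.

# Standard genetic code; '*' marks a stop codon.
CODON_TABLE = {
    'TTT': 'F', 'TTC': 'F', 'TTA': 'L', 'TTG': 'L',
    'CTT': 'L', 'CTC': 'L', 'CTA': 'L', 'CTG': 'L',
    'ATT': 'I', 'ATC': 'I', 'ATA': 'I', 'ATG': 'M',
    'GTT': 'V', 'GTC': 'V', 'GTA': 'V', 'GTG': 'V',
    'TCT': 'S', 'TCC': 'S', 'TCA': 'S', 'TCG': 'S',
    'CCT': 'P', 'CCC': 'P', 'CCA': 'P', 'CCG': 'P',
    'ACT': 'T', 'ACC': 'T', 'ACA': 'T', 'ACG': 'T',
    'GCT': 'A', 'GCC': 'A', 'GCA': 'A', 'GCG': 'A',
    'TAT': 'Y', 'TAC': 'Y', 'TAA': '*', 'TAG': '*',
    'CAT': 'H', 'CAC': 'H', 'CAA': 'Q', 'CAG': 'Q',
    'AAT': 'N', 'AAC': 'N', 'AAA': 'K', 'AAG': 'K',
    'GAT': 'D', 'GAC': 'D', 'GAA': 'E', 'GAG': 'E',
    'TGT': 'C', 'TGC': 'C', 'TGA': '*', 'TGG': 'W',
    'CGT': 'R', 'CGC': 'R', 'CGA': 'R', 'CGG': 'R',
    'AGT': 'S', 'AGC': 'S', 'AGA': 'R', 'AGG': 'R',
    'GGT': 'G', 'GGC': 'G', 'GGA': 'G', 'GGG': 'G',
}

def translate(cds: str) -> str:
    """Translate an in-frame CDS into its protein sequence (trailing stop dropped)."""
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    return ''.join(CODON_TABLE[c] for c in codons).rstrip('*')

def make_dual_language_example(cds: str) -> str:
    """Concatenate codon tokens and amino-acid tokens into one training sequence."""
    protein = translate(cds)
    codon_tokens = ' '.join(cds[i:i + 3] for i in range(0, len(protein) * 3, 3))
    protein_tokens = ' '.join(protein)
    return f"[CDS] {codon_tokens} [SEP] [PROT] {protein_tokens}"

if __name__ == "__main__":
    # Short example CDS (Met-Lys-Trp-Gly followed by a stop codon).
    print(make_dual_language_example("ATGAAATGGGGTTAA"))
    # -> [CDS] ATG AAA TGG GGT [SEP] [PROT] M K W G
```

A joint example of this kind would let a single language model attend across both modalities of the same gene, which is the intuition behind the "mutual augmentation" claimed in the title; the actual pre-training objective and tokenization are described in the paper, not here.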