Efficient and accurate sequence generation with small-scale protein language models


Abstract

Large Language Models (LLMs) have demonstrated exceptional capabilities in understanding contextual relationships, outperforming traditional methodologies in downstream tasks such as text generation and sentence classification. This success has been mirrored in the realm of protein language models (pLMs), where proteins are encoded as text via their amino acid sequences. However, the training of pLMs, which involves tens to hundreds of millions of sequences and hundreds of millions to billions of parameters, poses a significant computational challenge.

In this study, we introduce a Small-Scale Protein Language Model (SS-pLM), a more accessible approach that requires training on merely millions of representative sequences and reduces the number of trainable parameters to 14.8M. This model significantly lowers the computational load, thereby democratizing the use of foundation models in protein studies. We demonstrate that, when fine-tuned on a specific set of sequences for generation, its performance is comparable to that of larger, more computationally demanding pLMs.
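To make the scale concrete, the sketch below shows what a decoder-only protein language model in this parameter regime could look like. It is a minimal, illustrative PyTorch example, not the authors' architecture: the vocabulary, layer sizes (6 layers, model width 448), and sampling loop are assumptions chosen so the parameter count lands near the stated 14.8M, and an untrained model will of course emit random amino-acid strings.

```python
import torch
import torch.nn as nn

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Vocabulary: 20 standard amino acids plus padding, start, and end tokens (assumed).
VOCAB = ["<pad>", "<bos>", "<eos>"] + list(AMINO_ACIDS)
STOI = {tok: i for i, tok in enumerate(VOCAB)}


class SmallProteinLM(nn.Module):
    """Decoder-only transformer over amino-acid tokens.

    Sizes are illustrative: with d_model=448 and 6 layers this comes to
    roughly 14.7M trainable parameters, close to the 14.8M cited in the
    abstract, but it is not the paper's actual configuration.
    """

    def __init__(self, vocab_size=len(VOCAB), d_model=448, n_layers=6,
                 n_heads=8, max_len=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        self.max_len = max_len

    def forward(self, idx):
        # idx: (batch, seq_len) token ids
        b, t = idx.shape
        pos = torch.arange(t, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)
        # Causal mask so each position attends only to earlier residues.
        mask = torch.triu(
            torch.full((t, t), float("-inf"), device=idx.device), diagonal=1)
        x = self.blocks(x, mask=mask)
        return self.lm_head(x)  # (batch, seq_len, vocab_size)


@torch.no_grad()
def generate(model, max_new_tokens=100, temperature=1.0):
    """Sample one sequence token by token, starting from <bos>."""
    model.eval()
    idx = torch.tensor([[STOI["<bos>"]]])
    for _ in range(max_new_tokens):
        logits = model(idx[:, -model.max_len:])[:, -1, :] / temperature
        next_tok = torch.multinomial(torch.softmax(logits, dim=-1), num_samples=1)
        idx = torch.cat([idx, next_tok], dim=1)
        if next_tok.item() == STOI["<eos>"]:
            break
    # Drop special tokens and return the amino-acid string.
    return "".join(VOCAB[i] for i in idx[0, 1:].tolist() if i > 2)


model = SmallProteinLM()
print(f"trainable parameters: {sum(p.numel() for p in model.parameters()):,}")
print(generate(model, max_new_tokens=30))
```

Under these assumptions, pretraining would run a standard next-token cross-entropy loss over the representative sequence set, and fine-tuning for targeted generation would continue the same objective on the smaller, task-specific family of sequences.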
