PepBERT: Lightweight language models for peptide representation

Zhenjiao Du
Yonghui Li

Abstract

Protein language models (pLMs) have been widely adopted for protein- and peptide-related downstream tasks and have demonstrated promising performance. However, short peptides are significantly underrepresented in commonly used pLM training datasets. For example, only 2.8% of sequences in the UniProt Reference Cluster (UniRef) contain fewer than 50 residues, which potentially limits the effectiveness of pLMs for peptide-specific applications. Here, we present PepBERT, a lightweight and efficient peptide language model specifically designed for encoding peptide sequences. Two versions of the model, PepBERT-large (4.9 million parameters) and PepBERT-small (1.86 million parameters), were pretrained from scratch using four custom peptide datasets and evaluated on nine peptide-related downstream prediction tasks. Both PepBERT models achieved performance superior or comparable to the benchmark model, ESM-2 with 7.5 million parameters, on 8 out of 9 datasets. Overall, PepBERT provides a compact yet effective solution for generating high-quality peptide representations for downstream applications such as bioactive peptide screening and drug discovery. The datasets, source code, pretrained models, and usage tutorials for PepBERT are available at https://github.com/dzjxzyd/PepBERT-large.

Version published to 10.1101/2025.04.08.647838v1 on bioRxiv
Apr 14, 2025

Related articles

PepSeek: Universal Functional Peptide Discovery with Cooperation Between Specialized Deep Learning Models and Large Language Model

This article has 14 authors:
1. Haifan Gong
2. Yue Wang
3. Qingzhou Kong
4. Xiaojuan Li
5. Lixiang Li
6. Boyao Wan
7. Yinuo Zhao
8. Jinghui Zhang
9. Guanqi Chen
10. Jiaxin Chen
11. Yanbo Yu
12. Xiaoyun Yang
13. Xiuli Zuo
14. Yanqing Li
This article has no evaluations. Latest version: Apr 30, 2025

Extending Prot2Token: Aligning Protein Language Models for Unified and Diverse Protein Prediction Tasks

This article has 7 authors:
1. Mahdi Pourmirzaei
2. Ye Han
3. Farzaneh Esmaili
4. Mohammadreza Pourmirzaei
5. Salhuldin Alqarghuli
6. Kai Chen
7. Dong Xu
This article has no evaluations. Latest version: Mar 11, 2025

Enhancing Structure-aware Protein Language Models with Efficient Fine-tuning for Various Protein Prediction Tasks

This article has 6 authors:
1. Yichuan Zhang
2. Yongfang Qin
3. Mahdi Pourmirzaei
4. Qing Shao
5. Duolin Wang
6. Dong Xu
This article has no evaluations. Latest version: Apr 26, 2025
