PEPE: Scalable extraction of multi-modal protein language model representations

Jahn Zhong
Niccolò Cardente
Geir Kjetil Sandve
Habib Bashour
Maria Francesca Abbate
Victor Greiff

Abstract

Motivation

Protein language models (PLMs) have shown significant potential for capturing the complex interaction patterns between amino acids in protein sequences. Trained on large protein sequence datasets, these models generate embeddings: high-dimensional numerical representations that encode information about the structure, function, and evolution of proteins. In conventional usage, however, the settings that define an embedding (its embedding mode), including the choice of embedding layer, pooling method, and padding, have largely been chosen arbitrarily, which can lead to a suboptimal representation with low information content for a given downstream task. The scalability of extracting multiple embedding modes is limited by inefficiencies in both space (memory) and time (computation): (i) accumulating all outputs in memory and writing them to disk in a single operation creates a memory bottleneck, and (ii) repeatedly embedding the same sequence to extract different embedding modes adds unnecessary computational overhead and significantly reduces throughput.
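To make the notion of an embedding mode concrete, the following minimal sketch derives several layer and pooling combinations from the per-layer hidden states of a single forward pass, masking out padding positions. It is an illustration under stated assumptions, not PEPE's implementation: the function name `extract_modes`, the array layout, and the choice of mean and max pooling are hypothetical, and NumPy arrays stand in for real model outputs.

```python
import numpy as np


def extract_modes(hidden_states, attention_mask, layers, poolings=("mean", "max")):
    """Derive several (layer, pooling) embeddings from one set of hidden states.

    Assumed (hypothetical) layout:
      hidden_states:  array of shape (n_layers, batch, seq_len, dim)
      attention_mask: array of shape (batch, seq_len), 1 for residues, 0 for padding
    """
    # Boolean mask with a trailing axis so it broadcasts over the embedding dim.
    mask = np.asarray(attention_mask).astype(bool)[..., None]  # (batch, seq_len, 1)
    counts = np.maximum(mask.sum(axis=1), 1)                   # (batch, 1), avoids /0

    modes = {}
    for layer in layers:
        h = hidden_states[layer]  # (batch, seq_len, dim)
        if "mean" in poolings:
            # Padding positions contribute zero to the sum and are not counted.
            modes[(layer, "mean")] = np.where(mask, h, 0.0).sum(axis=1) / counts
        if "max" in poolings:
            # Padding positions are set to -inf so they never win the max.
            # A sequence consisting only of padding would yield -inf here.
            modes[(layer, "max")] = np.where(mask, h, -np.inf).max(axis=1)
    return modes
```

Because every mode is computed from the same hidden states, each sequence is embedded once regardless of how many layer and pooling combinations are requested, which is the source of the computational saving described in point (ii).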

Results

Here, we present PEPE (Parallel Extraction for Protein Embeddings), a command-line tool designed for high-throughput extraction of multi-modal protein sequence embeddings. We demonstrate that PEPE's parallel processing achieves a total run time several orders of magnitude faster than sequential approaches. We also show that, for a state-of-the-art (SOTA) method, peak memory usage scales with output size and extraction fails once memory capacity is exceeded, whereas PEPE's peak memory usage remains consistently below the critical limit, allowing the extraction of multi-modal embeddings whose total size exceeds the available memory. PEPE supports a wide range of publicly available and custom protein language models and provides a simple command-line interface for researchers. PEPE enables the generation of protein embedding datasets at previously infeasible scales, facilitating the identification of optimal protein embedding settings for downstream analyses without requiring additional resources for fine-tuning.
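One general way to keep peak memory independent of output size is to write each batch of embeddings to a disk-backed array as soon as it is produced, rather than accumulating all outputs first. The sketch below illustrates that idea with NumPy's `open_memmap`; it is not taken from PEPE's source code, and the function name `stream_embeddings` and its parameters are hypothetical.

```python
import numpy as np
from numpy.lib.format import open_memmap


def stream_embeddings(batches, path, n_sequences, dim, dtype=np.float32):
    """Write batches of embeddings to a .npy file without holding them all in RAM.

    batches: iterable of arrays of shape (batch_size, dim), e.g. a generator
             that embeds sequences lazily (hypothetical input).
    Returns the number of rows written.
    """
    # Pre-allocate the full output on disk; only touched pages occupy memory.
    out = open_memmap(path, mode="w+", dtype=dtype, shape=(n_sequences, dim))
    row = 0
    for batch in batches:
        n = batch.shape[0]
        out[row:row + n] = batch  # copy this batch into its slot on disk
        row += n
    out.flush()  # make sure pending writes reach the file
    del out      # release the memory map
    return row
```

The resulting file can later be read with `np.load(path, mmap_mode="r")`, so downstream analyses can also access embeddings that do not fit in memory.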

Availability and Implementation

PEPE is a command-line tool written in Python and published under the MIT license. The source code and documentation are available at https://github.com/csi-greifflab/pepe-cli. PEPE can also be installed from PyPI (https://pypi.org/project/pepe-cli) and is deposited on Zenodo at https://zenodo.org/records/15912054.

Version published to 10.1101/2025.10.13.680902 on bioRxiv
Oct 14, 2025

