PEPE: Scalable extraction of multi-modal protein language model representations
Abstract
Motivation
Protein language models (PLMs) have demonstrated significant potential in capturing the complex interaction patterns between amino acids in protein sequences. Trained on large datasets of protein sequences, these models generate embeddings: high-dimensional numerical representations that encode valuable information about the structure, function, and evolution of proteins. However, embeddings are conventionally extracted with arbitrarily chosen settings (embedding modes), including the choice of embedding layer, pooling method, and padding, which can lead to a suboptimal representation with low information content for a given downstream task. The scalability of embedding mode extraction is limited by inefficiencies in both space (memory) and time (computation): (i) accumulating all outputs in memory and writing them to disk in a single operation creates a memory bottleneck, and (ii) repeatedly embedding the same sequence to extract different embedding modes introduces unnecessary computational overhead and significantly reduces throughput.
Results
Here, we present PEPE (Parallel Extraction for Protein Embeddings), a command-line tool designed for high-throughput extraction of multi-modal protein sequence embeddings. We demonstrate that PEPE’s parallel processing achieves a total run time several orders of magnitude shorter than that of sequential approaches. We also show that, for a state-of-the-art (SOTA) method, peak memory usage scales with output size and the method fails once memory capacity is exceeded, whereas PEPE’s peak memory usage remains consistently below the critical limit, allowing the extraction of multi-modal embeddings whose total size exceeds the available memory. PEPE supports a wide range of publicly available and custom protein language models and provides a simple command-line interface for researchers. PEPE enables the generation of protein embedding datasets at previously infeasible scales, facilitating the identification of optimal protein embedding settings for downstream analyses without requiring additional resources for fine-tuning.
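The two bottlenecks named under Motivation can be illustrated with a minimal sketch. This is not PEPE's implementation: `toy_model`, `extract_modes`, the pooling helpers, and the file layout are all hypothetical names chosen for illustration, and the "model" is a deterministic stand-in for a real PLM. The sketch embeds each sequence once, derives every requested (layer, pooling) mode from that single pass, and appends results to disk batch by batch so that the full output never has to be held in memory.

```python
import os
import struct
import tempfile


def toy_model(sequence, n_layers=3, dim=4):
    """Hypothetical stand-in for a PLM: per-layer, per-residue vectors."""
    return [
        [[float((ord(aa) + layer * (d + 1)) % 7) for d in range(dim)]
         for aa in sequence]
        for layer in range(n_layers)
    ]


def mean_pool(residues):
    """Average the per-residue vectors into one sequence-level vector."""
    n = len(residues)
    return [sum(col) / n for col in zip(*residues)]


def max_pool(residues):
    """Take the element-wise maximum over the per-residue vectors."""
    return [max(col) for col in zip(*residues)]


POOLERS = {"mean": mean_pool, "max": max_pool}


def extract_modes(sequences, layers, out_dir, batch_size=2):
    """Embed each sequence once and append every mode to its own file."""
    handles = {
        (layer, name): open(
            os.path.join(out_dir, f"layer{layer}_{name}.bin"), "wb")
        for layer in layers
        for name in POOLERS
    }
    try:
        for start in range(0, len(sequences), batch_size):
            for seq in sequences[start:start + batch_size]:
                hidden = toy_model(seq)  # one forward pass per sequence
                # Reuse the same hidden states for every requested mode.
                for (layer, name), fh in handles.items():
                    vec = POOLERS[name](hidden[layer])
                    fh.write(struct.pack(f"<{len(vec)}f", *vec))
            # Flush after each batch instead of accumulating all outputs.
            for fh in handles.values():
                fh.flush()
    finally:
        for fh in handles.values():
            fh.close()
    return sorted(handles)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for mode in extract_modes(["MKT", "ACDE", "GG"], [0, 2], tmp):
            print(mode)
```

In this sketch, memory use is bounded by the batch size rather than by the total output size, and adding another embedding mode costs only an extra pooling step, not another forward pass.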
Availability and Implementation
PEPE is a command-line tool written in Python and published under the MIT license. The source code and documentation are available at https://github.com/csi-greifflab/pepe-cli . PEPE can also be installed from PyPI at https://pypi.org/project/pepe-cli and is deposited on Zenodo at https://zenodo.org/records/15912054 .