Protein Dimension DB: A Unified Protein Repository for Representation Learning and Functional Analysis

Abstract

Inspired by the success of large language models in areas such as natural language processing, researchers have applied similar architectures, notably the Transformer, to protein sequences. As a result, Protein Language Models (PLMs) have become important resources for diverse tasks such as predicting protein family, function, solubility, cellular location, molecular interactions and remote homology. However, the best-performing PLMs can have up to 15B parameters, so running them requires substantial computational power. Protein Dimension DB addresses this bottleneck by providing a centralized, version-controlled resource of precomputed protein embeddings, experimentally validated molecular function annotations, and taxonomic encodings. The database integrates embeddings from seven state-of-the-art PLMs, including ProtT5, ESM2, and Ankh variants, for all Swiss-Prot/UniProt proteins. The models were compared by benchmarking them on molecular function prediction. These tests showed that hybrid embeddings (e.g., Ankh Base + ProtT5) outperformed single-model approaches with only a small increase in dimensionality. Adding taxonomic encodings further improved performance by 2.9% AUPRC, demonstrating lineage-aware learning. The embeddings are provided in Parquet format, a columnar storage format optimized for machine learning workflows, so the resource eliminates GPU-dependent preprocessing and reduces storage requirements. This enables immediate use in resource-constrained environments while maintaining backward compatibility through versioned releases. All datasets are freely accessible via GitHub and HuggingFace, with unified metadata enabling applications from functional annotation to evolutionary studies. Protein Dimension DB bridges the gap between cutting-edge PLMs and practical biological research, offering researchers standardized inputs for reproducible, multi-modal protein analysis.
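The abstract does not say how hybrid embeddings are built. One plausible reading, assumed here, is that the per-protein vectors from two models are concatenated. The sketch below illustrates that idea with toy data only: the function name `build_hybrid`, the protein IDs, and the vector sizes are all hypothetical stand-ins, not taken from the database or from the real models.

```python
import numpy as np


def build_hybrid(emb_a, emb_b):
    """Concatenate per-protein vectors from two models.

    Assumption: "hybrid" means simple concatenation. Only proteins
    present in both inputs are kept.
    """
    shared = sorted(emb_a.keys() & emb_b.keys())
    return {pid: np.concatenate([emb_a[pid], emb_b[pid]]) for pid in shared}


# Toy stand-ins for two precomputed embedding sets (hypothetical IDs and sizes).
rng = np.random.default_rng(0)
ankh_like = {
    "P12345": rng.normal(size=4),
    "Q67890": rng.normal(size=4),
}
prott5_like = {
    "P12345": rng.normal(size=3),
    "Q67890": rng.normal(size=3),
    "A00001": rng.normal(size=3),  # absent from the first set, so dropped
}

hybrid = build_hybrid(ankh_like, prott5_like)

# Each hybrid vector has length 4 + 3 = 7.
print({pid: vec.shape for pid, vec in hybrid.items()})
```

With the released files, the two input dictionaries could presumably be built from tables loaded with `pandas.read_parquet`. The column names and file layout would need to be checked against the actual release.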
