ProtSpace: Protein Universe in Your Browser
Abstract
Protein Language Models (pLMs) generate per-protein embeddings that encode functional, structural, and evolutionary information, yet the relationships captured in these representations remain difficult to explore systematically. ProtSpace ( https://protspace.app ) is a web application for interactive visualization of pLM embedding spaces, enabling hypothesis generation directly in the browser without installation. Unlike traditional network-based tools that exclusively visualize amino acid sequence similarity, ProtSpace explores embedding spaces, revealing relationships often not captured by traditional comparisons. Users provide protein sequences or pre-computed embeddings through a Google Colab notebook or the Python CLI; the pipeline applies dimensionality reduction, retrieves 38 annotation types spanning UniProt, InterPro, NCBI Taxonomy, TED structural domains, and sequence-based predictors served via Biocentral, and produces a portable binary file for the browser-based viewer. WebGL-accelerated rendering supports interactive exploration of over 570,000 proteins. Distinctive features include per-point pie charts for multi-label annotations and integrated 3D structure viewing through AlphaFold2 predictions. All computation happens on the user’s machine, ensuring data privacy. We demonstrate the utility of ProtSpace through a progressive zoom-in across biological scales: from global proteome organization of Swiss-Prot, through cross-species comparison revealing conserved and lineage-specific families, to functional hypothesis generation within the beta-lactamase superfamily. ProtSpace is freely available at https://protspace.app under the Apache 2.0 license.
Key points
ProtSpace is a free, open-source web application that visualizes protein language model (pLM) embeddings as interactive maps, scaling to 570,000 proteins entirely client-side.
A zero-installation Google Colab notebook and a Python CLI prepare visualization-ready bundles from FASTA files, UniProt queries, or pre-computed HDF5 embeddings, automatically retrieving 38 annotation types from five sources (UniProt, InterPro, NCBI Taxonomy, TED structural domains, and Biocentral sequence predictors) alongside custom CSV metadata.
Application examples demonstrate that embedding visualizations generate testable biological hypotheses at multiple scales, from proteome-wide organization through species-level comparison to family-level functional discovery, and that these visualizations complement traditional sequence-based analyses.
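
To make the preparation step concrete, the following is a minimal sketch of reducing per-protein embeddings to 2D coordinates and attaching one annotation label per point. It is not ProtSpace's code or API: PCA is used here only as a stand-in, since this section does not name the dimensionality reduction methods, and the protein identifiers, family labels, embedding size, and the `pca_project` helper are all hypothetical.

```python
import numpy as np


def pca_project(embeddings: np.ndarray, n_components: int = 2) -> np.ndarray:
    """Project row-wise embeddings onto their top principal components.

    Hypothetical helper for illustration; not part of ProtSpace.
    """
    # Center each embedding dimension so the SVD captures variance, not the mean.
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


# Toy stand-in for pLM output: 6 made-up proteins with 8-dimensional embeddings,
# drawn as two well-separated groups so the projection has visible structure.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=0.1, size=(3, 8))
group_b = rng.normal(loc=5.0, scale=0.1, size=(3, 8))
embeddings = np.vstack([group_a, group_b])

# Hypothetical identifiers and a single annotation column (e.g. from a CSV file).
protein_ids = ["protA1", "protA2", "protA3", "protB1", "protB2", "protB3"]
annotations = {
    pid: ("family_A" if pid.startswith("protA") else "family_B")
    for pid in protein_ids
}

# One record per point: 2D position plus the label used to color it.
coords = pca_project(embeddings)
points = [
    {"id": pid, "x": float(x), "y": float(y), "label": annotations[pid]}
    for pid, (x, y) in zip(protein_ids, coords)
]

for point in points:
    print(point)
```

In this toy input the two groups fall on opposite sides of the first axis, which is the kind of cluster structure a viewer would color by annotation. Real pLM embeddings typically have hundreds to thousands of dimensions, and nonlinear methods may separate families that PCA does not.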