ProtSpace: a tool for visualizing protein space

Abstract

Protein language models (pLMs) generate high-dimensional representations of proteins, so-called embeddings, that capture complex information stored in the set of evolved sequences. Interpreting these embeddings remains an important challenge. ProtSpace provides one solution through an open-source Python package that visualizes protein embeddings interactively in 2D and 3D. Combining the embedding space with a view of protein 3D structure helps discover functional patterns that traditional sequence analysis easily misses.
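The idea of placing high-dimensional embeddings on a 2D plane can be illustrated with a minimal sketch. It is not ProtSpace's API, and the abstract does not name the projection method. Principal component analysis (computed here by power iteration in plain Python) is swapped in as an assumption, and the function names and toy vectors are hypothetical.

```python
import math
import random


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _center(rows):
    """Subtract the per-dimension mean from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def _top_component(rows, iters=200, seed=0):
    """Estimate the leading principal direction by power iteration."""
    rng = random.Random(seed)
    d = len(rows[0])
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(_dot(v, v))
    v = [x / norm for x in v]
    for _ in range(iters):
        # Apply X^T X to v without forming the covariance matrix.
        scores = [_dot(r, v) for r in rows]
        w = [sum(s * r[j] for s, r in zip(scores, rows)) for j in range(d)]
        norm = math.sqrt(_dot(w, w))
        if norm == 0.0:
            break  # no variance left in the data
        v = [x / norm for x in w]
    return v


def project_2d(embeddings):
    """Return one (x, y) pair per embedding using the top two components."""
    rows = _center(embeddings)
    axes = []
    for _ in range(2):
        v = _top_component(rows)
        scores = [_dot(r, v) for r in rows]
        axes.append(scores)
        # Deflate: remove the found direction before searching for the next.
        rows = [[x - s * vj for x, vj in zip(r, v)] for r, s in zip(rows, scores)]
    return list(zip(axes[0], axes[1]))


# Hypothetical 4-dimensional "embeddings" forming two groups of three.
toy_embeddings = [
    [0.0, 0.1, 0.0, 0.2],
    [0.2, 0.0, 0.1, 0.0],
    [0.1, 0.2, 0.0, 0.1],
    [10.0, 10.1, 0.1, 0.0],
    [10.2, 9.9, 0.0, 0.1],
    [9.9, 10.0, 0.2, 0.0],
]
coords = project_2d(toy_embeddings)
for x, y in coords:
    print(f"{x:8.3f} {y:8.3f}")
```

Real pLM embeddings typically have hundreds or thousands of dimensions, so a numerical library would be the practical choice. The sketch only shows the shape of the computation: vectors in, one 2D point per protein out.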

We present two examples to showcase ProtSpace. First, investigations of phage data sets showed distinct clusters of major functional groups and a mixed region, possibly suggesting bias in the protein sequences currently used to train pLMs. Second, the analysis of venom proteins revealed unexpected convergent evolution between scorpion and snake toxins; this challenges existing toxin family classifications and adds evidence refuting the aculeatoxin family hypothesis.

ProtSpace is freely available as a pip-installable Python package (source code & documentation) with examples on GitHub (https://github.com/tsenoner/protspace) and as a web interface (https://protspace.rostlab.org). The platform enables seamless collaboration through portable JSON session files.

Article activity feed

  1. own

    We recently released a Snakemake pipeline, called ProteinCartography, for exploring protein structural spaces generated using TM-score similarity matrices. It's cool to see your convergent solution to the problem of visualizing protein similarity applied to embeddings!