A novel method to select Reference Proteomes in UniProt

Pedro Raposo
Juan Sebastian Martinez Marin
Gyuri Kim
Giuseppe Insana
Dushyanth Jyothi
Jie Luo
Tanushree Tunstall
UniProt Consortium
Sandra Orchard
Martin Steinegger
Maria Martin

Abstract

Motivation

The ongoing revolution in genome sequencing is delivering an unprecedented number of genome assemblies to global repositories, resulting in an overwhelming amount of data imported to UniProt in the form of proteomes. To manage this growth sustainably, there is a need for a systematic workflow to select the best proteomes.

Results

We propose a novel pipeline for cellular organisms to select the best Reference Proteomes, i.e. those that best represent the protein space of a species. The pipeline uses a clustering algorithm based on MMseqs2 to select the minimum number of Reference Proteomes whilst maximising the representation of the protein space for each species. Additionally, we aligned our viral Reference Proteomes with the exemplar genome set defined by the International Committee on Taxonomy of Viruses. Because this method ensures that every species is represented by at least one Reference Proteome, the number of Reference Proteomes in the UniProt Knowledgebase increased by 36%, covering 34% more species in the Tree of Life. The UniProt Knowledgebase will mainly retain proteins from Reference Proteomes, and this method therefore reduces the overall number of proteins by 43%, leading to a more concise yet representative knowledgebase.
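The abstract does not spell out how the minimum set of proteomes is chosen, so the following is only an illustrative sketch of the general idea, not the authors' pipeline. It assumes the proteins of one species have already been grouped into clusters (for example by a prior MMseqs2 run) and applies a standard greedy set-cover heuristic, which approximates rather than guarantees the true minimum. The function name, the proteome IDs, and the cluster IDs are all hypothetical.

```python
def select_reference_proteomes(proteome_clusters):
    """Greedy set cover over protein clusters (illustrative assumption only).

    proteome_clusters maps a hypothetical proteome ID to the set of protein
    cluster IDs observed in that proteome. Returns the proteome IDs picked,
    in selection order, so that every cluster is covered at least once.
    """
    # Every cluster seen in any proteome of the species must be represented.
    uncovered = set().union(*proteome_clusters.values())
    selected = []
    while uncovered:
        # Pick the proteome that covers the most still-uncovered clusters.
        # Iterating in sorted order makes ties resolve to the smallest ID.
        best = max(
            sorted(proteome_clusters),
            key=lambda pid: len(proteome_clusters[pid] & uncovered),
        )
        selected.append(best)
        uncovered -= proteome_clusters[best]
    return selected


# Hypothetical toy input: three proteomes of one species, four clusters.
example = {
    "UP_A": {"c1", "c2", "c3"},
    "UP_B": {"c3", "c4"},
    "UP_C": {"c1", "c2"},
}
print(select_reference_proteomes(example))  # ['UP_A', 'UP_B']
```

In this toy case UP_C adds no cluster that UP_A does not already contain, so two proteomes suffice to represent the full protein space of the species.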

Availability and Implementation

https://www.uniprot.org/proteomes

Contact

raposo@ebi.ac.uk

Supplementary information

Supplementary data are available at Bioinformatics online.

Version published to 10.64898/2026.05.12.720148 on bioRxiv
May 14, 2026

Related articles

De novo protein discovery in non-model organisms

This article has 1 author:
1. Asif Ali
This article has no evaluations. Latest version May 13, 2026

NovoTax: prokaryotic strain identification from mass spectrometry-based proteomics data

This article has 2 authors:
1. Dennis Svedberg
2. André Mateus
This article has no evaluations. Latest version Apr 6, 2026

ECLIPSE: Exploring the dark proteome of ESKAPE pathogens through the sequence similarity network of the Protein Universe Atlas

This article has 2 authors:
1. Surabhi Lata
2. Dirk W. Heinz
This article has no evaluations. Latest version Apr 1, 2026