Improved selection of canonical proteins for reference proteomes

Abstract

The "canonical" protein sets distributed by UniProt are widely used for similarity searching and for functional and structural annotation. For many investigators, the canonical sequence is the only version of a protein examined. However, higher eukaryotes often encode multiple isoforms of a protein from a single gene. For unreviewed (UniProtKB/TrEMBL) protein sequences, the longest sequence in a Gene-Centric group is chosen as canonical. This choice can create inconsistencies: orthologs that are >95% identical can be assigned canonical sequences with dramatically different lengths, which is biologically unlikely. We describe the ortho2tree pipeline, which examines Reference Proteome canonical and isoform sequences from sets of orthologous proteins, builds multiple alignments, constructs gap-distance trees, and identifies low-cost clades of isoforms with similar lengths. After examining 140,000 proteins from eight mammals in UniProtKB release 2022_05, ortho2tree proposed 7,804 canonical changes for release 2023_01, while confirming 53,434 canonicals. Gap distributions for isoforms selected by ortho2tree are similar to those in bacterial and yeast alignments, organisms unaffected by isoform selection, suggesting that ortho2tree canonicals more accurately reflect genuine biological variation. Of the changes proposed by ortho2tree, 82% agreed with MANE; of the confirmed canonicals, 92% agreed with MANE. The ortho2tree pipeline can improve canonical assignment among orthologous sequences that are more than 60% identical, a group that includes vertebrates and higher plants.
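The idea of preferring isoforms that align with few length differences can be illustrated with a toy sketch. This is not the ortho2tree implementation: it skips the multiple alignment and tree-building steps and simply tries every combination of one isoform per organism, scoring each by a summed pairwise gap distance. The function names (`gap_distance`, `pick_low_cost_isoforms`), the scoring rule, and the toy sequences are all assumptions made for illustration.

```python
from itertools import combinations, product


def gap_distance(a: str, b: str) -> int:
    """Count aligned columns where exactly one of the two sequences has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    return sum((x == "-") != (y == "-") for x, y in zip(a, b))


def pick_low_cost_isoforms(aligned):
    """Pick one isoform per organism with minimal summed pairwise gap distance.

    aligned: {organism: {isoform_id: aligned_sequence}} (hypothetical layout).
    Returns (cost, {organism: isoform_id}). Exhaustive search, so only
    suitable for small toy inputs.
    """
    organisms = sorted(aligned)
    best_cost, best_choice = None, None
    for combo in product(*(sorted(aligned[o]) for o in organisms)):
        seqs = [aligned[o][i] for o, i in zip(organisms, combo)]
        cost = sum(gap_distance(s, t) for s, t in combinations(seqs, 2))
        if best_cost is None or cost < best_cost:
            best_cost, best_choice = cost, dict(zip(organisms, combo))
    return best_cost, best_choice


# Invented, pre-aligned toy sequences ("-" marks a gap); not real proteins.
toy = {
    "human": {"H-long": "MKTAYIAKQRQISFVK", "H-short": "MKTAYIAK----SFVK"},
    "mouse": {"M-1": "MKTAYIAK----SFVK"},
    "rat": {"R-1": "MKSAYIAK----SFVK", "R-2": "MKSAY-------SFVK"},
}

cost, choice = pick_low_cost_isoforms(toy)
print(cost, choice)
```

In this toy input a longest-sequence rule would select `H-long` for human, whereas the gap-based score prefers `H-short`, which has the same length as the mouse and rat sequences it is grouped with.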
