Overcoming the widespread flaws in the annotation of vertebrate selenoprotein genes in public databases
Abstract
Selenocysteine (Sec) is a non-canonical amino acid incorporated into selenoproteins, oxidoreductase enzymes with essential roles in redox homeostasis. Sec is inserted in response to the UGA codon, which is normally interpreted as a stop codon but is recoded in selenoprotein mRNAs. Owing to this dual function of UGA, the identification of selenoprotein genes poses a challenge.
We show here that vertebrate selenoprotein genes are widely misannotated in the major public databases. In Ensembl, considered the gold standard of genomic annotation, our analysis shows that only ∼10% of selenoprotein genes are well annotated; ∼10% have no annotation at all, and ∼80% have flawed annotations that lack the Sec-encoding UGA. Only model organisms have correct selenoprotein annotations in Ensembl, which we ascribe to manual curation. At NCBI, ∼50% of selenoproteins are misannotated, mostly corresponding to families with C-terminal Sec residues.
We argue that selenoproteins must be correctly annotated in public databases, and that this must occur via automated pipelines in order to keep pace with genome sequencing. To facilitate this task, we present a new version of Selenoprofiles, a homology-based tool for the prediction of selenoproteins that can be easily deployed and produces correct predictions with accuracy comparable to manual curation.
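The dual reading of UGA described above can be illustrated with a minimal Python sketch. It is not part of Selenoprofiles and does not reproduce its method. The `translate` function, its `recode_tga` flag, and the toy sequence are hypothetical and introduced only for illustration. The sketch translates a coding sequence (in DNA form, so UGA appears as TGA) twice: once with the standard genetic code, and once reading every in-frame TGA as Sec (U).

```python
from itertools import product

# Standard genetic code, built in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(cds: str, recode_tga: bool = False) -> str:
    """Translate a DNA coding sequence up to the first stop codon.

    If recode_tga is True, every in-frame TGA is read as selenocysteine (U)
    instead of a stop. This is a deliberate simplification: real Sec
    insertion depends on additional signals in the mRNA.
    """
    cds = cds.upper()
    protein = []
    for i in range(0, len(cds) - 2, 3):
        codon = cds[i:i + 3]
        if codon == "TGA" and recode_tga:
            protein.append("U")
            continue
        aa = CODON_TABLE[codon]
        if aa == "*":  # stop codon: end translation
            break
        protein.append(aa)
    return "".join(protein)


# Toy sequence (invented for illustration): ATG GGC TGT TGA GGC TAA
toy_cds = "ATGGGCTGTTGAGGCTAA"
standard = translate(toy_cds)                   # TGA read as stop
recoded = translate(toy_cds, recode_tga=True)   # TGA read as Sec
print(standard, recoded)
```

Under the standard code the toy protein ends just before the TGA, so the Sec residue and everything downstream are lost. This is the kind of truncation that results when the Sec-encoding UGA is read as a stop.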