PSAURON: a tool for assessing protein annotation across a broad range of species
This article has been reviewed by the following groups
Listed in
- Evaluated articles (Arcadia Science)
Abstract
Evaluating the accuracy of protein-coding sequences in genome annotations is a challenging problem for which there is no broadly applicable solution. In this manuscript we introduce PSAURON (Protein Sequence Assessment Using a Reference ORF Network), a novel software tool developed to assess the quality of protein-coding gene annotations. Utilizing a machine learning model trained on a diverse dataset from over 1000 plant and animal genomes, PSAURON assigns a score to a coding DNA or protein sequence that reflects the likelihood that the sequence is a genuine protein-coding region. PSAURON scores can be used for genome-wide protein annotation assessment as well as for the rapid identification of potentially spurious annotated proteins. Validation against established benchmarks demonstrates PSAURON’s effectiveness and correlation with recognized measures of protein quality, highlighting its potential use as a general-purpose method to evaluate gene annotation. PSAURON is open source and freely available at https://github.com/salzberg-lab/PSAURON.
One-Sentence Summary
PSAURON is a machine learning-based tool for rapid assessment of protein-coding gene annotations.
Article activity feed
- "Accession numbers for all genomes can be found in Supplemental Table 1."
  I know this information is relatively easy to get from the accession numbers, but it would be nice if this supplementary table already included the species name and genome/proteome quality stats alongside each accession for easy access.