Eukan: a fully automated nuclear genome annotation pipeline for less studied and divergent eukaryotes
Abstract
Here, we introduce Eukan, a new annotation pipeline designed to deliver reliably high-quality results across a broad range of eukaryotes. First, Eukan automatically leverages experimental evidence to refine predictions: RNA-Seq coverage informs gHMM gene prediction, and intron lengths inform protein sequence alignments. Second, it builds a consensus from an empirically optimized weighting of gene models from multiple sources. Third, it runs a post-annotation routine to recover gene models that are missing from the consensus but have strong transcript support and appear to be protein-coding. We compare the results of Eukan with those of three popular, freely available pipelines (Maker, Braker, Gemoma) on 17 phylogenetically diverse haploid and diploid nuclear genomes. In addition to the commonly reported annotation accuracy statistics, we define a novel classification system for critical defects commonly observed in automated annotations. Furthermore, we develop a statistical model showing that each of the tested pipelines correctly identifies the majority of the validated “Gold Standard” gene models across the test set, but that each pipeline uniquely generates a non-negligible portion of fragmented, artificially fused, or missing gene models. Nevertheless, we demonstrate that Eukan performs consistently well where other pipelines encounter challenges, such as with compact protist genomes.
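To make the three defect types named above concrete, the following is a minimal sketch of how fragmented, artificially fused, and missing gene models might be flagged against a reference set. It is not Eukan's classification system or its statistical model. It assumes each gene is reduced to a single (start, end) span on one strand of one contig, and that any shared base counts as overlap. The function names, labels, and coordinates are all hypothetical.

```python
from typing import Dict, Tuple

# Assumption: a gene model is reduced to one closed (start, end) span
# on a single strand of a single contig.
Interval = Tuple[int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """Return True if two closed intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def classify_defects(
    gold: Dict[str, Interval], predicted: Dict[str, Interval]
) -> Dict[str, str]:
    """Label each reference gene as 'missing', 'fused', 'fragmented', or 'ok'."""
    labels = {}
    for gold_id, gold_span in gold.items():
        # Predictions that touch this reference gene.
        hits = [p for p in predicted.values() if overlaps(gold_span, p)]
        if not hits:
            labels[gold_id] = "missing"
        elif any(
            overlaps(hit, other_span)
            for hit in hits
            for other_id, other_span in gold.items()
            if other_id != gold_id
        ):
            # A prediction covering this gene also reaches a second reference gene.
            labels[gold_id] = "fused"
        elif len(hits) > 1:
            # Several separate predictions cover one reference gene.
            labels[gold_id] = "fragmented"
        else:
            labels[gold_id] = "ok"
    return labels


# Hypothetical coordinates, chosen only to trigger each label once.
gold = {
    "g1": (100, 500),
    "g2": (800, 1200),
    "g3": (1500, 1900),
    "g4": (2500, 3000),
    "g5": (3500, 4000),
}
predicted = {
    "p1": (100, 250),
    "p2": (300, 500),
    "p3": (820, 1880),
    "p4": (2500, 3000),
}
labels = classify_defects(gold, predicted)
print(labels)
```

In this toy input, g1 is split across two predictions, g2 and g3 are joined by a single prediction, g4 is recovered, and g5 has no overlapping prediction. A real comparison would also need to account for strand, exon structure, and minimum overlap thresholds, none of which are modeled here.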
Contact
Matt Sarrasin; matt.sarrasin@umontreal.ca