Eukan: a fully automated nuclear genome annotation pipeline for less studied and divergent eukaryotes

Matt Sarrasin
Gertraud Burger
B. Franz Lang

Abstract

Here, we introduce a new annotation pipeline, called Eukan, designed to deliver reliably high-quality results across a broad range of eukaryotes. First, experimental evidence is automatically leveraged to refine predictions: RNA-Seq coverage informs generalized hidden Markov model (gHMM) gene prediction, and intron lengths inform protein sequence alignments. Second, a consensus is created from an empirically optimized weighting of gene models from multiple sources. Third, Eukan runs a post-annotation routine to recover gene models that are missing from the consensus but have strong transcript support and appear to be protein-coding. We compare the results of Eukan with those of three popular, freely available pipelines (Maker, Braker, Gemoma) on 17 phylogenetically diverse haploid and diploid nuclear genomes. In addition to the commonly reported annotation accuracy statistics, we define a novel classification system of critical defects commonly observed in automated annotations. Furthermore, we developed a statistical model demonstrating that each of the tested pipelines correctly identified the majority of the validated “Gold Standard” gene models across the test set, but that each pipeline uniquely generates a non-negligible portion of fragmented, artificially fused, or missing gene models. Nevertheless, we demonstrate that Eukan performs consistently well where other pipelines encounter challenges, such as for compact protist genomes.
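
The three defect types named in the abstract (fragmented, artificially fused, missing) can be illustrated with a simple interval comparison against a reference set. The sketch below is an assumption-laden toy, not the authors' classification system or statistical model: it treats each gene as a single half-open (start, end) span on one strand of one sequence, and the function names, labels, and coordinates are hypothetical.

```python
def overlaps(a, b):
    """True if half-open intervals a=(start, end) and b=(start, end) share a base."""
    return a[0] < b[1] and b[0] < a[1]


def classify_gold_genes(gold, predicted):
    """Label each reference gene by how the predictions cover it.

    Hypothetical labels (assumed for illustration, not taken from the preprint):
      missing    - no prediction overlaps the reference gene
      fragmented - two or more predictions overlap one reference gene
      fused      - the single overlapping prediction also spans another reference gene
      exact      - one prediction with identical coordinates
      other      - one overlapping prediction with different boundaries
    """
    # For every reference gene, list the predictions that overlap it, and vice versa.
    gold_hits = {
        g: [p for p, pspan in predicted.items() if overlaps(gspan, pspan)]
        for g, gspan in gold.items()
    }
    pred_hits = {
        p: [g for g, gspan in gold.items() if overlaps(gspan, pspan)]
        for p, pspan in predicted.items()
    }

    labels = {}
    for g, hits in gold_hits.items():
        if not hits:
            labels[g] = "missing"
        elif len(hits) > 1:
            labels[g] = "fragmented"
        elif len(pred_hits[hits[0]]) > 1:
            labels[g] = "fused"
        elif predicted[hits[0]] == gold[g]:
            labels[g] = "exact"
        else:
            labels[g] = "other"
    return labels


# Made-up coordinates, chosen only to trigger each label once.
gold = {
    "g1": (100, 500),
    "g2": (600, 900),
    "g3": (1000, 1500),
    "g4": (2000, 2500),
    "g5": (3000, 3400),
}
predicted = {
    "p1": (100, 500),    # matches g1 exactly
    "p2": (600, 700),    # first piece of g2
    "p3": (750, 900),    # second piece of g2
    "p4": (1000, 2500),  # spans both g3 and g4
}

labels = classify_gold_genes(gold, predicted)
for gene, label in labels.items():
    print(gene, label)
```

A real comparison would work at the exon and coding-sequence level, respect strand and sequence identifiers, and use an interval index rather than the quadratic scan shown here.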

Contact

Matt Sarrasin; matt.sarrasin@umontreal.ca

Version published to 10.1101/2025.08.13.670088 on bioRxiv
Aug 17, 2025